Anatomy of a Search Solution

Krishna Nareddy
Windows NT Query Team
Microsoft Corporation

July 29, 1997

Introduction

The explosion of the Internet and intranets has led to enormous growth in the amount of textual information available to the masses. Any World Wide Web site with textual content needs a search solution to help its visitors find what they want with the least amount of effort. A wide array of search solutions, ranging from home-grown pattern matchers to specialized high-volume concept-based search engines, has appeared on the Web to meet that need.

Empowering your site with a search solution is as simple as spending a few hours to download and set up your free copy of Microsoft® Index Server. Before you realize it, you are the proud owner of a Web page brandishing your logo and a dialog box waiting to serve your visitors. But wait a minute! Are you making the most of it? Is it configured to be the right solution for your needs? How secure is your corpus? What if you need to scale up? What if you need to customize your solution to meet the needs of your diverse users? Heavy-duty search solutions such as Index Server and Microsoft Site Server Search (scheduled for release towards the end of 1997) are an intricate interplay of several components engineered to work seamlessly for you. The better you understand those components, the better you can mold the solutions to address your needs.

This is the first in a series of articles aimed at helping you understand and effectively deploy Microsoft’s search solutions on your Web sites and intranets. This article is designed to help you identify the various components of a search solution and to enumerate their potential features. That will help you convert your search needs into a series of feature requests. The next article in this series will describe Microsoft Index Server 2.0 (scheduled for release in October 1997) and Microsoft Site Server Search against the framework provided in this article.

Pieces of the Puzzle

Let’s start with what you have—a bunch of documents in several file formats, structured in several ways, and possibly written in several languages. This is your data. To distinguish you from those who did not read this article, call it a corpus. Your corpus is spread out on your server(s), possibly across your intranet. You need an agent to feed individual documents to the assembly line. For lack of a better term, let’s call it a (document) gatherer. Each document has its own format, so you need a (document) filter to get rid of the unnecessary bits and extract the real content. The result is a stream of characters that needs to be run through a word breaker to extract the sequence of words to index.
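The first three stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the in-memory CORPUS, the function names gather, filter_document, and break_words, and the rule that a word is a run of letters or digits are all hypothetical choices made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical in-memory corpus standing in for files spread across servers.
CORPUS = {
    "welcome.html": "<html><body><h1>Welcome</h1><p>Search our site.</p></body></html>",
    "notes.txt": "Index Server notes: filtering, word breaking.",
}


def gather(corpus):
    """Gatherer: feed documents to the assembly line one at a time."""
    for name in sorted(corpus):
        yield name, corpus[name]


class _TextExtractor(HTMLParser):
    """Collects the text between HTML tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def filter_document(name, raw):
    """Filter: drop format-specific markup and return the real content."""
    if name.endswith(".html"):
        parser = _TextExtractor()
        parser.feed(raw)
        parser.close()
        return " ".join(parser.chunks)
    return raw  # plain text needs no filtering in this sketch


def break_words(text):
    """Word breaker (assumed rule): lowercase runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pipeline(corpus):
    """Run every gathered document through the filter and the word breaker."""
    return {name: break_words(filter_document(name, raw))
            for name, raw in gather(corpus)}


words_by_document = pipeline(CORPUS)
```

A real word breaker is language-specific and considerably more careful than a regular expression; the point here is only the order of the stages.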

What should the search engine do with these words? Whatever it takes to resolve your queries as quickly as possible. The words and related information, such as the relative position of words and their frequency of occurrence, are compiled into one or more persistent structures collectively called the index. The module that handles this task is the indexer.
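One common way to organize such information is an inverted index, which maps each word to the documents that contain it. The sketch below assumes that structure and keeps it in memory, whereas a real indexer would persist it; the function name build_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to {document: [positions]}.

    The frequency of a word in a document is the length of its position list.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in documents.items():
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index


# Hypothetical documents, already run through a word breaker.
docs = {
    "a.txt": ["real", "estate", "broker"],
    "b.txt": ["mortgage", "broker", "and", "broker", "fees"],
}
index = build_index(docs)

# Positions of "broker" in each document that contains it.
broker_postings = dict(index["broker"])
# Frequency of "broker" in b.txt.
broker_frequency = len(index["broker"]["b.txt"])
```

Storing positions as well as counts is what makes phrase and proximity queries possible later, at the cost of a larger index.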

How can your users express their information need and retrieve what they want? You need a query language and a user interface to help them compose queries. Queries are transmitted to the server, where a query processor uses the index to resolve queries and arrives at a set of matches. The list of matching documents, the hit list, is presented to the user, who uses a (document) browser to peruse any of those documents, preferably in their original form.
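To make the role of the query processor concrete, here is a minimal sketch that resolves a query against a small hand-written index. It assumes the simplest possible query language (every word must appear) and ranks hits by total occurrence count; both choices, and the names INDEX and resolve, are assumptions made for illustration.

```python
# Hypothetical index: word -> {document: [positions of the word]}.
INDEX = {
    "mortgage": {"b.txt": [0], "c.txt": [4]},
    "broker": {"a.txt": [2], "b.txt": [1, 3]},
    "estate": {"a.txt": [1]},
}


def resolve(query, index):
    """Return the hit list: documents containing every query word.

    Hits are ordered by total number of occurrences, then by name.
    """
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]

    # A document matches only if it appears in every word's posting list.
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)

    def occurrences(doc_id):
        return sum(len(posting[doc_id]) for posting in postings)

    return sorted(matches, key=lambda doc_id: (-occurrences(doc_id), doc_id))


hits_both = resolve("mortgage broker", INDEX)
hits_broker = resolve("broker", INDEX)
```

Here hits_both contains only b.txt, the one document holding both words, and hits_broker lists b.txt ahead of a.txt because the word occurs there twice.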

Any system you deploy should be easy to administer and configure. Hardware and network changes, disk corruption and capacity issues, reliable operation, round-the-clock availability, and a variety of ad hoc administration issues crop up on a regular basis. The solution you deploy should be designed to address these issues with as little human intervention as possible. Different pieces have different issues. We will address them as appropriate when we detail the pieces.

Description of the Pieces

The pieces are in place. Let’s look at each one of them in detail. When you are done digesting this section, you should be able to examine your search needs and identify how to define each piece of your solution.

Where Does the Web Fit In?

We are discussing search solutions in the context of the Web, which is the big blob that glues together all the pieces of the puzzle. The Web provides a standard client/server infrastructure, and this article assumes that you have some familiarity with that infrastructure. We will not discuss any part of this infrastructure, except when it is directly related to the implementation of a search solution.

The Document Corpus

What type of documents do you need to be able to search? The search solution you deploy should be able to handle your data. The following is a list of features to help you examine various aspects of your corpus.

Gathering Documents

Depending on the location of your data, you need to employ different techniques to channel your data to the indexer. Most Web-based solutions can pull all the documents served by a single, designated Web server. Others provide a Web crawler to gather documents from the Web. Some can work with your file system. Others provide specialized solutions to pull documents from proprietary stores such as Microsoft Exchange Server.

Filtering Documents

You have various file formats and you want your search solution to extract the content that matters to you. Most documents have more than content; they also have properties such as author and creation date. You want to extract all those properties that matter to you and index them. Document filters understand specific file formats and channel your content and properties to the indexer. Preferably, your solution should be able to handle a variety of common file formats right “out of the box.” You should be able to acquire or develop filters for the rest of the file formats and be able to communicate with the indexer through a standard interface.
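The idea of a standard interface between filters and the indexer can be sketched as follows. Everything here is hypothetical and is not the interface of any Microsoft product: the DocumentFilter base class, the extract method, the FILTERS registry keyed by file extension, and the simplified mail format of "Name: value" headers followed by a blank line and the body.

```python
import os
from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Assumed common interface: every filter returns (content, properties)."""

    @abstractmethod
    def extract(self, raw):
        ...


class PlainTextFilter(DocumentFilter):
    def extract(self, raw):
        return raw, {}  # plain text carries no properties in this sketch


class MailFilter(DocumentFilter):
    """Hypothetical mail format: header lines, a blank line, then the body."""

    def extract(self, raw):
        head, _, body = raw.partition("\n\n")
        properties = {}
        for line in head.splitlines():
            key, separator, value = line.partition(":")
            if separator:
                properties[key.strip().lower()] = value.strip()
        return body, properties


# Registry of filters; new formats are supported by adding an entry.
FILTERS = {".txt": PlainTextFilter(), ".eml": MailFilter()}


def filter_file(name, raw):
    """Pick the filter for this file's extension and run it."""
    extension = os.path.splitext(name)[1].lower()
    document_filter = FILTERS.get(extension)
    if document_filter is None:
        raise ValueError(f"no filter registered for {extension!r}")
    return document_filter.extract(raw)


content, properties = filter_file(
    "hello.eml", "From: Ana\nDate: 1997-07-29\n\nMeeting at noon."
)
```

Because the indexer only ever sees the (content, properties) pair, a filter for a new file format can be added without touching the indexer.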

Recognizing Features of Your Content

Your documents are more than just a stream of text. They contain syntactic units such as words and sentences. Beyond that, they have some “meaning.” The task of a search engine is to “understand” users’ queries and find the documents that best match those queries. The first step in that direction is an attempt to understand the content.

At the very least, individual words should be accurately recognized by the word breaker. The next level is to be able to extract conceptual features such as noun phrases (for example, “United Nations”) and stems (for example, “swim” is the stem of “swimming,” “swam,” “swimmer,” and so on). Beyond that, natural language processing has not matured enough to be of much use in a general-purpose search solution. Feature recognition is language-specific, so make sure the languages you care about are supported.

Documents generally contain many very commonly used words that are of little use in distinguishing one document from another. General examples are words such as “a,” “an,” and “the.” Certain words in a domain are used too frequently to be of much use in distinguishing documents from one another—for example, the word “From” in a corpus of mail messages. Eliminating such noise words serves the dual purpose of improving computational efficiency and improving the quality of documents returned in response to a query. You should be the judge of what words are noise words in your corpus.
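The sketch below shows one way to combine a general noise word list with domain-specific candidates. Treating a word as a candidate when it appears in at least 90 percent of the documents is an assumption made for this example, not a rule from any product, and the names GENERAL_NOISE, suggest_noise_words, and remove_noise are hypothetical. The final decision remains yours.

```python
# General noise words taken from the examples in the text.
GENERAL_NOISE = {"a", "an", "the"}


def suggest_noise_words(documents, threshold=0.9):
    """Suggest words that occur in at least `threshold` of all documents."""
    if not documents:
        return set()
    document_counts = {}
    for words in documents:
        for word in set(words):  # count each word once per document
            document_counts[word] = document_counts.get(word, 0) + 1
    return {word for word, count in document_counts.items()
            if count / len(documents) >= threshold}


def remove_noise(words, noise):
    """Drop noise words while keeping the remaining words in order."""
    return [word for word in words if word not in noise]


# Hypothetical corpus of mail messages, already broken into words.
mail = [
    ["from", "ana", "the", "budget", "report"],
    ["from", "raj", "a", "meeting", "agenda"],
    ["from", "li", "the", "budget", "meeting"],
]

domain_noise = suggest_noise_words(mail)
noise = GENERAL_NOISE | domain_noise
cleaned = [remove_noise(words, noise) for words in mail]
```

In this corpus only “from” appears in every message, so it is the only domain-specific candidate; “budget” and “meeting” appear in two of three messages and are kept.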

Indexing Documents

An index is an efficient organization of the extracted features. Its purpose is to provide an efficient lookup at query time. You expect your indexing service to be reliable, to operate round-the-clock, to be amenable to easy configuration and administration, and to meet your performance requirements. These are essential features of any service you install on your servers. Therefore, this section will dwell only on features unique to the domain of search solutions.

Your users may not need all the features provided by the search solution. For example, they may never need to view or query the document modification time. Or they may never need to sort hit lists on a given property. If you need to provide only a select subset of search features, wouldn’t it be efficient to have the indexer work to provide only those features?

The Query Language

A query language is the most direct link between the end user and the search solution. It should be sufficiently expressive to allow users to convey a wide variety of information needs. It should be intuitive enough to allow your least experienced user to frame the right queries without feeling overwhelmed by the syntax and semantics of the language. And it should allow your users to frame precise queries.

Query languages are generally closer to computer languages than they are to human languages. So you may often find yourself having to build layers between the query processor’s native language and your users. A well-designed query language should facilitate easy translation between your users’ needs and its syntactic idiosyncrasies.

Understanding different types of information needs and how they translate to queries will help you map your user needs to the query language features provided by the search solution. Because of the range of possible information needs, the following list is not exhaustive.

The query language should be flexible enough to express these needs. It is easy to achieve high recall, that is, to retrieve most of the relevant documents: you just have to throw a lot of words and phrases into your query. But that doesn’t help if all your relevant documents are buried in a sea of irrelevant ones. Therefore, you should also strive to attain a high level of precision, so that most of the documents returned are relevant. You need to know what you want, and the query language should allow you to express what you know. For example, if you know that your target documents should contain the phrases “mortgage broker” and “real estate,” but that the first phrase is more important than the second one, you should be able to express that in your query language to improve your precision.
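Both ideas in this paragraph can be illustrated with a little arithmetic. The sketch uses the standard definitions (precision is the fraction of returned documents that are relevant; recall is the fraction of relevant documents that are returned) and a deliberately naive weighted score that multiplies each phrase's occurrence count by its weight. The weights, document names, and function names are all hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def weighted_score(text, weighted_phrases):
    """Naive score: sum of weight * occurrence count for each phrase."""
    lowered = text.lower()
    return sum(weight * lowered.count(phrase)
               for phrase, weight in weighted_phrases.items())


relevant = {"d1", "d2"}

# A broad query finds everything relevant but buries it among other hits.
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)
# A sharper query returns the same relevant documents with less clutter.
narrow = precision_recall(["d1", "d2", "d3", "d4"], relevant)

# "mortgage broker" is declared twice as important as "real estate".
query = {"mortgage broker": 2.0, "real estate": 1.0}
score_broker = weighted_score("A mortgage broker explains mortgage broker fees.", query)
score_estate = weighted_score("Real estate listings and real estate news.", query)
```

Both queries reach a recall of 1.0, but precision doubles from 0.25 to 0.5 with the sharper query. With the weights shown, the document that mentions “mortgage broker” twice scores 4.0, and the one that mentions “real estate” twice scores 2.0.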

The Query Processor

Like the indexer, the query processor should be reliable, operate round-the-clock, be amenable to easy configuration and administration, and be able to meet your performance requirements. This section will dwell only on features unique to the domain of query processors.

The Hit List

The hit list contains the documents matching the query. It should contain all the information needed to help the user determine whether a document holds any promise of being relevant. Once that decision is made, the hit list usually serves as a launch pad for the document browser. Note that a hit list is not necessarily a list in the strict sense of the word, although that is the most commonly used format. It is any user interface (UI) that directly or indirectly helps the user walk through the available matches. The following is a list of features you might expect from such a UI.

Browsing the Documents

Whenever possible, documents should be perused using viewers that can render a document with full fidelity. The nontextual context and the layout could provide important clues that help the reader quickly judge its relevance or make the best use of the information presented in the document.

Microsoft’s Web-based Search Solutions

Now that you have an understanding of the anatomy of a search solution, you can analyze your search needs and determine how well a given search solution stacks up against them. Your understanding will also help you make the most of the solutions you deploy. Future articles in this series will provide detailed overviews of Microsoft’s search solutions. The overview articles will be followed by in-depth technical articles to help administrators, solution providers, and developers make the most of the features provided by these products.

Microsoft’s full-text search solutions such as Index Server, Site Server, and Indexing Service may incorporate some variants of the features discussed in this article. Readers are advised not to assume support for the above features in full-text search solutions provided by Microsoft or other vendors. Please consult product documentation or product support to determine availability of a feature or the feasibility of enhancing your installation to support that feature.