Full-Text Search Concepts

There are four major aspects of supporting full-text retrieval of plain-text data on a database:

Managing the definition of the tables and columns that are registered for full-text searches.
Indexing the data in registered columns. The indexing process scans the character streams, determines the word boundaries (this is called word breaking), removes all noise words (this is called stop words), and then populates a full-text index with the remaining words.
Issuing queries against registered columns for populated full-text indexes.
Ensuring that subsequent changes to the data in registered columns gets propagated to the index engine to keep the full-text indexes synchronized.

The underlying design principle for the indexing, querying, and synchronizing processes is the presence of a full-text unique key column (or single-column primary key) on all tables registered for full-text searches. The full-text index contains an entry for the non noise words in each row together with the value of the key column for each row.

When processing a full-text search, the search engine returns to SQL Server the key values of the rows that match the search criteria. The following example illustrates what is happening.

Suppose you have a SciFi table with the following columns, where the Book_No column is the primary key column.

Book No	Writer	Title
A025	Asimov	Foundation’s Edge
A027	Asimov	Foundation and Empire
C011	Clarke	Childhood’s End
V109	Verne	Mysterious Island

Next, suppose you have a full-text retrieval query where you want to find the book titles having the word “Foundation” in them. When the SQL Server relational engine encounters a full-text retrieval predicate, it calls the full-text search component to retrieve the values of Book_No that satisfy the text retrieval filter condition. In this case, the values A025 and A027 are returned. The relational engine then uses this information, together with other information under its direct control, to respond to the query.

Unlike classic relational database indexes, traditional full-text indexes are not modified instantly when values in full-text registered columns are updated, when rows are added to full-text registered tables, or when rows are deleted from full-text registered tables. Rather, full-text indexes usually are repopulated asynchronously. There are two reasons for this:

It typically takes significantly more time to update a full-text index than a classic index.
Full-text searches usually are fuzzy by nature and so do not need to be as precise as classic searches.

During repopulation, the unique key column values are passed to the index engine to identify those items that need to be reindexed. For example, if the title associated with V109 gets changed to “Mystery Island,” then the index should be modified to reflect this new value.

The full-text administration process starts by designating a table and its columns of interest for full-text search. Either GUIs and wizards or built-in stored procedures are used first to register tables and columns as eligible for full-text search. Then, a separate request (again by means of a GUI and wizards or stored procedures) is issued to populate the full-text indexes.

The result is that the underlying index engine gets invoked and asynchronous index population begins. Full-text indexing tracks which significant words are used and where they are located. For example, a full-text index might indicate that the word “Microsoft” is found at word number 423 and word number 982 in the Abstract column of the DevTools table for the row associated with a ProductID of 6. This index structure supports an efficient search for all items containing indexed words as well as advanced search operations, such as phrase searches and proximity searches. (An example of a phrase search is looking for “white elephant,” where “white” is followed by “elephant.” An example of a proximity search is looking for “big” and “house” where “big” occurs near “house.”)

To prevent the full-text index from becoming bloated, noise words such as “a,” “and,” and “the” are ignored. Noise word lists for many languages are available in the directory \Mssql\Ftdata\Sqlserver\Config. And the set of supported languages is growing. The choice of a particular noise word list is based on the language setting of the database server. These noise word lists should be sufficient for most operations, but they can be modified by using a regular text editor. For example, a computer company can add the word “computer” to its noise word list.

Extensions to Microsoft SQL Server Transact-SQL were made to allow users to pose full-text queries. The syntax of the CONTAINS and FREETEXT predicates, already supported for full-text search in both the Microsoft OLE DB Provider for Indexing Services 2.0 and the Microsoft OLE DB Provider for Site Server 3.0 Search, were used.

The CONTAINS predicate is used to search for:

A word or phrase.
The prefix of a word or phrase.
A word or phrase that is near another.
A word that is an inflectional form of another (for example, “drive” is the inflectional stem of “drives,” “drove,” “driving,” and “driven”).
A set of words or phrases, each of which is assigned a different weighting.

The FREETEXT predicate is a basic form of natural language query. Any text, including words, phrases, or sentences, can be specified in the query. The search engine matches values that reflect the meaning, rather than the exact wording, of the query.

The SQL Server relational engine recognizes the CONTAINS and FREETEXT predicates and performs some minimal syntax and semantic checking, such as ensuring that the column referenced in the predicate has been registered for full-text searches. During query execution, a full-text predicate and other relevant information are passed to the full-text search component. After further syntax and semantic validation, the search engine is invoked and returns the set of unique key values identifying those rows in the table that satisfy the full-text search condition.

There currently are two mechanisms available to an administrator for keeping a full-text index synchronized with changes to the data in its table:

Scheduled repopulation
On-demand repopulation

Both GUIs and stored procedures are provided for repopulation tasks. If a table has a column with a timestamp data type, it is possible to request an incremental repopulation where only changes to the table since the last population participate in the repopulation. If a table does not have a timestamp column, then only a complete repopulation is possible.

An alternative to automatic update of full-text indexes based on log activity is a candidate for a future release of SQL Server.