Microsoft Internet Information Server (IIS) 4.0 and Index Services 2.0 (both part of the Microsoft Windows NT 4.0 Option Pack) combine to provide property filtering and searching as well as full-text indexing and searching of file data. Windows NT 4.01 Service Pack 4 must be installed for proper interaction between Index Services and SQL Server.
All of these capabilities are available completely independent of SQL Server. In particular, there are at least two ways to search that do not use SQL Server. One of these employs an Index Services specific query language; the other supports SQL-based queries within ActiveX Data Objects (ADO). Neither alternative is discussed in this chapter except to say that SQL used in ADO queries is consistent with the SQL extensions outlined here. This chapter discusses property filtering and full-text indexing.
Index Services provides filters for several file formats, including Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and HTML. Filters are also available for plain–text documents. Filters can be written by customers and third-party vendors for other formats such as Adobe Acrobat. Filters provide support for non-plain–text documents and capture property values both from the file content and about the files. Assuming that every file is a document, examples of properties include the document’s title, the number of its pages with notes (for PowerPoint documents), the number of paragraphs it contains, the date and time when it was last accessed, and its physical path. A full list of file properties is provided later in this chapter. For more information, see your Windows NT documentation.
Full-text indexes are created by scanning file content. The process consists of tracking which significant words are used and where they are located. For example, a full-text index might indicate that the word “Canada” is found at word number 227, word number 473, and word number 1,017 in a given file. This index structure supports an efficient search for all items containing indexed words, as well as advanced search operations such as phrase searches and proximity searches. An example of a phrase search is looking for “white elephant,” where “white” is immediately followed by “elephant.” An example of a proximity search is looking for “big,” where “big” occurs near “house.”
To prevent the full-text index from becoming bloated, noise words (words that are too common to expedite the search, such as “a,” “and,” “the,” and “therefore”) are ignored. SQL Server setup for version 7.0 installs noise-word lists for many languages in directory \Mssql\Ftdata\Sqlserver\Config. And the set of supported languages is growing. The choice of a particular noise-word list is based on the language of the material that is file-format dependent during the filtering process. Some files set the language per section or paragraph; some specify it for the entire document. These noise-word lists should be sufficient for most operations, but they can be modified.
Index Services stores indexes and property values in a text search catalog. By default, a text search catalog named Web is created when Index Services is installed. It is possible to specify more than one text search catalog, but this chapter confines itself to the use of Web and does not discuss the process used to create additional text search catalogs.
A given text search catalog references one or more IIS virtual directories. A virtual directory references one or more physical directories and, optionally, other virtual directories. When a real file is linked to the text search catalog by means of a virtual directory, Index Services is notified of any new files that need to be indexed and begins filtering and indexing the properties and content associated with those files. Index Services is also notified of any subsequent changes to the files and will refilter and reindex any updated files.
The following screen shot shows a directory for the default Web site on the computer Pellow-2. Notice that /Corpus is listed in both panes. /Corpus is the alias of a virtual directory that, in turn, points to the real directory, D:\Corpus. All of the files in D:\Corpus have their properties and full-text indexes maintained in the Web text search catalog.
The following screen shot demonstrates how the Virtual Directory Wizard can be invoked to insert new virtual directories into the tree.
This final screen shot demonstrates the result of using the Virtual Directory Wizard to add the /SQL_standards virtual directory, which contains two virtual directories and one real directory.