Chapter 23: Textual Searches on File Data

Microsoft SQL Server version 7.0 introduces facilities that support textual queries on data in SQL Server as well as on data in the file system. This chapter describes searches on data in the file system. Several products and features have been brought together to support this capability, including SQL Server distributed queries, Windows NT Server built-in Web server, Microsoft Internet Information Services (IIS) version 4.0, and Microsoft Index Services version 2.0. This chapter is for those who are familiar with SQL Server but not necessarily familiar with its textual search features and products.

This chapter introduces the types of textual searches supported by SQL Server and illustrates the roles performed by both the IIS and Index Services technologies. It also introduces SQL Server distributed queries and describes their use in the processing of textual searches. It then addresses the question of how SQL queries can be written against the file system when there are no tables in the file system. Finally, the SQL extensions to support full-text queries are described, and several examples, including examples that combine file data with database data, are provided.

This chapter provides an overview of how to incorporate file data into SQL queries and of how the various components of SQL Server and other software interact to provide support for such queries.

A large portion of digitally stored information is in the form of unstructured data, primarily text, which is stored in the file system. This data is often related to data within the database, and there are requirements to support searches that include both sources. However, it is often inappropriate to import this data into the file system. SQL Server 7.0 distributed queries, coupled with extensions to the SQL language, make it possible to write such queries without the data. This capability is called file content search.

There are two major types of textual searches:

Property
This search technology first applies filters to documents to extract properties such as author, subject, type, word count, printed page count, and time last written, and then issues searches against those properties.
Full-text
This search technology first creates indexes of all non-noise words in the documents, and then uses these indexes to support linguistic searches and proximity searches (searches for words or phrases that are "close" to each other).

File content search supports both these types of textual searches and couples them with the ability to incorporate such searches into a query that includes relational operations against database data. For example, the following search selects the names, sizes, and authors of all Microsoft Word files on drive D that contain the phrase "SQL Server" in proximity to the word "text." It then joins this result with the writers table to obtain the author's citizenship.

SELECT Q.FileName, Q.Size, Q.DocAuthor, W.Citizenship 

FROM OpenQuery(MyLinkedServer, 

               'SELECT FileName, Size, DocAuthor

                FROM SCOPE('' "D:\" '')

                WHERE CONTAINS(''"SQL Server"

                                 NEAR() text'')

                AND FileName LIKE ''%.doc%'' '

              ) AS Q,

     writers AS W

WHERE Q.DocAuthor = W.writer_name

  

File content search relies on the Microsoft OLE DB Provider for Index Services. It also relies on Index Services for support of underlying filters and full-text indexes.

Notice that the OLE DB Provider gives Index Services 2.0 the ability to support SQL queries against data in the file system independent of SQL Server. The core extensions SQL that support such queries are the same in Index Services and SQL Server.

This chapter has the following purposes:

Only a small percentage of the facilities available in the supporting products are introduced in this chapter, and even when a facility is discussed, many of its options are not discussed. For more information, see SQL Server Books Online and your Windows NT documentation.