Defined Tables

At least one defined table must be specified in every SQL query. A defined table is a table in which the number and types of columns are either known in advance or specified as part of the query. A relational database usually contains a number of defined tables, and metadata about the columns of these tables is stored in a schema.

The collection of files in a file system does not generally have a predefined structure. The closest thing to columns are the properties of a file, but there is no deterministic set of properties for files. The closest thing to a row is a file, but files are usually not grouped in a homogeneous collection akin to rows in a table. Thus, in this case, the table concept is unclear, SELECT * is meaningless, and both the rows and columns are unbounded. Another way of looking at this is that a file system effectively has a universal schema consisting of every possible file property, both known and unknown.

Index Services solves this problem by providing the SCOPE function as a means of defining the set of rows that makes up a virtual table and by providing file properties that substitute for columns.

SCOPE Function

The SCOPE function is specified in the FROM clause of the Index Services query. It specifies the set of files that make up a virtual table.

The syntax of the SCOPE function, simplified for this chapter, is as follows:

These are the definitions of the syntax terminology:

()
The virtual table consists of all the files that have been registered in the text search catalog and data source for the linked server specified in the OPENQUERY() function.
DEEP TRAVERSAL OF
The virtual table consists of all the files in the directory at the specified path or virtual directory as well as all the files in all the subdirectories (to any level) that are considered to be part of the virtual table. DEEP is the default.
SHALLOW TRAVERSAL OF
The virtual table consists only of the files in the top-level directory at the specified path or virtual directory that are considered to be part of the virtual table.
physical_path
A path to a real directory. If a real directory is specified, the filtering and indexing is done as part of the query processing, which can be time-consuming.
virtual_directory
The alias (or chain of aliases) assigned to a virtual directory that has been registered in the text search catalog and data source for the linked server specified in the OPENQUERY() function. In this case, the filtering and indexing will probably already have been done and, thus, the query will be much faster than when a physical path is specified.

File Properties

Index Services filters and maintains an excess of 50 file properties. All of these can be specified in text file search queries. From the perspective of writing a SELECT statement, there are three types of file properties:

This table outlines some of the file properties.


Property name

SQL Data Type

Description
Use in ORDER BY clause
Use in select list
Access datetime Most recent date and time that the file was accessed. Yes Yes
Characterization nvarchar or ntext Abstract of the contents of the file. In Index Services 2.0, this is usually the first paragraph or first section of a document. In future releases, it is planned to be a real summary. Yes
Contents nvarchar or ntext Main contents of the file.
Create datetime Date and time that the file was created. Yes Yes
Directory nvarchar Physical path to the file, not including the file name. Yes Yes
DocAuthor nvarchar Document author. Yes Yes
DocComments nvarchar Comments about the document. Yes Yes
DocLastAuthor nvarchar Most recent user that edited the document. Yes Yes
DocLastPrinted datetime Date and time that the document was last printed. Yes
DocPageCount integer Number of pages in the document. Yes
DocPartTitles array of varchar Names of the document parts:
· in Microsoft PowerPoint (slide titles).
· in Microsoft Excel (spreadsheets).
· in Microsoft Word (documents).
DocSubject nvarchar Subject of the document. Yes Yes
DocTitle nvarchar Title of the document. Yes Yes
DocWordCount integer Number of words in the document. Yes
FileIndex decimal (19,0) Unique identifier of the file. Yes Yes
FileName nvarchar Name of the file. Yes Yes
HitCount integer Number of words matching the query. Yes Yes
Path nvarchar Full physical path to the file, including the file name. Yes Yes
Rank integer Value from 0 through 1,000, indicating how well this row matches the selection criteria. Yes Yes
Size decimal (19, 0) Size of the file (in bytes). Yes Yes
Write datetime Most recent date and time that the file was written. Yes Yes

Customers and third-party vendors can write filters to add to this set of file properties. They can also add properties, for example, by adding tags to an HTML document. In addition, to permit the query and retrieval of such user-defined file properties, the SQL extensions to Index Services include support for a SET statement that allows the specification of new file property names and their associated types.

Virtual Tables

It is possible to specify a query with the equivalent of a table in the file system, resulting in the select list and the FROM clause. For other parts of a SELECT statement, the properties can be used in place of columns in the WHERE and ORDER BY clauses. However, the GROUP BY and HAVING clauses are not supported by OLE DB Provider for Index Services. The following examples illustrate the use of all supported clauses.

The following query selects the full physical path and the file creation timestamp of all files in the /SQL-standards virtual directory and all its subdirectories, where the document contains the phrase “overloaded function.”

SELECT * 

FROM OpenQuery(FileSystem, 

               'SELECT Path, Create

                FROM SCOPE('' "/SQL-standards" '')

                WHERE CONTAINS(Contents, '' "overloaded function" '') 

               '

              )  

  

The next query is similar to the previous one, except that only files directly in the /SQL-standards virtual directory are considered.

SELECT * 

FROM OpenQuery(FileSystem, 

               'SELECT Path, Create

                FROM SCOPE('' SHALLOW TRAVERSAL OF "/SQL-standards" '')

                WHERE CONTAINS(Contents, '' "overloaded function" '')   

               ' 

              )  

  

The next query is also similar, except that only files directly in the /SQL3 virtual subdirectory are considered.

SELECT * 
FROM OpenQuery(FileSystem,

               'SELECT Path, Create

                FROM SCOPE('' "/SQL-standards/SQL3" '')

                WHERE CONTAINS(Contents, '' "overloaded function" '')  

               ' 

              )  

  

The final query selects author, title, subject, and file name of documents in all files that are either in the /Corpus virtual directory and its subdirectories or in the \Temp directory on drive C, where the document is at least 5,000 words, the author is either Wendy Vasse or Anas Abbar, and the rows representing those documents with the most pages are ordered highest.

SELECT * 

FROM OpenQuery(FileSystem, 

               'SELECT DocAuthor, DocTitle, DocSubject, FileName 

                FROM SCOPE('' "/corpus" '',

                           '' "C:\temp" '' )

                WHERE DocWordCount >= 5000 AND

                      ( DocAuthor = ''Wendy Vasse'' OR

                        DocAuthor = ''Anas Abbar'' )

                ORDER BY DocPageCount DESC                    

               ' 

              )  

  

Path Through the Components

The following figure contains the code for a typical query and a diagram that illustrates the part played by each component in processing the query.

These items relate to the numbered items in the previous diagram.

  1. The query is submitted to SQL Server, which separates the distributed portion of the query (the OPENQUERY() function) and forwards it to the SQL Server Distributed Query Handler.
  2. The Distributed Query Handler passes the inner SELECT statement specified in the OPENQUERY() function to the to the OLE DB Provider for Index Services (MSIDXS) that has been linked to FileSystem.
  3. MSIDXS parses the inner SELECT statement, then issues the appropriate commands to Index Services.
  4. Index Services forms a virtual table by combining the files in the Web text search catalog’s /Corpus virtual directory with the files in the C:\Temp path, selecting only files that contain 5,000 words or more. The results are returned as a rowset to MSIDXS.
  5. MSIDXS returns the rowset to the Distributed Query Handler.
  6. The Distributed Query Handler returns the rowset as a table to its calling component within SQL Server. This component, in turn, fulfills the rest of the query by joining this table with the writers table, and then returns the result set to the originator of the query.