At least one defined table must be specified in every SQL query. A defined table is a table in which the number and types of columns are either known in advance or specified as part of the query. A relational database usually contains a number of defined tables, and metadata about the columns of these tables is stored in a schema.
The collection of files in a file system does not generally have a predefined structure. The closest thing to columns are the properties of a file, but there is no deterministic set of properties for files. The closest thing to a row is a file, but files are usually not grouped in a homogeneous collection akin to rows in a table. Thus, in this case, the table concept is unclear, SELECT * is meaningless, and both the rows and columns are unbounded. Another way of looking at this is that a file system effectively has a universal schema consisting of every possible file property, both known and unknown.
Index Services solves this problem by providing the SCOPE function as a means of defining the set of rows that makes up a virtual table and by providing file properties that substitute for columns.
The SCOPE function is specified in the FROM clause of the Index Services query. It specifies the set of files that make up a virtual table.
The syntax of the SCOPE function, simplified for this chapter, is as follows:
These are the definitions of the syntax terminology:
Index Services filters and maintains an excess of 50 file properties. All of these can be specified in text file search queries. From the perspective of writing a SELECT statement, there are three types of file properties:
This table outlines some of the file properties.
Property name |
SQL Data Type |
Description |
Use in ORDER BY clause | Use in select list |
---|---|---|---|---|
Access | datetime | Most recent date and time that the file was accessed. | Yes | Yes |
Characterization | nvarchar or ntext | Abstract of the contents of the file. In Index Services 2.0, this is usually the first paragraph or first section of a document. In future releases, it is planned to be a real summary. | – | Yes |
Contents | nvarchar or ntext | Main contents of the file. | – | – |
Create | datetime | Date and time that the file was created. | Yes | Yes |
Directory | nvarchar | Physical path to the file, not including the file name. | Yes | Yes |
DocAuthor | nvarchar | Document author. | Yes | Yes |
DocComments | nvarchar | Comments about the document. | Yes | Yes |
DocLastAuthor | nvarchar | Most recent user that edited the document. | Yes | Yes |
DocLastPrinted | datetime | Date and time that the document was last printed. | Yes | – |
DocPageCount | integer | Number of pages in the document. | Yes | – |
DocPartTitles | array of varchar | Names of the document parts: · in Microsoft PowerPoint (slide titles). · in Microsoft Excel (spreadsheets). · in Microsoft Word (documents). |
– | – |
DocSubject | nvarchar | Subject of the document. | Yes | Yes |
DocTitle | nvarchar | Title of the document. | Yes | Yes |
DocWordCount | integer | Number of words in the document. | Yes | – |
FileIndex | decimal (19,0) | Unique identifier of the file. | Yes | Yes |
FileName | nvarchar | Name of the file. | Yes | Yes |
HitCount | integer | Number of words matching the query. | Yes | Yes |
Path | nvarchar | Full physical path to the file, including the file name. | Yes | Yes |
Rank | integer | Value from 0 through 1,000, indicating how well this row matches the selection criteria. | Yes | Yes |
Size | decimal (19, 0) | Size of the file (in bytes). | Yes | Yes |
Write | datetime | Most recent date and time that the file was written. | Yes | Yes |
Customers and third-party vendors can write filters to add to this set of file properties. They can also add properties, for example, by adding tags to an HTML document. In addition, to permit the query and retrieval of such user-defined file properties, the SQL extensions to Index Services include support for a SET statement that allows the specification of new file property names and their associated types.
It is possible to specify a query with the equivalent of a table in the file system, resulting in the select list and the FROM clause. For other parts of a SELECT statement, the properties can be used in place of columns in the WHERE and ORDER BY clauses. However, the GROUP BY and HAVING clauses are not supported by OLE DB Provider for Index Services. The following examples illustrate the use of all supported clauses.
The following query selects the full physical path and the file creation timestamp of all files in the /SQL-standards virtual directory and all its subdirectories, where the document contains the phrase “overloaded function.”
SELECT *
FROM OpenQuery(FileSystem,
'SELECT Path, Create
FROM SCOPE('' "/SQL-standards" '')
WHERE CONTAINS(Contents, '' "overloaded function" '')
'
)
The next query is similar to the previous one, except that only files directly in the /SQL-standards virtual directory are considered.
SELECT *
FROM OpenQuery(FileSystem,
'SELECT Path, Create
FROM SCOPE('' SHALLOW TRAVERSAL OF "/SQL-standards" '')
WHERE CONTAINS(Contents, '' "overloaded function" '')
'
)
The next query is also similar, except that only files directly in the /SQL3 virtual subdirectory are considered.
SELECT *
FROM OpenQuery(FileSystem,
'SELECT Path, Create
FROM SCOPE('' "/SQL-standards/SQL3" '')
WHERE CONTAINS(Contents, '' "overloaded function" '')
'
)
The final query selects author, title, subject, and file name of documents in all files that are either in the /Corpus virtual directory and its subdirectories or in the \Temp directory on drive C, where the document is at least 5,000 words, the author is either Wendy Vasse or Anas Abbar, and the rows representing those documents with the most pages are ordered highest.
SELECT *
FROM OpenQuery(FileSystem,
'SELECT DocAuthor, DocTitle, DocSubject, FileName
FROM SCOPE('' "/corpus" '',
'' "C:\temp" '' )
WHERE DocWordCount >= 5000 AND
( DocAuthor = ''Wendy Vasse'' OR
DocAuthor = ''Anas Abbar'' )
ORDER BY DocPageCount DESC
'
)
The following figure contains the code for a typical query and a diagram that illustrates the part played by each component in processing the query.
These items relate to the numbered items in the previous diagram.