A Catalog of Success

Rob Macdonald

Curious about what Microsoft's Index Server can do for you? Let Rob explain. He'll show you the differences among three approaches by creating the same Index Server application three different ways: with a standard VB5 program, an Active Server Page using VBScript, and a VB5 ActiveX Document application.

I'm going to talk a lot about Microsoft's Index Server in this article, but the motivation to write it comes from a technology I've already introduced through these pages [See "ADO: Learn to Love It" in the April 1998 issue. -- Ed.], namely ActiveX Data Objects (ADO). These are the objects that are currently replacing DAO and RDO as the primary means of accessing data from Windows programs.

What sets ADO apart from DAO and RDO is that it wasn't designed specifically for relational data; it's a general mechanism for data access. Index Server is a case in point, because it isn't built out of tables. Instead, it provides a catalog of all the words found in a set of documents and allows fast and sophisticated document retrieval searches to be executed on the full text (or just parts) of the documents -- via standard ADO commands.

So while you read this, keep in mind that I'm saying as much about ADO's general approach to data access as I am about Index Server's impressive capabilities. In a future article, I'll take the ADO theme even further and delve into what's involved with writing your own ADO servers in VB. But for now, let me show you how everything you've learned about data access can be applied to something completely different.

Index Server described
Index Server's main function is as a Web search engine. Many of the most popular Web sites use it to catalog their content so that users can get to the right page quickly. You'll see how to build exactly this kind of search engine, but you'll also see that Index Server and ADO can be coupled for a wide range of applications where document indexing is required, both on and off the Web.

As a product, Index Server keeps track of all documents that are placed in a set of nominated directories. It uses spare processing time to process any changes to the documents it tracks, so that it keeps its indexes up to date. The indexes are just like the indexes of a book -- by reading them you find out all references to specific words, so that you can rapidly find relevant material. The main difference is that Index Server will automatically catalog many types of document, including text, HTML, and all Microsoft Office documents. The indexes can then be queried using techniques ranging from simple word lookup to complex proximity searches. Index Server is intended to be a zero-maintenance product -- once it's installed and pointed at its directory spaces, it does its job entirely in the background. Here's an example of a simple Index Server query:

SELECT DocAppName, DocTitle, DocWordCount, HitCount 
FROM SCOPE(' "/IISSamples/ISSamples" ') 
WHERE CONTAINS(' "object" near() "oriented"') > 0 
ORDER BY HitCount DESC


This returns a standard ADO Recordset that can be processed exactly like the result of a database query. You might have noticed that the command structure is suspiciously SQL-like. This is entirely intentional and is aimed at making Index Server queries easy to decipher. However, you'll also have noticed some distinctly un-SQLish constructs that are specific to document retrieval, which we'll explore later.

Index Server comes with a sample Web page that has some good examples of using ADO and Active Server Pages (ASPs). The Web page produces attractive output, but I found one or two bugs in the ASP pages. Fortunately, it was easy to comment out the offending lines, and then the page worked nicely, producing output from the preceding query that can be seen in Figure 1. The page is accessible from the Start menu if you've installed the NT4 Option Pack.

How to get it
If you want to try out the examples described in this article, then you'll naturally need to have Index Server and ADO installed, and for one of the examples, ASPs are required. The good news is that everything you need (apart from VB5 and NT4) is free. The easiest way to get all of these components installed is to get hold of an NT4 server or workstation and install the NT4 Option Pack. This requires NT Service Pack 3 and IE 4 to be installed, but these come as part of the NT4 Option Pack setup. The Option Pack itself is freely downloadable from http://www.microsoft.com/ntserver/nts/downloads/recommended/NT4OptPk/default.asp, and TechNet and MSDN subscribers should also have it on their CDs. You can pick and choose the bits of the Option Pack that you want, so it doesn't need to be a heavyweight installation.

Index Server runs only on NT4 systems, so accessing it from a Win9x system requires the use of DCOM or HTTP to communicate with a server. All the code shown in this article must be executed on an NT4 server or workstation.

A set of standard documents comes with the Index Server installation, but it's more fun if you use your own documents. In a fit of self-promotion, I've taken the text from some of the articles I've previously written for Visual Basic Developer and placed them in the c:/InetSrtv/iissamples/issamples directory on my NT server. This directory is automatically cataloged by Index Server, so it saved me from creating new catalogs.

Index Server and ADO in action
To show how the ADO/Index Server combo works, I'm going to create the same basic program in three forms: a standard VB5 program, an ASP using VBScript, and a VB5 ActiveX Document application. All three programs will execute an Index Server query and display the results using ADO, including the document title and path, the application used to create the document, and the number of hits (that is, the number of times the query term was found in the document). The latter two examples will have an added bonus: Both display inside a Web browser and have hyperlinks that jump directly to a document when its title is clicked. To keep the code samples short, the query term is hard-coded into the command string. It's easy enough to modify the programs to be more generic, as I'm sure you'll realize.

Listing 1 shows all the code required to get started with VB5. The application has only one form, with a single ListView control, which can be added into the project by clicking "Project-Components" and selecting "Microsoft Windows Common Controls 5." The ListView will be used in "report-mode," so be sure to set the "View" property to "lvwReport" in the Properties box.

Listing 1 Using ADO and Index Server from VB.

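Private Sub Form_Load()
   ' Carry on past run-time errors; the query itself is checked below
   On Error Resume Next

   ' Set up the report-mode columns
   ListView1.ColumnHeaders.Add , , "Title", 2000
   ListView1.ColumnHeaders.Add , , "Hits", 500
   ListView1.ColumnHeaders.Add , , "Application", 1500
   ListView1.ColumnHeaders.Add , , "Path", 5000

   ' Late-bound Recordset, to match the VBScript version
   Set rs = CreateObject("ADODB.Recordset")
   sCommand = "SELECT " & _
      " DocTitle,DocAppName,Path,HitCount" & _
      " FROM SCOPE(' ""/IISSamples/ISSamples"" ') " & _
      " WHERE CONTAINS(' ""RDO"" ')> 0 " & _
      " ORDER BY HitCount DESC"

   ' Naming the provider sends the command to Index Server
   rs.Open sCommand, "Provider=MSIDXS"
   If Err.Number <> 0 Then
      ' Without this check, a failed Open would leave the loop spinning
      MsgBox "Query failed: " & Err.Description
      Exit Sub
   End If

   ' Add one ListView row per document found
   While Not rs.EOF
      ' & "" turns a missing (Null) title into an empty string
      Set Item = ListView1.ListItems.Add(, , rs("DocTitle") & "")
      Item.SubItems(1) = rs("HitCount")
      Item.SubItems(2) = rs("DocAppName")
      Item.SubItems(3) = rs("Path")
      rs.MoveNext
   Wend
End Sub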

The most remarkable thing about this program is probably how familiar it looks. If you've been using ADO with databases already, the only new feature will be the syntax of the command string. Even if you're an RDO or DAO user, the code hardly needs any explaining.

One thing to note is the connect string. When you use ADO against a database, you provide exactly the same connect string you'd use with RDO. This is because ADO's default provider is the OLE DB provider for ODBC. To use any other provider, you must name it explicitly, and for Index Server, this means using the string "MSIDXS". Figure 2 shows the resulting display.
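If you plan to run more than one query, it can be tidier to name the provider once on a Connection object and hand that object to each Recordset. What follows is a minimal sketch under that assumption, not part of the sample project: the procedure name RunQuery is hypothetical, and the query is a cut-down version of the one in Listing 1.

```vb
' Sketch only: name the Index Server provider on an explicit Connection.
' RunQuery is a hypothetical name; it needs ADO and Index Server installed.
Private Sub RunQuery()
    Dim cn As Object
    Dim rs As Object

    ' Open a connection through the Index Server provider
    Set cn = CreateObject("ADODB.Connection")
    cn.ConnectionString = "Provider=MSIDXS"
    cn.Open

    ' Pass the open connection instead of a connect string
    Set rs = CreateObject("ADODB.Recordset")
    rs.Open "SELECT DocTitle, HitCount" & _
            " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
            " WHERE CONTAINS(' ""RDO"" ') > 0", cn

    ' List each title and hit count in the Immediate window
    While Not rs.EOF
        Debug.Print rs("DocTitle").Value, rs("HitCount").Value
        rs.MoveNext
    Wend

    rs.Close
    cn.Close
End Sub
```

Call RunQuery from the Immediate window or a button's Click event. The behavior should match passing "Provider=MSIDXS" straight to rs.Open; the only difference is that the connection can be reused.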


Another point is the use of paired double-quotes (for instance, ""RDO""). The Index Server command syntax uses double-quotes to identify search patterns and the like. In a VB program, however, putting a double-quote in a string will terminate the string and potentially confuse the VB compiler. To prevent this unwanted behavior, prefix each double-quote with another double-quote, which forces it to be treated as a literal character and not a string terminator.
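If the doubled quotes become hard to read, one alternative is to build them with Chr$(34), which is the double-quote character. The sketch below also shows one way to stop hard-coding the search term. Both function names (Q and BuildCommand) are hypothetical helpers of my own, not part of ADO or Index Server.

```vb
' Wrap a word or phrase in the double-quotes that Index Server expects.
Private Function Q(ByVal sText As String) As String
    Q = Chr$(34) & sText & Chr$(34)
End Function

' Build the same kind of command as Listing 1 for any search term.
' Assumes sTerm contains no quote characters of its own.
Private Function BuildCommand(ByVal sTerm As String) As String
    BuildCommand = "SELECT DocTitle,DocAppName,Path,HitCount" & _
        " FROM SCOPE(' " & Q("/IISSamples/ISSamples") & " ')" & _
        " WHERE CONTAINS(' " & Q(sTerm) & " ') > 0" & _
        " ORDER BY HitCount DESC"
End Function
```

With these in place, rs.Open BuildCommand("RDO"), "Provider=MSIDXS" runs the same search as Listing 1. A term that itself contains a single or double quote would break the pattern, so strip or reject those characters before calling BuildCommand.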

(If my lack of explicit variable declarations in Listing 1 offends you, my excuse is that I wanted to make my VB code as similar as possible to the VBScript code in the next example, so that comparison between the two listings is easier. I've also used late binding rather than early binding to create the objects used in the code for the same reason. While you're still in a forgiving mood, I'll get straight into the ASP example and go on to look at the Index Server command syntax in more depth later.)

Introducing ASPs
There isn't enough space to go into much detail of using ASPs to run VBScript on a Web server in this article, but Figure 3 shows the basic idea. When a standard HTML file is requested from a Web Server, it's shipped "as is" to the client's browser. However, if the file has an .asp extension (and ASP is enabled), any server-side scripting contained in the file is processed by the ASP DLL and woven into the regular page before being transmitted to the client. The server-side scripting can be as complex as you like and can access server resources such as databases, Index Server (using ADO in both cases), and ActiveX components. As arguments can be passed to an ASP, the result can be very dynamic Web pages. As the output is pure HTML, just about any client platform is supported.


Listing 2 shows the ASP Web page equivalent of the VB program. If you're unfamiliar with ASP code, everything that appears between <% and %> is server-side script, and the rest is standard HTML. When the ASP DLL sees any script, it executes it, and any resulting text is woven into the rest of the file, resulting in pure HTML. The server code logic is strictly followed, so any HTML that appears "inside" a script loop is emitted once for each pass through the loop.

Listing 2 The ASP equivalent of Listing 1.

<HTML>
<HEAD>
<TITLE>ADO / Index Server Document Search Page</TITLE>
</HEAD>
<BODY>
<%  Set rs = Server.CreateObject("ADODB.Recordset")
   sCommand = "SELECT " & _
   " DocTitle, Path, DocAppName, HitCount" & _
   " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
   " WHERE CONTAINS(' ""RDO"" ')>0" & _
   " ORDER BY HitCount DESC"
    rs.Open sCommand,  "Provider=MSIDXS"   
    While Not rs.EOF 
%> 
<BR>     
   <B><% = rs("DocTitle") & "  "   %></B>
   (<% = rs("HitCount")          %>)
   <I><% = rs("DocAppName") & "  " %></I>
   <% = rs("Path") & "  "       %>
 <BR>
<%  rs.MoveNext
    Wend
%>
</BODY>
</HTML>


To run this code, simply place it in your NT Web Server's file space, and make sure that "Execution" is enabled for the directory where you placed the file. Also make sure it has a .asp extension; otherwise, it will be ignored by ASP. You can then navigate to the file using HTTP. The HTML that gets sent to your browser contains each document name in bold, the hit count in brackets, and the application name in italics. The file path appears in regular text.
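Because arguments can be passed to an ASP, the hard-coded search term in Listing 2 can be replaced with one taken from the URL. The following is a rough sketch of the script that would sit inside the <% ... %> block in place of the existing sCommand assignment. The parameter name "q" and the fallback to "RDO" are my own assumptions for illustration.

```vb
' Goes inside the <% ... %> block of Listing 2 (VBScript, server side).
' "q" is a hypothetical parameter name, as in search.asp?q=RDO
sTerm = Trim(Request.QueryString("q"))

' Drop quote characters that would end the search pattern early
sTerm = Replace(sTerm, """", "")
sTerm = Replace(sTerm, "'", "")

' Fall back to the article's search term when nothing was supplied
If sTerm = "" Then sTerm = "RDO"

sCommand = "SELECT DocTitle, Path, DocAppName, HitCount" & _
   " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
   " WHERE CONTAINS(' """ & sTerm & """ ')>0" & _
   " ORDER BY HitCount DESC"
```

Stripping the quotes is only the bare minimum of input checking. A production page would want to validate the term more carefully before passing it to Index Server.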

Rather than display the file path, it would be nicer to provide the user with a direct hyperlink to the file. On a Windows computer, clicking on this hyperlink would display the selected file. This simply means inserting an HTML anchor -- that is, <A HREF="..."> ... </A> -- into the Web page. The following code segment can be used to replace the elements between the "While" and the "MoveNext" in Listing 2 to achieve this aim. This is a bit of a cheat because it assumes that the browser is being used on the Web server. To make it work correctly from a remote browser, the path would need to be massaged slightly to refer to the server name instead of "C:/" -- but I'm sure you get the idea:

<BR>
   <A HREF="<% = rs("Path") %>">
   <% = rs("DocTitle") & "  "   %></A>
   (<% = rs("HitCount")          %>)
   <I><% = rs("DocAppName") & "  " %></I>
<BR>
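The path "massaging" needed for a remote browser could look something like the sketch below. PathToUrl is a hypothetical helper, and both the local root and the URL root are values you'd have to supply for your own server. It's written for VB5, which has no Replace function, so the backslashes are swapped one character at a time.

```vb
' Hypothetical helper: map a local file path returned by Index Server
' onto a URL. sLocalRoot and sUrlRoot are assumptions for your own setup.
Private Function PathToUrl(ByVal sPath As String, _
                           ByVal sLocalRoot As String, _
                           ByVal sUrlRoot As String) As String
    Dim sRest As String
    Dim i As Long

    ' Leave the path alone if it doesn't sit under the expected root
    If LCase$(Left$(sPath, Len(sLocalRoot))) <> LCase$(sLocalRoot) Then
        PathToUrl = sPath
        Exit Function
    End If

    ' Keep what follows the root, turning backslashes into forward slashes
    sRest = Mid$(sPath, Len(sLocalRoot) + 1)
    For i = 1 To Len(sRest)
        If Mid$(sRest, i, 1) = "\" Then Mid$(sRest, i, 1) = "/"
    Next i

    PathToUrl = sUrlRoot & sRest
End Function
```

For example, PathToUrl("c:\docs\samples\rdo.htm", "c:\docs", "http://myserver/docs") returns "http://myserver/docs/samples/rdo.htm". The sketch doesn't encode spaces or other special characters. Table 1 also lists a Vpath column, which may save you the conversion if it holds the virtual path you need.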


Before looking at the ActiveX Document example, it's worth exploring the Index Server command syntax. Index Server queries can be very sophisticated, and the online help will take you through its intricacies. You can get a feel for what's possible, though, just by exploring a few of the main features.

You can include any of the columns shown in Table 1 in the SELECT list or ORDER BY clause in an Index Server command, much as you would in a standard SQL statement.

Table 1. Standard Index Server columns.
Access          AllocSize       Attrib          Characterization
ClassId         Create          DocAppName      DocAuthor
DocCharCount    DocComments     DocCreateDTM    DocEditTime
DocKeywords     DocLastAuthor   DocLastPrinted  DocLastSaveDTM
DocPageCount    DocRevNumber    DocSecurity     DocSubject
DocTemplate     DocTitle        DocWordCount    FileIndex
Filename        HitCount        Path            Rank
ShortFileName   Size            USN             Vpath
WorkId          Write


Where things start to differ is in the FROM clause, because all of a sudden you become very aware that you aren't dealing with tables.

The FROM clause is typically a SCOPE() specification that identifies the virtual roots or directories in which to perform the search. By default, all the subdirectories of the virtual root are searched, but you can restrict the search to exclude subdirectories. Although all the scope examples I've used have only one virtual root, multiple roots can be included in the SCOPE statement. Alternatively, the SCOPE arguments can be left empty, in which case all file spaces tracked by Index Server are searched. There are also a number of predefined views that can be used in FROM clauses.
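To make the scope options concrete, here's a small sketch that builds a SCOPE() clause for a single root. ScopeClause is a hypothetical helper. The DEEP TRAVERSAL OF and SHALLOW TRAVERSAL OF keywords are how I recall the Index Server query documentation expressing "include" and "exclude" subdirectories, so treat them as an assumption and check the online help before relying on them.

```vb
' Hypothetical helper: build a one-root SCOPE() clause.
' Assumption: DEEP / SHALLOW TRAVERSAL OF are the traversal keywords.
Private Function ScopeClause(ByVal sRoot As String, _
                             ByVal bSubDirs As Boolean) As String
    Dim sTraversal As String

    ' An empty scope searches everything Index Server tracks
    If sRoot = "" Then
        ScopeClause = "SCOPE()"
        Exit Function
    End If

    If bSubDirs Then
        sTraversal = "DEEP TRAVERSAL OF "
    Else
        sTraversal = "SHALLOW TRAVERSAL OF "
    End If

    ' Chr$(34) supplies the double-quotes around the virtual root
    ScopeClause = "SCOPE(' " & sTraversal & _
        Chr$(34) & sRoot & Chr$(34) & " ')"
End Function
```

ScopeClause("/IISSamples/ISSamples", False) produces a clause intended to search that virtual root without descending into its subdirectories. The result is used after FROM in the command string.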

The most interesting part of the command is the WHERE clause. This can be a simple comparison, such as DocAuthor = 'Rob Macdonald', but the real power of the search engine starts to become apparent when using the CONTAINS statement. A simple CONTAINS statement might look for one or a combination of words in the contents of a document, but it could also look in any part of the document; for example:

WHERE CONTAINS(DocTitle, ' "Index" AND "Server" ') > 0


Alternatively, a proximity search will look for one word near another:

WHERE CONTAINS(' "Visual" Near() "Basic" ') > 0


More sophisticated still are the so-called "fuzzy" searches, such as:

WHERE CONTAINS('FORMSOF(INFLECTIONAL,"drive") ') > 0


Such searches go beyond simple wildcards to examine the root of a word and will match against "drive", "driving", "driven", or "drives". Apart from the weird syntax, it's really very easy to exploit this much power.

My favorite, however, is the FREETEXT search, which works like "Mr. Clippy" of Office Assistant fame when you type a question into the Help balloon. FREETEXT searches analyze the meaning of the search criteria as well as the words they contain. For example, the following WHERE clause generates the results shown in Figure 4:

WHERE FREETEXT 
('Does early binding improve performance ?') > 0


Here, the number of hits for the most relevant documents can be very high.
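The search styles described above differ only in the text of the WHERE clause, so a program can switch between them by building that one piece of the command. The sketch below is a hypothetical helper of my own. The style names ("near", "inflect", "freetext") are labels I've made up, and the clause text simply follows the examples shown in this article.

```vb
' Hypothetical helper: build a WHERE clause in one of the styles shown above.
' Assumes sA and sB contain no quote characters of their own.
Private Function WhereClause(ByVal sStyle As String, _
                             ByVal sA As String, _
                             Optional ByVal sB As String = "") As String
    Dim sQ As String
    sQ = Chr$(34)                       ' the double-quote character

    Select Case LCase$(sStyle)
        Case "near"                     ' proximity search on two words
            WhereClause = "WHERE CONTAINS(' " & sQ & sA & sQ & _
                " NEAR() " & sQ & sB & sQ & " ') > 0"
        Case "inflect"                  ' match other forms of the word
            WhereClause = "WHERE CONTAINS('FORMSOF(INFLECTIONAL," & _
                sQ & sA & sQ & ") ') > 0"
        Case "freetext"                 ' plain-language question
            WhereClause = "WHERE FREETEXT('" & sA & "') > 0"
        Case Else                       ' simple word lookup
            WhereClause = "WHERE CONTAINS(' " & sQ & sA & sQ & " ') > 0"
    End Select
End Function
```

For instance, WhereClause("near", "Visual", "Basic") gives the proximity clause shown earlier, which can then be placed between the FROM and ORDER BY parts of the command string.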

The VB part
I'm going to wrap up this article by returning to VB code and converting the program described earlier into an ActiveX Document. This will create a VB program that can display itself in a Web browser such as Internet Explorer. If you've never created an ActiveX Document in VB before, it might be worth having a go now, because as you'll see, it could hardly be easier. There are eight steps:

1. Open or create a project containing the code shown in Listing 1. Check that it works as a standard VB project. Name the form "frmSearch."

2. From the Add-Ins menu, select "Add-In Manager" and check "VB ActiveX Document Migration Wizard."

3. Click on the Add-Ins menu again and select "ActiveX Document Migration Wizard."

4. When the wizard appears, click "Next," select the form name, and then click "Finish."

5. From the Project menu, select Properties. Give the project a name (for instance, ADOSearch), and make sure the Start Up object is set to "(None)". Change the Project Type to "ActiveX DLL." Click OK.

6. Note the name of the "UserDocument" object that the wizard created (it's probably docSearch). Open up the UserDocument object and double-click on the list view control. Add the following code to its "ItemClick" event procedure:

'navigate to the path of the clicked item
UserDocument.Hyperlink.NavigateTo Item.SubItems(3)


7. From the File menu, compile the project.

8. The compilation will create an ActiveX Document with a .vbd extension (it should be called docSearch.vbd). Start Internet Explorer and use "File-Open" to navigate to your .vbd file.

The VB ActiveX Document should now be displayed in your browser. The extra line of code added in Step 6 provides hyperlinking to any file name that you click on. It's my guess that if you've read this far, it's only because you're dying to find out what "The RDO Song" is that appeared in Figure 2, so as a reward, Figure 5 shows IE4 displaying the ActiveX Document, and next to it, the result of hyperlinking to "The RDO Song."
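For reference, the complete event procedure from Step 6 might look something like this. It's a sketch that assumes the ListView kept its default name (ListView1), that the project references the version 5 common controls (whose library name is ComctlLib), and that the columns are in the order used in Listing 1. The guard and the error handler are my own additions.

```vb
' Sketch of the whole ItemClick procedure inside the UserDocument.
' Assumes the control is named ListView1 and the path is subitem 3.
Private Sub ListView1_ItemClick(ByVal Item As ComctlLib.ListItem)
    Dim sPath As String

    sPath = Item.SubItems(3)
    If Len(sPath) = 0 Then Exit Sub     ' nothing to navigate to

    ' Navigate the hosting browser to the path of the clicked item
    On Error GoTo NavigateFailed
    UserDocument.Hyperlink.NavigateTo sPath
    Exit Sub

NavigateFailed:
    MsgBox "Couldn't open " & sPath & ": " & Err.Description
End Sub
```

If you double-click the control in the designer as described in Step 6, VB generates the Sub and End Sub lines for you once you pick ItemClick from the procedure list, and you only need to fill in the body.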

Conclusion
I hope this article has been more than just a showcase for Index Server. Index Server is indeed impressive, and it provides new ways for VB programmers to get at data. But the really impressive feature of this tour is that you didn't need to learn any new programming tricks to get data from Index Server. A quick scan of the documentation for creating Index Server queries is all that's needed, as all the rest of the data access code used an existing programming model, ADO. Expect to see more and more applications of ADO as the industry gears up to providing services through its generic interface.

If you've yet to investigate how well your VB training has prepared you for doing "Web stuff," I hope the examples included here have whetted your appetite. [You might want to download the "Microsoft Index Server" and "Index Server 2.0: What's New and Changed" white papers from www.microsoft.com/NTServer/Basics/TechPapers. -- Ed.]



Rob Macdonald is an independent software specialist based in London and southern England. In addition to consulting and training in Windows, client/server, VB, COM, and systems design and management, he also runs the UK ODBC User Group and is author of RDO and ODBC: Client Server Database Programming with Visual Basic, published by Pinnacle. Rob can be contacted in England at +44 1722 782 433 or via e-mail at rob@salterson.com.