A Catalog of Success
Rob Macdonald
Curious about what Microsoft's
Index Server can do for you? Let Rob explain. He'll show you the differences
among three approaches by creating the same Index Server application three
different ways: with a standard VB5 program, an Active Server Page using
VBScript, and a VB5 ActiveX Document application.
I'm going to talk a lot about Microsoft's Index Server in this article,
but the motivation to write it comes from a technology I've already introduced
through these pages [See "ADO: Learn to Love It" in the April
1998 issue. -- Ed.], namely ActiveX Data Objects (ADO). These are the
objects that are currently replacing DAO and RDO as the primary means of
accessing data from Windows programs.
What sets ADO apart from these others is that it isn't designed specifically
to work with relational data; it's designed as a general mechanism
for data access. Index Server isn't built out of tables. Instead, it provides
a catalog of all the words found in a set of documents and allows
fast and sophisticated document retrieval searches to be executed on the
full text (or just parts) of the documents -- via standard ADO commands.
So while you read this, keep in mind that I'm saying as much about ADO's
general approach to data access as I am about Index Server's impressive
capabilities. In a future article, I'll take the ADO theme even further
and delve into what's involved with writing your own ADO servers in VB.
But for now, let me show you how everything you've learned about data access
can be applied to something completely different.
Index Server described
Index Server's main function is as a Web
search engine. Many of the most popular Web sites use it to catalog their
content so that users can get to the right page quickly. You'll see how
to build exactly this kind of search engine, but you'll also see that Index
Server and ADO can be coupled for a wide range of applications where document
indexing is required, both on and off the Web.
As a product, Index Server keeps track of all documents that are placed
in a set of nominated directories. It uses spare processing time to process
any changes to the documents it tracks, so that it keeps its indexes up
to date. The indexes are just like the indexes of a book -- by reading them
you find out all references to specific words, so that you can rapidly find
relevant material. The main difference is that Index Server will automatically
catalog many types of document, including text, HTML, and all Microsoft
Office documents. The indexes can then be queried using techniques ranging
from simple word lookup to complex proximity searches. Index Server is intended
to be a zero-maintenance product -- once it's installed and pointed at its
directory spaces, it does its job entirely in the background. Here's an
example of a simple Index Server query:
SELECT DocAppName, DocTitle, DocWordCount, HitCount
FROM SCOPE(' "/IISSamples/ISSamples" ')
WHERE CONTAINS(' "object" near() "oriented"') > 0
ORDER BY HitCount DESC
This returns a standard ADO Recordset that can be processed exactly like
a database query. You might have noticed that the command structure is suspiciously
SQL-like. This is entirely intentional and is aimed at making Index Server
queries easy to decipher. However, you'll also have noticed some distinctly
un-SQLish constructs that are specific to document retrieval, which we'll
explore later.
Index Server comes with a sample Web page that has some good examples of
using ADO and Active Server Pages (ASPs). The Web page produces attractive
output, but I found one or two bugs in the ASP code. Fortunately, it was
easy to comment out the offending lines, and then the page worked nicely,
producing the output from the preceding query shown in Figure 1. The page
is accessible from the Start menu if you've installed the NT4 Option Pack.
How to get it
If you want to try out the examples described
in this article, then you'll naturally need to have Index Server and ADO
installed, and for one of the examples, ASPs are required. The good news
is that everything you need (apart from VB5 and NT4) is free. The easiest
way to get all of these components installed is to get hold of an NT4 server
or workstation and install the NT4 Option Pack. This requires NT Service
Pack 3 and IE 4 to be installed, but these come as part of the NT4 Option
Pack setup. The Option Pack itself is freely downloadable from http://www.microsoft.com/ntserver/nts/downloads/recommended/NT4OptPk/default.asp, and TechNet and MSDN subscribers should also have it
on their CDs. You can pick and choose the bits of the Option Pack that you
want, so it doesn't need to be a heavyweight installation.
Index Server only runs on NT4 systems, so accessing it from a Win9x system
requires the use of DCOM or HTTP to communicate with a server. All the code
shown in this article must be executed on an NT4 server or workstation.
A set of standard documents comes with the Index Server installation, but
it's more fun if you use your own documents. In a fit of self-promotion,
I've taken the text from some of the articles I've previously written for
Visual Basic Developer and placed them in the c:/InetSrtv/iissamples/issamples
directory on my NT server. This directory is automatically cataloged by
Index Server, so it saved me from creating new catalogs.
Index Server and ADO in action
To show how the ADO/Index Server combo
works, I'm going to create the same basic program in three forms: a standard
VB5 program, an ASP using VBScript, and a VB5 ActiveX Document application.
All three programs will execute an Index Server query and display the results
using ADO, including the document title and path, the application used to
create the document, and the number of hits (that is, the number of times
the query term was found in the document). The latter two examples will
have an added bonus: Both display inside a Web browser and have hyperlinks
that jump directly to a document when its title is clicked. To keep the
code samples short, the query term is hard-coded into the command string.
It's easy enough to modify the programs to be more generic, as I'm sure
you'll realize.
Listing 1 shows all the code required to get started with VB5. The application has
only one form, with a single ListView control, which can be added into the
project by clicking "Project-Components" and selecting "Microsoft
Windows Common Controls 5." The ListView will be used in "report-mode,"
so be sure to set the "View" property to "lvwReport"
in the Properties box.
Listing 1 Using ADO and Index Server from VB.
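Private Sub Form_Load()
    ' Skip over errors (for example, Null column values) to keep the sample short.
    On Error Resume Next
    ' Set up the report-view columns.
    ListView1.ColumnHeaders.Add , , "Title", 2000
    ListView1.ColumnHeaders.Add , , "Hits", 500
    ListView1.ColumnHeaders.Add , , "Application", 1500
    ListView1.ColumnHeaders.Add , , "Path", 5000
    Set rs = CreateObject("ADODB.Recordset")
    sCommand = "SELECT " & _
        " DocTitle,DocAppName,Path,HitCount" & _
        " FROM SCOPE(' ""/IISSamples/ISSamples"" ') " & _
        " WHERE CONTAINS(' ""RDO"" ')> 0 " & _
        " ORDER BY HitCount DESC"
    ' Run the query through the Index Server provider.
    rs.Open sCommand, "Provider=MSIDXS"
    ' Stop here if the query failed; with errors being skipped, the loop
    ' below could otherwise spin forever on a closed Recordset.
    If Err.Number <> 0 Then Exit Sub
    ' Add one ListView row per matching document.
    While Not rs.EOF
        Set Item = ListView1.ListItems.Add(, , rs("DocTitle"))
        Item.SubItems(1) = rs("HitCount")
        Item.SubItems(2) = rs("DocAppName")
        Item.SubItems(3) = rs("Path")
        rs.MoveNext
    Wend
    rs.Close
End Sub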
The most remarkable thing about this program is probably how familiar it
looks. If you've been using ADO with databases already, the only new feature
will be the syntax of the command string. Even if you're an RDO or DAO user,
the code hardly needs any explaining.
One thing to note is the connect string. When you use ADO against a database,
you provide exactly the same connect string you'd use with RDO. This is
because the default provider for ADO is the OLE DB provider for ODBC.
To use any other provider, you must name it explicitly, and
for Index Server, this means using the provider name "MSIDXS". Figure 2 shows
the resulting display.
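If you prefer an explicit connection, here's a minimal sketch of the same idea using an ADO Connection object rather than passing the connect string to rs.Open. The procedure name RunIndexQuery and the shortened query are mine, not part of Listing 1. Like Listing 1, it uses late binding and undeclared variables, so it assumes the module doesn't use Option Explicit:

```vb
Private Sub RunIndexQuery()
    ' Hypothetical helper, not part of Listing 1.
    Set cn = CreateObject("ADODB.Connection")
    cn.Provider = "MSIDXS"      ' name the provider before opening
    cn.Open

    ' SCOPE() with no arguments searches everything Index Server tracks.
    sCommand = "SELECT DocTitle, HitCount" & _
               " FROM SCOPE()" & _
               " WHERE CONTAINS(' ""RDO"" ') > 0" & _
               " ORDER BY HitCount DESC"
    Set rs = cn.Execute(sCommand)

    ' Print each match to the Immediate window.
    Do While Not rs.EOF
        Debug.Print rs("DocTitle").Value, rs("HitCount").Value
        rs.MoveNext
    Loop

    rs.Close
    cn.Close
End Sub
```

Setting the Provider property does the same job as the "Provider=MSIDXS" string in Listing 1. A separate Connection object is mainly useful when you want to run several queries without naming the provider each time.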
Another point is the use of paired double-quotes (for instance, ""RDO"").
The Index Server command syntax uses double-quotes to identify search patterns
and the like. In a VB program, however, putting a double-quote in a string
will terminate the string and potentially confuse the VB compiler. To prevent
this unwanted behavior, prefix each double-quote with another double-quote,
which forces it to be treated as a literal character and not a string terminator.
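To make the quoting rule concrete, here's a small sketch showing three equivalent ways to get a literal double-quote into a VB string. The procedure name ShowQuoting and the constant DQ are just illustrative names of mine:

```vb
Private Sub ShowQuoting()
    ' A string constant holding a single double-quote character.
    Const DQ = """"

    ' 1. Paired double-quotes, as used in Listing 1.
    s1 = " WHERE CONTAINS(' ""RDO"" ') > 0 "
    ' 2. Chr(34) is the double-quote character.
    s2 = " WHERE CONTAINS(' " & Chr(34) & "RDO" & Chr(34) & " ') > 0 "
    ' 3. The DQ constant declared above.
    s3 = " WHERE CONTAINS(' " & DQ & "RDO" & DQ & " ') > 0 "

    ' All three strings are identical, so both comparisons print True.
    Debug.Print s1 = s2, s2 = s3
End Sub
```

The paired form is the most compact. The Chr(34) or constant forms can be easier to read when a search term has to be spliced into the middle of the command string.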
(If my lack of explicit variable declarations in Listing 1 offends you,
my excuse is that I wanted to make my VB code as similar as possible to
the VBScript code in the next example, so that comparison between the two
listings is easier. I've also used late binding rather than early binding
to create the objects used in the code for the same reason. While you're
still in a forgiving mood, I'll get straight into the ASP example and go
on to look at the Index Server command syntax in more depth later.)
Introducing ASPs
There isn't enough space in this article to go into much detail about
using ASPs to run VBScript on a Web server, but Figure 3 shows the basic
idea. When a standard HTML file is requested from a Web server, it's shipped
"as is" to the client's browser. However, if the file has an .asp
extension (and ASP is enabled), any server-side scripting contained in the
file is processed by the ASP DLL and woven into the regular page before
being transmitted to the client. The server-side scripting can be as complex
as you like and can access server resources such as databases, Index Server
(using ADO in both cases), and ActiveX components. Because arguments can be
passed to an ASP, the resulting Web pages can be very dynamic. And because
the output is pure HTML, just about any client platform is supported.
Listing 2 shows the ASP Web page equivalent of the VB program. If you're
unfamiliar with ASP code, everything that appears between <% and %> is
server-side script, and the rest is standard HTML. When the ASP DLL sees any
script, it executes it, and any resulting text is woven into the rest of the
file, resulting in pure HTML. The server code logic is strictly followed, so
any HTML that appears "inside" a script loop is emitted once for each pass
through the loop.
Listing 2 The ASP equivalent of Listing 1.
<HTML>
<HEAD>
<TITLE>ADO / Index Server Document Search Page</TITLE>
</HEAD>
<BODY>
<% ' Run the same query as Listing 1, this time on the Web server.
   Set rs = Server.CreateObject("ADODB.Recordset")
   sCommand = "SELECT " & _
      " DocTitle, Path, DocAppName, HitCount" & _
      " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
      " WHERE CONTAINS(' ""RDO"" ')>0" & _
      " ORDER BY HitCount DESC"
   rs.Open sCommand, "Provider=MSIDXS"
   ' The HTML below is sent once for each matching document.
   ' Appending " " turns a Null column value into a harmless space.
   While Not rs.EOF
%>
<BR>
<B><% = rs("DocTitle") & " " %></B>
(<% = rs("HitCount") %>)
<I><% = rs("DocAppName") & " " %></I>
<% = rs("Path") & " " %>
<BR>
<% rs.MoveNext
   Wend
   rs.Close
%>
</BODY>
</HTML>
To run this code, simply place it in your NT Web Server's file space, and
make sure that "Execution" is enabled for the directory where
you placed the file. Also make sure it has a .asp extension; otherwise,
it will be ignored by ASP. You can then navigate to the file using HTTP.
The HTML that gets sent to your browser contains each document name in bold,
the hit count in parentheses, and the application name in italics. The file
path appears in regular text.
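As a rough sketch of the "more generic" version mentioned earlier, the search term could be taken from the URL instead of being hard-coded. The parameter name q and the fallback to "RDO" are my assumptions. The quote-stripping is only a basic guard that stops the term from ending the CONTAINS string early; it isn't a complete input check. These VBScript lines would take the place of the sCommand assignment in Listing 2:

```vb
' "q" is a hypothetical query-string parameter, as in ...asp?q=DAO
sTerm = Trim(Request.QueryString("q"))

' Remove quote characters so the term can't close the CONTAINS string.
sTerm = Replace(sTerm, """", "")
sTerm = Replace(sTerm, "'", "")

' Fall back to the article's hard-coded term if nothing usable was supplied.
If sTerm = "" Then sTerm = "RDO"

sCommand = "SELECT " & _
   " DocTitle, Path, DocAppName, HitCount" & _
   " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
   " WHERE CONTAINS(' """ & sTerm & """ ')>0" & _
   " ORDER BY HitCount DESC"
```

Note that Replace is a VBScript function here. VB5 itself doesn't provide it, so the same change to Listing 1 would need a small hand-written helper.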
Rather than display the file path, it would be nicer to provide the user
with a direct hyperlink to the file. On a Windows computer, clicking on
this hyperlink would display the selected file. This simply means inserting
an HTML anchor -- that is, <A HREF=...> ... </A> -- into the Web
page. The following code segment can be used to replace the elements between
the "While" and the "MoveNext" in Listing 2 to achieve
this aim. This is a bit of a cheat because it assumes that the browser is
being used on the Web server. To make it work correctly from a remote browser,
the path would need to be massaged slightly to refer to the server name
instead of "C:/" -- but I'm sure you get the idea:
<BR>
<A HREF="<% = rs("Path") %>">
<% = rs("DocTitle") & " " %></A>
(<% = rs("HitCount") %>)
<I><% = rs("DocAppName") & " " %></I>
<BR>
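One possible way to handle the remote-browser case mentioned above is to ask Index Server for the Vpath column (listed in Table 1) and link to that instead of the physical path. This is a sketch rather than something I've checked against every configuration. It assumes the documents sit under a virtual root, as they do for /IISSamples/ISSamples, so that Vpath is filled in. Vpath replaces Path in the SELECT list, and the anchor uses it as a server-relative URL:

```asp
<% ' Same query as Listing 2, but selecting Vpath instead of Path.
   sCommand = "SELECT " & _
      " DocTitle, Vpath, DocAppName, HitCount" & _
      " FROM SCOPE(' ""/IISSamples/ISSamples"" ')" & _
      " WHERE CONTAINS(' ""RDO"" ')>0" & _
      " ORDER BY HitCount DESC"
%>

<BR>
<A HREF="<% = rs("Vpath") %>">
<% = rs("DocTitle") & " " %></A>
(<% = rs("HitCount") %>)
<I><% = rs("DocAppName") & " " %></I>
<BR>
```

Because the link is relative to the Web server rather than to a drive letter, it should resolve from any browser that can reach the server.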
Before looking at the ActiveX Document example, it's worth exploring the
Index Server command syntax. Index Server queries can be very sophisticated,
and the online help will take you through its intricacies. You can get a
feel for what's possible, though, just by exploring a few of the main features.
You can include any of the columns shown in Table 1 in the SELECT list or
ORDER BY clause in an Index Server command, much as you would in a standard
SQL statement.
Table 1. Standard Index Server columns.

Access              DocRevNumber
AllocSize           DocSecurity
Attrib              DocSubject
Characterization    DocTemplate
ClassId             DocTitle
Create              DocWordCount
DocAppName          FileIndex
DocAuthor           Filename
DocCharCount        HitCount
DocComments         Path
DocCreateDTM        Rank
DocEditTime         ShortFileName
DocKeywords         Size
DocLastAuthor       USN
DocLastPrinted      Vpath
DocLastSaveDTM      WorkId
DocPageCount        Write
Where things start to differ is in the FROM clause, because all of a sudden
you become very aware that you aren't dealing with tables.
The FROM clause is typically a SCOPE() specification that identifies the
virtual roots or directories in which to perform the search. By default, all the
subdirectories of the virtual root are searched, but you can restrict the
search to exclude subdirectories. Although all the scope examples I've used
have only one virtual root, multiple roots can be included in the SCOPE
statement. Alternatively, the SCOPE arguments can be left empty, in which
case all file spaces tracked by Index Server are searched. There are also
a number of predefined views that can be used in FROM clauses.
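To make those options concrete, here are a few FROM clauses written as VB strings. The virtual roots /Specs and /Archive are made up for illustration. The traversal keywords are as I understand them from the Index Server SQL documentation, so check the online help for the exact grammar your version accepts:

```vb
Private Sub ShowScopes()
    ' Default: search the virtual root and all of its subdirectories.
    sFrom1 = " FROM SCOPE(' ""/IISSamples/ISSamples"" ') "

    ' Exclude subdirectories by asking for a shallow traversal.
    sFrom2 = " FROM SCOPE(' SHALLOW TRAVERSAL OF ""/IISSamples/ISSamples"" ') "

    ' Several roots in one scope (hypothetical virtual roots).
    sFrom3 = " FROM SCOPE(' DEEP TRAVERSAL OF ( ""/Specs"", ""/Archive"" ) ') "

    ' Empty scope: every file space that Index Server tracks.
    sFrom4 = " FROM SCOPE() "

    ' Show the finished clauses in the Immediate window.
    Debug.Print sFrom1
    Debug.Print sFrom2
    Debug.Print sFrom3
    Debug.Print sFrom4
End Sub
```

Any of these strings can be dropped into the sCommand assignment in Listing 1 in place of its FROM line.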
The most interesting part of the command is the WHERE clause. This can be
a simple comparison, such as DocAuthor = 'Rob Macdonald', but the real power
of the search engine starts to become apparent when using the CONTAINS statement.
A simple CONTAINS statement might look for one or a combination of words
in the contents of a document, but it could also look in any part of the
document; for example:
WHERE CONTAINS(DocTitle, ' "Index" AND "Server" ') > 0
Alternatively, a proximity search will look for one word near another:
WHERE CONTAINS(' "Visual" Near() "Basic" ') > 0
More sophisticated still are the so-called "fuzzy" searches, such
as:
WHERE CONTAINS('FORMSOF(INFLECTIONAL,"drive") ') > 0
Such searches go beyond simple wildcards to examine the root of a word and
will match against "drive", "driving", "driven",
or "drives". Apart from the weird syntax, it's really very easy
to exploit this much power.
My favorite, however, is the FREETEXT search, which works like "Mr.
Clippy" of Office Assistant fame when you type a question into the
Help balloon. FREETEXT searches analyze the meaning of the search criteria
as well as the words they contain. For example, the following WHERE
clause generates the results shown in Figure 4:
WHERE FREETEXT
('Does early binding improve performance ?') > 0
Here, the number of hits for the most relevant documents can be very high.
The VB part
I'm going to wrap up this article by returning
to VB code and converting the program described earlier into an ActiveX
Document. This will create a VB program that can display itself in a Web
browser such as Internet Explorer. If you've never created an ActiveX Document
in VB before, it might be worth having a go now, because as you'll see,
it could hardly be easier. There are eight steps:
1. Open or create a project containing the code shown in Listing 1. Check
that it works as a standard VB project. Name the form "frmSearch."
2. From the Add-Ins menu, select "Add-In Manager" and check "VB
ActiveX Document Migration Wizard."
3. Click on the Add-Ins menu again and select "ActiveX Document Migration
Wizard."
4. When the wizard appears, click "Next," select the form name,
and then click "Finish."
5. From the Project menu, select Properties. Give the project a name (for
instance, ADOSearch), and make sure the Start Up object is set to "(None)".
Change the Project Type to "ActiveX DLL." Click OK.
6. Note the name of the "UserDocument" object that the wizard
created (it's probably docSearch). Open up the UserDocument object and double-click
on the list view control. Add the following code to its "ItemClick"
event procedure:
'navigate to the path of the clicked item
UserDocument.Hyperlink.NavigateTo Item.SubItems(3)
7. From the File menu, compile the project.
8. The compilation will create an ActiveX Document with a .vbd extension
(it should be called docSearch.vbd). Start Internet Explorer and use "File-Open"
to navigate to your .vbd file.
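For reference, the complete event procedure from Step 6 might look something like the sketch below. It assumes the list view kept its default name, ListView1, and that the project references the Common Controls 5 library, whose type library name is ComctlLib. The empty-path check and the error message are additions of mine rather than part of the original one-liner:

```vb
Private Sub ListView1_ItemClick(ByVal Item As ComctlLib.ListItem)
    ' SubItems(3) is the "Path" column filled in by the Listing 1 code.
    sPath = Item.SubItems(3)
    If sPath = "" Then Exit Sub     ' nothing to navigate to

    On Error Resume Next
    ' navigate to the path of the clicked item
    UserDocument.Hyperlink.NavigateTo sPath
    If Err.Number <> 0 Then
        MsgBox "Couldn't open " & sPath & ": " & Err.Description
    End If
End Sub
```

Reporting the error is optional, but it makes it easier to see what went wrong if a cataloged file has since been moved or deleted.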
The VB ActiveX Document should now be displayed in your browser. The extra
line of code added in Step 6 provides hyperlinking to any file name that
you click on. It's my guess that if you've read this far, it's only because
you're dying to find out what "The RDO Song" that appeared in Figure 2
is, so as a reward, Figure 5 shows IE4 displaying the ActiveX Document,
and next to it, the result of hyperlinking to "The RDO Song."
Conclusion
I hope this article has been more than
just a showcase for Index Server. Index Server is indeed impressive, and
it provides new ways for VB programmers to get at data. But the really impressive
feature of this tour is that you didn't need to learn any new programming
tricks to get data from Index Server. A quick scan of the documentation
for creating Index Server queries is all that's needed, as all the rest
of the data access code used an existing programming model, ADO. Expect
to see more and more applications of ADO as the industry gears up to provide
services through its generic interface.
If you've yet to investigate how well your VB training has prepared you
for doing "Web stuff," I hope the examples included here have
whetted your appetite. [You might want to download the "Microsoft
Index Server" and "Index Server 2.0: What's New and Changed"
white papers from www.microsoft.com/NTServer/Basics/TechPapers. -- Ed.]
Rob Macdonald is an independent
software specialist based in London and southern England. In addition to
consulting and training in Windows, client/server, VB, COM, and systems
design and management, he also runs the UK ODBC User Group and is author
of RDO and ODBC: Client Server
Database Programming with Visual Basic, published by Pinnacle. Rob can
be contacted in England at +44 1722 782 433 or via e-mail at rob@salterson.com.