Building a Search Engine for the Duwamish Books, Phase 4 Database

David Willson
Microsoft Developer Network

June 1999

Summary: Discusses the design and development of a fast, efficient search engine for the Duwamish Books, Phase 4 database. (7 printed pages)

Introduction

The Phase 3.5 system was a proprietary three-tier system. The end users of the client application were clerks and administrators who could be trained to understand the most efficient way to use the system. The design of the Duwamish Books, Phase 3.5 database was sufficient for its purpose.

In Phase 4 we have moved to a Web-based multitier system. Our end users are no longer just the clerks and administrators. The customers are now end users. We cannot expect our customers to be patient or to learn how to use our search engine. Therefore, speed and ease of use have become the focus of our system design.

We want our Phase 4 database, our search engine in particular, to operate as fast as it can. With millions of items in the catalog, and potentially tens of thousands of customer transactions to track and record, we needed to carefully design the indexes and procedures so that server I/O is minimized. However, we also need our search engine to be flexible enough to handle different search techniques, including keyword search and natural language queries.

Optimizing the Duwamish Database

The Duwamish Books, Phase 4 database is part catalog and part order entry system. The two parts work together, but they function in different ways. We will need to identify these differences to choose the right optimization strategy.

The data in the catalog tables is generally static. We insert the items once, but make many selections. We will need to choose an optimization strategy that will produce the best results for selection as opposed to insertion.

The volume of data in the order entry part of the system is constantly changing, but it needs to be just as accessible as the catalog. For this reason we need to optimize this part of the system for both insert and selection.

Frequently, Phase 4 stored procedures and English Query relationships will extract data from many tables at a time. This is accomplished by using join logic. We will need to optimize the performance of the stored procedure join clauses and English Query entity relationships where join logic is used.

Designing Procedures

Without stored procedures it would be necessary for client programs to access the tables and views in the database directly. This is referred to as ad hoc querying. Ad hoc querying is not a bad technique, but there can be serious negative consequences to this type of implementation. Any time a change is made to the database, such as object permissions, table structure, column data types to name a few, all client programs that reference the affected objects would need to be modified accordingly and then recompiled.

We continue to use stored procedures as our interface to the database objects in our Phase 4 implementation. By revising the join logic in the stored procedures we were able to preserve the entity relationship model that we had in our Phase 3.5 database while allowing us to design a more efficient and normalized database. This means that even though the tables and columns have been rearranged, the signature and output of the stored procedures remains unchanged from Phase 3.5 to Phase 4.

As the database becomes more normalized, the join logic becomes more important. (For a discussion of the normalization of the Phase 4 database, see "Normalizing the Duwamish Books, Phase 4 Database.") It is important to know the difference between an inner join and an outer join. Using the wrong kind of join will return too few or too many records. The inner join will yield a result only when there is a matching record in both tables. The outer join will yield a result depending on whether the matching record in the second table (the outer table) is present. If the matching record is not present for an outer join, all columns in the selection from the outer table will be set to null.

Designing the Keyword Search Tables

The Phase 3.5 keyword search used a single column stream of keywords. There are several problems with this type of implementation. One of the problems is that all records and each word must be scanned during a search to find a matching keyword. This is done using the "like" operator in the search. An index applied on this column would be useless because only the first word in each list would be indexed. Another problem is inconsistent case or variation in spellings that prohibit a complete resultset.

A more versatile keyword search needed to be developed for the Phase 4 Web implementation. With more items in the database to search through, searches using the "like" operator would take too long. We developed an efficient search structure that uses indexes, joins, and aggregations to produce an ordered resultset based on a keyword search.

Keywords table

The design of our Keywords table addresses the possibility of duplications and mixed case entries quite well. We install a uniqueness constraint on the Keyword column to prevent duplicate entries. To account for databases installed on case-sensitive servers, all keywords are forced to uppercase as they are inserted into the Keywords table. Following through with this convention, the corresponding stored procedures contain logic to convert search strings to uppercase.

Keyword joining tables

There is a many-to-many relationship between the Items table and the Keywords table. This means that an item can have more than one keyword and any keyword can refer to more than one item. The joining table is the physical representation of the many-to-many relationship. The joining table consists of two foreign keys that refer to the primary keys of the tables at either end of the data relationship. With nonclustered indexes installed on both foreign key columns, we have observed a beneficial tenfold decrease in the time it takes to retrieve an item record using a keyword to identify items that have that keyword. Observe the many-to-many relationship example in Figure 1.

Figure 1. Database schema showing a many-to-many relationship between Items and Keywords

In the Duwamish database, we want to record author, subject, and title keywords separately. The joining tables for these three book attributes could have been stored in a single joining table (as shown in the preceding example), but with a third column to identify the type of relation (author, subject, or title). However, these tables have the potential to contain more records than any other table in the database. In this case, conservation of space is important to consider. To decrease the demand for space that this data could incur, we eliminate the third column and divide the data into three identical two-column tables. These tables are ItemAuthorKeyword, ItemSubjectKeyword, and ItemTitleKeyword.

Producing an ordered resultset

The objective of any search is to use a customer-provided list of keywords to find records that contain matches for each of the specified keywords. Depending on the number of keywords supplied and the Boolean syntax used, many items may be returned, each matching some or all of the search keywords. To order the resultset in such a way that the best matches will appear at the top of the list, we have designed search stored procedures that follow these steps:

Find the primary key for each of the keywords specified by the customer.
Find all the records in the joining table where the foreign key of a keyword matches one of the primary keys found in step 1. Record the results into a temporary table.
Use the "count" aggregation with the SQL "group by" clause to create a resultset of items from the temporary table in step 2. The count will represent the number of keywords that are related to the item. Because there may be thousands of records returned, this resultset should be ordered by the result in the count column first and then by the item identifier column.

The GetItemsByTitleKeywords stored procedure uses the logic expressed in steps 1 through 3 just described. The stored procedure produces an ordered resultset of items that best match the comma-delimited list of title keywords specified by the user. This is accomplished by using two temporary tables. For each keyword specified, the identifier of the book that contains the keyword is inserted into the temporary table. When this action is completed, a "group by" clause is used to count the duplications in the first temporary table (this is step 3 in the preceding example). The result of this query containing the "group by" is saved to the second temporary table. The following SQL expression is a part of the GetItemsByTitleKeywords procedure that demonstrates how the duplications are counted in one temporary table and stored in another:

CREATE TABLE #ItemTitle2 (ItemId INT, MatchLevel INT)
INSERT INTO #ItemTitle2 (ItemId, MatchLevel)
  SELECT ItemId, COUNT(*)
    FROM #ItemTitle GROUP BY ItemId

Designing the English Query Definitions File

One of the search options for Phase 4 is to use natural language, or English Query. With this tool it is no longer necessary to teach the user the semantics of the SQL "where" clause or a similar rigid Boolean-style interface. Instead, the user simply uses a natural language question to find the content of interest.

Choosing database entities to expose

To show some of the flexibility of English Query, we allow the user to search for books based on author, title, subject, ISBN, publisher, and year of publication.

It is possible, but not necessary, to include all database objects in the English Query project file. English Query builds a selection statement based on database objects that are defined as entities or entity relationships in the English Query project file. So, only the catalog-related tables are incorporated into the English Query project file.

We could configure English Query to produce resultsets that contain all the columns that would be necessary to display to the user. Instead, we restrict the output of English Query to a single column of the primary keys that represent the books that meet the English Query search criteria. This way we can reuse existing business logic code to reformat the results of the query in a consistent manner. (For more information on our Business Logic Layer component see Steve Kirk's article "Migrating a Business Logic Component to a Web Application").

Configuring English Query entities and relations

The English Query engine translates the subject and objects in an English sentence into entities. Entities can be either tables or columns in the database. For example some of the entities we use in our Duwamish English Query project are "book," "author," and "ISBN." The way these entities are related to each other must be described by phrasings. These phrasings are expressed as "books have authors" and "authors write books." Once the entities are identified and the phrasings are complete, the project is compiled into what is called an English Query Domain (EQD) file. The English Query engine uses this compiled file to translate a natural language question into a valid SQL statement. For this reason, it is not necessary for English Query to have an active connection to the server that hosts the database the compiled file was modeled after.

Testing English Query entities and relations

To test each type of search, we developed a set of natural language questions that should return a selection from the sample data provided. The English Query development environment provides an interface to test or debug the phrasings. When a natural language question is executed, the English Query returns two types of results. The restatement is the English Query engine's interpretation of the user's question or statement. The other type of result is the SQL expression.

In the examples that follow, observe how the simple phrasings provide the context to understand the meaning of the test questions.

"What books are about both France and war?" English Query should identify both France and war as subjects and then cross-reference these with books. We can do this by creating the phrasings "books have subjects" and "subject names are the names of subjects." For this question and the one that follows, a preposition is used. Therefore, we need the phrasing "books are about, on subjects," which means that books are either about subjects or books are on subjects. The restatement for this question is "List the France and war books." Notice how the English Query engine was able to identify both "France" and "war" as elements of the entity "subjects."

"What books are on Presidents?" English Query should identify Presidents as a subject. Notice that the Keywords table contains strings that are only uppercase. When English Query recognizes a word as a subject, the word is converted into uppercase in the question rephrasing. In this case, "Presidents" is converted to "PRESIDENTS."

"Show books by Steinbeck." English Query should identify Steinbeck as part of an author's name. In the Duwamish database, the full author name is in one column (which is a good globalization strategy, because not all author names are formed the same way—some do not use the Roman alphabet nor a European name format). English Query could also recognize two columns such as FirstName and LastName as a name entity in a database where the names are stored in that manner.

"What books did Herman write?" English Query should identify Herman as part of an author's name. To describe the phrasings of this relationship, use both "books have authors" and "authors write books."

"Which books have a title that contains Mob?" English Query should interpret Mob as part of a word in a book's title. We should expect to get the book "Moby-Dick" back in our list of results. Use both of the phrasings "books have titles" and "titles are the names of books."

"What books were published in 1998?" English Query should identify 1998 as a year of publication. Use the phrasings "books are published" and "books are published in publication years."

"What books were published before 1988?" English Query should identify 1988 as a year. For this question, use the adjective phrasing "publication years indicate how new books are."

"What books are published by Trees to Mush?" In our sample database "Trees to Mush" is the name of a publisher. English Query should identify "Trees to Mush" as the name of a publisher. Use the phrasing "publisher names are the names of publishers" to identify the relationship between the publisher name as an attribute of a publisher. Then use the phrasings "books have publishers" and "publishers publish books" to define the relationship between books and publishers.

Once these questions are functional, the customer should be able to search for books using an assortment of criteria.

"Which books did DeTocqueville write about Revolution?" For this question, English query should identify DeTocqueville as an author and Revolution as a subject.

Designing Indexes

Once the stored procedures are designed and the English Query relationships are established, we identify the columns throughout the database that will participate in joins or column selections. Columns that participate in a join or column selection need to be optimized for a quick selection. Columns that will serve as a primary key or unique key must be defined before foreign keys can reference them. This is accomplished by creating the proper type of index on the column and applying constraints on the columns as the table is created.

Constraints

Primary keys and foreign keys are constraints that define the referential integrity of the system. Unique keys are imposed in places to support a business rule. In the Duwamish Books database these constraints are incorporated into the table declarations. This is known as declarative referential integrity (DRI). It is far easier to let SQL Server manage referential integrity with DRI than to design and use insert, update, and delete triggers for this same purpose.

Clustered index vs. nonclustered index

When a clustered index is established on a table, the leaf nodes of the index tree are the actual data pages. This means that when you make an unordered selection from the table, the data will appear in the order specified by the clustered index. Because the clustered index defines the way the server stores the data in a table, there can be only one clustered index per table. However, there is not much gain in performance to define a clustered index on a table unless the columns that define the clustered index contain generally repetitive values; if all the values are unique, you would be wasting both space and I/O time with a clustered index.

A nonclustered index points to the location of the data at the leaf nodes of the index. You can add numerous nonclustered indexes to a table.

To optimize a selection SQL Server identifies the columns involved in the where clause and the selection clause and creates an efficient hash table to organize the data as fast as possible. In the Duwamish database we identified columns that are known to participate in join clauses and some selections and applied a nonclustered index.

The performance gain we observed on our Pentium II test server was quite impressive. By adding a nonclustered index on both columns of the ItemTitleKeyword table, we were able to execute a query searching for seven keywords amongst 350,000 titles in 1.8 seconds, 40 seconds without the index applied.

Summary

A search engine for a Web client on a large database must be versatile, efficient, and fast. We have provided several avenues for search, including keyword search, optimized search by title and author, as well as natural language search. Using database normalization techniques, we have reduced wasteful duplications of data. In our most heavily used tables, we have found a way to minimize the size of the record and preserve functionality to conserve resources. Identifying the proper columns to install indexes on is a critical part of improving the speed of the database. By applying indexes to the columns in tables that participate in joins in stored procedures and in English Query phrasings we were able to realize more than tenfold increases in performance.

Resources

To learn more about English Query, including how to create phrasings for domain files, download the white paper "Developing with Microsoft English Query."

For specific technical information on database tuning, including index selection, see "Microsoft SQL Server 7.0 Performance Tuning Guide."

Comments? We welcome your feedback at duwamish@microsoft.com.