PRB: Use of Clustered Index Improves Self-Join Performance

ID: Q64259

The information in this article applies to:

Microsoft SQL Server for OS/2, version 4.2

SYMPTOMS

We have a table search_list that is created using the following command:


   CREATE TABLE products..search_list
   (keyword                char(10) NOT NULL,
    catalog_number         char(7) NOT NULL)

Approximately 350,000 rows of data are downloaded into this table from an ASCII file using the SQL bulk copy utility. The following commands are then run to establish the file index:


   CREATE INDEX search_index ON search_list(keyword)
   UPDATE STATISTICS search_list

When we perform a text search, we want to find all catalog numbers that match every keyword provided for the search. An individual key search may produce up to 6000 matches.

Method 1, as follows, incorporates a self-join of the search_list table:


   SELECT s1.catalog_number FROM
           search_list s1, search_list s2, search_list s3
   WHERE   s1.keyword LIKE 'ultra%' AND
           s2.keyword LIKE 'film%' AND
           s3.keyword LIKE 'color%' AND
           s1.catalog_number = s2.catalog_number AND
           s2.catalog_number = s3.catalog_number

Method 2, as follows, uses SELECT INTO to form several intermediate tables, and has proven to be the most time efficient method:


   SELECT catalog_number INTO S1 WHERE keyword LIKE 'ultra%'
   SELECT catalog_number INTO S2 WHERE keyword LIKE 'color%'
   SELECT catalog_number INTO S3 WHERE keyword LIKE 'film%'

   SELECT S1.catalog_number FROM S1, S2, S3
   WHERE  S1.catalog_number = S2.catalog_number AND
          S2.catalog_number = S3.catalog_number

   DROP TABLE S1
   DROP TABLE S2
   DROP TABLE S3

When performing a single key search, the difference between the two methods is minimal. However, as the number of keys grows, or the number of matches per key increases, Method 2 is increasingly more efficient than Method 1.

Is there something in our system or data configuration that could cause the self-join approach to degrade?

WORKAROUND

To improve the performance of the self-join, drop the search_index index and re-create it as a clustered index as follows:


   CREATE CLUSTERED INDEX search_index ON search_list(keyword)
   UPDATE STATISTICS search_list

SQL Server will then use the index as opposed to a table scan when optimizing the Method 1 query. Please note that there may be only one clustered index per table. The clustered index ensures that the data is stored in the order of the specified keys (in this case, keyword order). For queries on larger tables, the optimizer will use a clustered index where applicable to reduce the number of rows that need to be searched. While this can markedly improve your query performance, there is an associated overhead involved when new data is inserted into or deleted from the table because the data must be stored in sorted order. Therefore, a clustered index is most useful for relatively static tables.

If SET SHOWPLAN ON is activated prior to execution, some of the internal steps taken by the Server in processing a query can be viewed. In this case, it showed that the index was not being used before it was changed to a clustered index.

Keywords : kbusage SSrvTran_SQL SSrvServer
Version : 4.2
Platform : OS/2
Issue type :