Clustered Index Selection

Clustered index selection involves two major steps: First, determine the column of the table that will benefit most from the clustered index by providing sequential I/O for range scans, and second, use the clustered index to affect the physical placement of table data while avoiding hot spots. A hot spot occurs when data is placed on hard disk drives so that, as a result, many queries try to read or write data in the same area of the disk(s) at the same time. This creates a disk I/O bottleneck because more concurrent disk I/O requests are received by the hard disk than it can handle. The solution is either to stop fetching as much data from this disk or to spread the data across multiple disks to support the I/O demand. This consideration for the physical placement of data can be critical for good concurrent access to data among hundreds or thousands of SQL Server users.

These two decisions often conflict with one another, and the best decision is to balance the two. In high user-load environments, improved concurrency (by avoiding hot spots) can be more valuable than the performance benefit gained by placing the clustered index on that column.

With SQL Server 7.0, nonclustered indexes use the clustered index to locate data rows if there is a clustered index present on the table. Because all nonclustered indexes must hold the clustered keys within their B-tree structures, it is better for performance to keep the overall byte size of the clustered index keys as small as possible. Keep the number of columns in the clustered index to a minimum and carefully consider the byte size of each of the columns chosen to be included in a clustered index. This helps reduce the size of the clustered index and subsequently, all nonclustered indexes on a table. Smaller index B-tree structures can be read more quickly and help improve performance. For more information, see SQL Server Books Online.

In earlier versions of SQL Server, tables without clustered indexes insert rows at the end of the table stored on the disk. This can create a hot spot at the end of a busy table. The SQL Server 7.0 storage management algorithms provide free space management that removes this behavior. When rows are inserted in heaps, SQL Server uses the Page Free Space (PFS) pages to quickly locate available free space in the table in which the row is inserted. PFS pages find free space throughout the table, which recovers deleted space and avoids insertion hot spots. Free space management affects clustered index selection. Because clustered indexes affect physical data placement, hot spots can occur when a clustered index physically sequences based on a column in which many concurrent inserts occur at the highest column value and are at the same physical disk location. For columns with monotonically increasing values, a clustered index sequentially orders data rows on disk by that column. By placing the clustered index on another column or by not including a clustered index on the table, this sequential data placement moves to another column or does not occur at all.

Another way to think about hot spots is within the context of selects. If many users select data with key values that are very close to but are not in the same rows, most disk I/O activity occurs within the same physical region of the disk I/O subsystem. This disk I/O activity can be spread out more evenly by defining the clustered index for this table on a column that spreads these key values evenly across the disk. If all selects are using the same unique key value, then using a clustered index does not help balance the disk I/O activity of this table. By using RAID (either hardware or software), you can alleviate this problem by spreading the I/O across many disk drives. This behavior can be described as disk access contention. It is not locking contention.

Clustered Index Selection Scenario

A scenario can illustrate clustered index selection. For example, a table contains an invoice date column, a unique invoice number column, and other data. About 10,000 new records are inserted into this table every day, and the SQL queries often search this table for all records for one week of data. Many users have concurrent access to this table. The invoice number is not a candidate for the clustered index. The invoice number is unique, and users do not usually search on ranges of invoice numbers; therefore, placing invoice numbers physically in sequential order on disk is not appropriate. Next, the values for invoice number increase monotonically (1001,1002,1003, and so on). If the clustered index is placed on invoice number, inserts of new rows into this table occur at the end of the table beside the highest invoice number on the same physical disk location, and create a hot spot.

Consider the invoice date column. To maximize sequential I/O, the invoice date column is a candidate for a clustered index because users often search for one week of data (about 70,000 rows). But for concurrency, the invoice date column may not be a candidate for the clustered index. If the clustered index is placed on an invoice date, all data tends to be inserted at the end of the table, and a hot spot can occur on the hard disk that holds the end of the table. The insertions at the end of the table are offset by the 10,000 rows that are inserted for the same date, therefore, invoice date is less likely to create a hot spot than invoice number. Also, a hardware RAID controller helps spread out the 10,000 rows across multiple disks, which can minimize the possibility of an insertion hot spot.

There is no perfect answer to this scenario. You can place the clustered index on invoice date to speed up queries involving invoice date ranges, even at the risk of hot spots. In this case, you should monitor disk queuing on the disks associated with this table for possible hot spots. It is recommended that you define the clustered index on invoice date because of the benefit to range scans based on invoice date and so that invoice numbers are not physically sequential on disk.

In this example, a table consists of the invoice number, invoice date, invoice amount, sales office where the sale originated, and other data. Suppose 10,000 records are inserted into this table every day, and users often query invoice amounts based on sales office. Sales office should be the column on which the clustered index is created because that is the range on which scans are based. Newly inserted rows will have a mix of sales offices; inserts should be spread evenly across the table and across the disks on which the table is located.

In some cases, range scans may not be an issue. For example, a very large employee table has employee number, social security number, and other data. As rows are inserted, employee number is incremented. There are 100,000 retrievals from this table every day and each retrieval is a single record fetch based on social security number. A nonclustered index created on social security number provides excellent query performance in this scenario. A clustered index on social security number provides slightly better query performance than the nonclustered index, but may be excessive because range scans are not involved. If there will be only one index on this table, place the clustered index on the social security number column. The question then is whether to define a clustered index on this table. In earlier versions of SQL Server, it is important to define a clustered index on a table even if it is not required for queries, because it helps with deleted row space recovery. This is not an issue with the SQL Server 7.0 space allocation algorithms and storage structures.

The recommendation in this example is to create the clustered index on social security number, because the social security number has data distributed so it does not follow the sequential pattern of employee number, and social security number tends to have an even distribution. If a clustered index is created on this evenly distributed column data, the employee records are evenly distributed on disk. This distribution, in conjunction with FILLFACTOR and PAD_INDEX, provides open data page areas throughout the table to insert data. Assuming that newly inserted employee records have an even distribution of social security numbers, the employee table fills evenly and page splitting is avoided. If a column with even distribution does not exist on the table, it is worthwhile to create an integer column on the table and populate the column with values that are evenly distributed and then create the clustered index column. This "filler" or "dummy" column with a clustered index defined on it is not being used to query, but to distribute data I/O across disk drives evenly to improve table access concurrency and overall I/O performance. This can be an effective methodology with large and heavily accessed SQL tables.

Another possible solution in this example is to refrain from creating a clustered index on this table. In this case, SQL Server 7.0 manages all aspects of space management. SQL Server finds a free space to insert the row, reuses space from deleted rows, and automatically reorganizes physical ordering of data pages on disk when it makes sense (to allow greater amounts of sequential I/O). The reorganization of data pages happens during database file autoshrink operations. For more information, see SQL Server Books Online.