To Sequence or Not To Sequence

Vic Bachulis

In this article, Vic explores the art of selecting keys and offers surrogate keys as an alternative to traditional column-based keys.

As it was once so eloquently stated by SQL guru Chris Date, a key value should represent "the key, the whole key, and nothing but the key." As a database designer, you're confronted with a million little details that can ultimately seriously affect your system's (or your client's) performance and functionality at every turn. Nowhere is this more apparent than in the design of your table keys. Relational databases are wholly predicated on the concept of keys and key values, but how can you best implement them in your own database design?

There tend to be two major schools of thought in regards to database keys. The first believes that keys should be made up of those attributes of an entity that guarantee uniqueness for each row. The other school believes that keys should be made up of integers and treated as an abstraction layer that has nothing to do with the data and everything to do with the data's integrity. Stated another way, the second school believes in surrogate keys. Both schools of thought are legitimate, although the surrogate key school offers some flexibilities that the first doesn't.

As an example, suppose you have the Social Security Number field defined as your primary key. [See Mike Yocca's warnings about SSNs, "SQL and SSNs: Think Again," in the June 1997 issue. -- Ed.] Ordinarily, this might be a good choice. It can be defined as an integer, it's supposed to be unique, it could be a great clustered value for searches, and it's easy to work with. But what happens if a mistake was made when a value was entered? And what if that value has been propagated to a half-dozen tables? Ordinarily, you'd just update the incorrect information. The problems can start, however, with cascading updates and referential integrity (RI). On a really complex system, you could get yourself boxed in, where dependencies between tables cause major update issues -- that is, incorrect RI. Theoretically, these should have all been tested and worked out, but bugs happen. You might also run into concurrency issues on a busy system when you have complicated relationships based on inherited key values and you attempt to update those keys, even though you should never update your table keys!

Now, let's say you'd used surrogate key IDs instead. If the SSN isn't the key value, but just an attribute of a record, all you need to do is change a single record in a single table. You can still enforce uniqueness by creating an alternate key on the SSN, but you no longer have to wrestle with RI and concurrency issues. You need only update the problem record in one location -- assuming, of course, that it lives in one place in the database. This abstraction away from the data can be a compelling rationale for using a surrogate key. You achieve the same functionality as you would if you used user data like an SSN for your key, but your system isn't dependent on such user data. You've essentially hidden the keys to the data from the user, which I think is a good thing. We need to have keys, but unless they're used for querying purposes, I don't believe users should need to know about them.

To facilitate the use of surrogate keys, Microsoft has given developers a powerful tool in the form of identity columns. When a row is inserted into a table that has a defined identity column, SQL Server alone is responsible for generating the next ID value -- simple, efficient, and quick. The problem with identity columns is that they aren't a good cross-platform solution and won't work with other databases that don't support identity columns in the same way that Microsoft does. You can also run into concurrency issues on long-running transactions that insert data into tables that are defined with identity columns.

Sequence numbers
A common approach to handling surrogate key generation is through the use of a sequence number table. This approach offers a high degree of flexibility, good scalability, high concurrency when used properly, and relatively good ease of implementation and use. The trade-off is that it does require more code for managing and handling the IDs that are generated. When a unique ID is required, you'll invoke a single procedure and capture the returned sequence number, which is then used in your INSERT statement, as in the following example:

 DECLARE @NextID Int 
 EXEC @NextID = sp_GetNextSequenceNumber @TableName  
 = 'Customer' 
 INSERT INTO CUSTOMER (CustomerID,...)  
    VALUES (@NextID,...)

A typical and all-too-common implementation of a sequence number table is shown in Table 1. When a unique ID is asked for by the calling procedure, the GET routine for the ID is passed an @TableName variable, which is then used to locate, retrieve, and update the correct record in the sequence number table.

Table_Name Sequence_Number

Customer 005

Table 1: Incorrectly designed sequence number table.

The major problem with this design of a sequence number table is page level contention. With the advent of SQL Server 7.0, this won't be so much of an issue, but for versions 4.2 to 6.5, it is. The easiest way to manage this issue is through the use of pseudo-row level locking. The ultimate goal is to have a single row per database page. This is achieved by adding "pad" columns that, when sized accordingly and summed with all the other column sizes in the sequence number table, leave exactly one row per page, as seen in Table 2.

Table_Name Sequence_Number Pad1 Pad2 Padn

Customer 005 SPACE(255) SPACE(255) SPACE(255)

Table 2: Revised sequence number table.

SQL Server 6.5 supports 1,962 bytes of data per row, excluding text and image columns. Given this fact, you'd have the following column lengths and table structure. When you insert your data into your sequence number table, you'll use SPACE(255) to fill out your pad fields. All table fields are declared as NOT NULL:

Column Datatype Comment --------------- --------- ---------------------------- Table_Name Char(30) 30 bytes-max tablename size Sequence_Number Int Four bytes for Int Pad1 - Pad7 Char(255) 255*7+34 = 1785 bytes so far Pad8 Char(141) 1962 bytes = one page INSERT INTO SEQUENCE_NUMBER (Table_Name, Sequence_Number, Pad1,...) VALUES ('Customer',005,SPACE(255)...)

When you execute sp_spaceused on the Sequence_Number table after you've inserted a few rows of data, you should see something like the following. The output has been abbreviated to only the relevant columns:

name rows data index_size -------------------- ------ ----- ---------- Sequence_Number 3 6 KB 4 KB

The final key to this structure is to create a clustered index on the Table_Name field. (If you don't do this, SQL will place exclusive table locks on the table, as shown in the sp_lock output that follows.) In this example, I ran sp_lock immediately after two BEGIN TRANSACTION commands (same SPID) that attempted to get the next ID for two different table names. No clustered index exists for the table:

spid locktype table_id page dbname ------ ------------- ----------- ------ ---------------- 19 Sh_intent 688005482 0 master 19 Ex_table 2019538278 0 Customer_Service 19 Ex_extent 0 368 tempdb

With the clustered index in place, you should observe something similar to the (abbreviated) output here:

spid locktype table_id page dbname ------ ------------- ----------- ------ --------------- 19 Ex_intent 2019538278 0 Customer_Service 19 Ex_page 2019538278 169401 Customer_Service 19 Ex_page 2019538278 169402 Customer_Service 19 Update_page 2019538278 169401 Customer_Service 19 Update_page 2019538278 169402 Customer_Service

(In SQL 7.0 [beta 3], it's no longer necessary to create pad columns -- creating your Table_Name and Sequence_Number columns in conjunction with a clustered index on the table name is sufficient to achieve the same behavior.)

Locking
The other thing you must consider when using a sequence number table is transaction control and transaction scope. To facilitate the highest possible throughput on such a critical table, you must minimize the locking executed against the table. For two-tier applications, you must keep your code that gets the next unique ID out of the BEGIN/COMMIT transaction pair, because otherwise, you're effectively tying up a page in the sequence number table while executing a transaction on an unrelated table or tables! This is precisely what will occur on a table with an identity column. While one transaction is running, all other transactions will wait for the completion of the blocking transaction before they can insert a row into the table with the identity column. For three-tier apps, running under Microsoft Transaction Server (MTS), for example, you'll need to get your IDs immediately prior to beginning an MTS transaction. This is extra work, but the payoff will be seen in the transactions per second (tps) throughput.

The final twist to using sequence number tables is the case where there's extreme contention for a single table value ID. In some cases, it might be desirable to actually split out a single sequence number record into its own table, apart from the other sequence number table records. By making it a standalone table, you can further isolate a specific sequence number from all the "roar and din" created by updating all the other sequence number records in a single table. This table could also be placed on a little-used database device, thus further reducing contention issues for really extreme cases.

In this article, I've presented an alternative key-generation method and outlined some of the major pitfalls to avoid. When making decisions about keys in your own database, it might pay to consider this alternative method rather than just blindly using an identity column.

Vic Bachulis is the president and owner of Irvine, CA-based Rent-a-DBA, Inc. and specializes in SQL Server tuning, operations, and programming. www.rentadba.com. vicb@rentadba.com.

Table_Name	Sequence_Number	Pad1	Pad2	Padn
Customer	005	SPACE(255)	SPACE(255)	SPACE(255)