Business Services Division
Microsoft Corporation
In a single-user database, users can read and modify data without worrying about other users changing data at the same time. In high-volume, multi-user databases such as Microsoft® SQL Server, hundreds or even thousands of users may be executing queries and transactions against the same data concurrently.
Two key roles of the multi-user database are: (1) to ensure that multiple concurrent transactions produce the same effect as they would executing in isolation, and (2) to coordinate changes to data in a way that is consistent with organizationwide policies. A critical issue of multi-user database design and implementation is how to control data integrity and concurrency so that these goals are met.
If a database management system fails to implement the appropriate data concurrency, integrity, and consistency features, users can experience significant problems, such as lost updates, inconsistent data, and inaccurate results.
Microsoft SQL Server version 6.0 actively enforces data integrity and concurrency in distributed multi-user environments through high performance locking, server-enforced data integrity, and advanced distributed database capabilities. This brief discusses the Microsoft SQL Server architecture and its benefits for customers.
A multi-user environment must have the means to prevent concurrent transactions from interfering with each other. Microsoft SQL Server 6.0 employs a fast, sophisticated, automatic, and transparent locking scheme that arbitrates user requests. The process of obtaining a lock is exceptionally fast—a matter of microseconds—because lock information is stored in a memory-resident table. Microsoft SQL Server employs multiple levels of locking (depending on the degree of exclusion required) and uses the least-restrictive lock needed to protect data during a given operation.
Microsoft SQL Server 6.0 automatically acquires the necessary locks whenever a SQL statement is executed, choosing the lock type and level appropriate to each operation.
Within Microsoft SQL Server, the efficient use of automatic locking prevents destructive interference from concurrent transactions. By default, changes made by SQL statements become visible to other transactions only after those changes are committed.
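Because lock information is held in a memory-resident table, the current locks can be inspected at any time. As a brief illustration, the sp_lock system procedure lists the locks held on the server (the object ID passed to object_name below is a placeholder):

```sql
-- List all locks currently held on the server
exec sp_lock

-- Translate a table ID from the sp_lock output into a table name;
-- run this in the database that owns the lock (the ID is a placeholder)
select object_name(16003088)
```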
Microsoft SQL Server provides both statement-level and transaction-level read consistency. This produces repeatable reads, an important feature for long-running transactions with multiple queries that need to see the same image of the data throughout.
Microsoft SQL Server allows applications to achieve the best balance between read consistency and concurrency by supporting configurable transaction isolation levels. Transaction isolation levels determine how transactions interact with one another. The following table (Table 1) describes those isolation levels.
Table 1. Descriptions of Isolation Levels
| Isolation level | Description |
| --- | --- |
| READ UNCOMMITTED | Directs Microsoft SQL Server not to issue shared locks and not to honor exclusive locks during reads. At this level, applications may read values from uncommitted transactions that are later rolled back ("dirty reads"), so it should be used only by transactions that are insensitive to this behavior. |
| READ COMMITTED | Directs Microsoft SQL Server to use shared locks while reading data (the default behavior). At this level, transactions cannot experience "dirty reads." |
| REPEATABLE READ / SERIALIZABLE | Guarantees that data read repeatedly within a transaction retains the same values and that "dirty reads," nonrepeatable reads, and "phantom" rows (rows inserted by other transactions that appear in later reads within the same transaction) cannot occur. In Microsoft SQL Server, REPEATABLE READ and SERIALIZABLE are interchangeable. |
The developer can optimize this behavior through configurable transaction isolation levels that can be set globally for a session or individually for each SQL SELECT statement.
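A minimal sketch of both mechanisms, using a hypothetical savings table: SET TRANSACTION ISOLATION LEVEL governs the session, while the HOLDLOCK keyword adjusts locking for a single SELECT.

```sql
-- Session-wide: make every read in this connection repeatable
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

BEGIN TRANSACTION
    SELECT balance FROM savings WHERE acct_id = 1234   -- first read
    SELECT balance FROM savings WHERE acct_id = 1234   -- sees the same value
COMMIT TRANSACTION

-- Statement-level: HOLDLOCK keeps shared locks until the end of the
-- transaction for this SELECT only, regardless of the session setting
SELECT balance FROM savings HOLDLOCK WHERE acct_id = 1234
```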
With databases now reaching hundreds of gigabytes in size, the database management system must be capable of scaling efficiently. Microsoft SQL Server 6.0 employs a scalable lock escalation strategy based on a percentage of table size, avoiding premature lock escalation for large tables. For example, when a transaction performs a scanned UPDATE, it can acquire a very large number of page locks, requiring escalation to a table-level lock at some point. Unlike systems with a fixed escalation threshold (usually 200 pages), which triggers premature escalation for large tables, Microsoft SQL Server 6.0 allows this threshold to be configured by the database administrator (DBA), either as a percentage of table size (20 percent, for example) or as a specific number of pages. As a result, Microsoft SQL Server 6.0 can employ the optimal locking strategy for a table or database of any size.
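The thresholds are set through sp_configure. A sketch, assuming the SQL Server 6.0 lock escalation option names ('LE threshold percent' and 'LE threshold maximum'); the values shown are illustrative:

```sql
-- Escalate to a table lock once a query holds page locks on
-- 20 percent of a table (0 would disable the percentage test)
exec sp_configure 'LE threshold percent', 20

-- Upper bound: always escalate past this many page locks
exec sp_configure 'LE threshold maximum', 500

-- Make the new values take effect
reconfigure
```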
A deadlock is a database Catch-22. It occurs when two (or more) transactions each hold a lock on a separate object and each needs a lock on the other transaction's object: the first transaction waits for the second to release its lock, while the second will not let go until the first releases its own. At that point, both transactions are at an impasse and cannot proceed. Deadlocks typically arise when several long-running transactions execute concurrently in the same database.
Microsoft SQL Server 6.0 automatically avoids the most common types of deadlocks by eliminating or greatly reducing the use of extent locks during updates, page splits, and index maintenance. If a deadlock does occur, Microsoft SQL Server detects it and automatically resolves it by rolling back the transaction whose termination breaks the deadlock for all other pending transactions (if there are more than two). The remaining transactions obtain the needed locks and are free to continue. The application whose transaction was interrupted can resubmit it later, when it is likely to proceed without incident.
Microsoft SQL Server also automatically prevents lock starvation. Lock starvation occurs when read transactions monopolize a table or page, forcing a write transaction to wait indefinitely. Microsoft SQL Server prevents lock starvation by coordinating table and page accesses for read and write transactions in an intelligent way, based on a first-come, first-served strategy.
In designing Microsoft SQL Server, Microsoft conducted extensive tests to devise an optimal locking scheme that results in superior overall performance. Page-level locking was selected because it effectively balances locking and deadlock detection overhead with transaction throughput for superior performance. Because all locks are managed in memory, lock acquisition is extremely fast. In the majority of business situations, page-level locking outperforms row-level locking, which is susceptible to excessive lock manager calls and frequent escalation to higher-level locks.
Oracle claims that row-level locking is superior to page-level locking because it provides higher concurrency. However, row-level locking implementations can suffer from significant drawbacks, including excessive lock-manager overhead and frequent escalation to higher-level locks.
The real problem is not locking, but concurrency; locking is simply a means to an end. Microsoft SQL Server 6.0 employs advanced database techniques to provide high levels of concurrency while preserving the performance characteristics of page-level locking. The most important of these techniques are described below.
Microsoft SQL Server 6.0 uniquely provides rich, scrollable, engine-based cursors with configurable concurrency modes, an implementation unavailable in other leading databases. The Microsoft SQL Server optimistic concurrency mode allows users to browse both forward and backward through database rows, and update data without locking any of the data being modified. Instead, Microsoft SQL Server uses sophisticated methods to detect whether the rows to be changed have been modified since the original retrieval. This protects applications against conflicting updates without locking large amounts of data, so decision support functions can continue to be performed.
Optimistic concurrency control works on the assumption that other transactions will not usually have changed the data being read. Because no locks are maintained on the current row(s) in the cursor, a high degree of concurrency is supported: other transactions can freely read or write the data. In the default ("optimistic") case, when an application performs an update through the cursor and the underlying row has not changed, the update proceeds normally. If the row has been changed, Microsoft SQL Server detects this and rejects the update, automatically refreshing the row buffers at the client with the new row information. The application then has the option of reissuing the update using explicit locking.
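A sketch of a scrollable engine-based cursor, using the pubs sample database shipped with Microsoft SQL Server; the assumption here is that the cursor's concurrency option is left at its optimistic default:

```sql
-- Declare a scrollable server cursor over the authors table
DECLARE author_cur SCROLL CURSOR
FOR SELECT au_id, au_lname, phone
    FROM authors
FOR UPDATE OF phone

OPEN author_cur
FETCH NEXT  FROM author_cur   -- browse forward
FETCH PRIOR FROM author_cur   -- browse backward

-- Positioned update through the cursor; under optimistic concurrency
-- the update is rejected if the row changed since it was fetched
UPDATE authors SET phone = '408 496-7223'
WHERE CURRENT OF author_cur

CLOSE author_cur
DEALLOCATE author_cur
```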
For row-at-a-time operations, the Microsoft SQL Server engine-based cursors with optimistic concurrency control provide the highest level of concurrency available for decision support and data warehousing applications.
Microsoft SQL Server 6.0 employs advanced strategies to improve the level of locking granularity and performance to provide greater concurrency, compared with other page locking systems. There is now minimal extent-locking during index creation or allocation of pages from an extent. Locking is eliminated entirely for databases that are in single-user mode or defined as read-only. Microsoft SQL Server 6.0 employs new update locks that prevent multiple writers from trying to acquire exclusive locks at the same time. This significantly reduces deadlocks and retries, resulting in higher system throughput. Also, the lock manager itself has been made parallel and re-entrant for better scalability on symmetric multiprocessing (SMP) platforms.
"Hot spotting" is a significant problem with traditional page-level locking designs. In database terms, a "hot spot" occurs when inserts become concentrated on the last page in the table, thus causing a bottleneck. This normally occurs when data is not ordered. Microsoft SQL Server uses clustered indexes to prevent this. In Microsoft SQL Server, clustered indexes store the data in index order by incorporating the data pages into the B-tree structure of the index itself. Clustered indexes offer significant benefits, such as reduced disk I/O and increased performance in retrieval-intensive applications. The desired data can typically be accessed in a single disk read, where standard SQL implementations require multiple reads. Clustered indexes prevent "hot spotting" because row insertions are automatically spread throughout the table, thus overcoming a weakness of other page locking designs. To minimize page splitting during inserts, Microsoft SQL Server allows the database administrator (DBA) to set a specific fill factor for a clustered index. The fill factor preserves and automatically maintains space in each page for new rows.
Perhaps the best illustration that Microsoft SQL Server 6.0 offers superior concurrency with high performance is the R/3 software system from SAP AG. R/3 is one of the most demanding client-server applications used by some of the world's largest organizations. Today, R/3 runs only on a handful of high-end relational database systems, including Microsoft SQL Server 6.0 and Oracle7.
SAP originally designed R/3 for row-level locking databases because of its demanding concurrency requirements. It is not supported on other page-level locking databases such as SYBASE System 10. However, R/3 runs very well on Microsoft SQL Server 6.0, taking advantage of key Microsoft SQL Server features such as scrollable cursors, optimistic concurrency control, and configurable transaction isolation levels.
Data in a database must be safeguarded for accuracy, consistency, and authorized access. Microsoft SQL Server actively protects the accuracy and consistency of data within demanding multi-user database environments. Microsoft SQL Server addresses this key area with its programmable server, which allows organizationwide data integrity, business rules, and transactions to be programmed into the server and shared among all client applications.
The Microsoft SQL Server programmable server enforces data integrity within the database itself, guaranteeing that complex business policies will be followed. Because all rules are stored with the data and the single control point for data consistency is on the server, the need to keep these rules in every front-end application is eliminated. When rules change or new rules are created, they are encoded in only one place—the Microsoft SQL Server database itself.
In addition, Microsoft SQL Server 6.0 provides unique support for distributed data integrity, which ensures the accuracy and consistency of data distributed across multiple servers. By ensuring the integrity of data that is replicated or partitioned across servers, Microsoft SQL Server makes it much easier to deploy enterprisewide applications in a distributed environment.
The Microsoft SQL Server programmable server has many advantages. Application development costs can be greatly reduced because server-enforced business rules and transactions can be shared across all applications. Maintenance costs can be significantly reduced because business rules and transactions only have to be modified in one place (the server) instead of in every application using them. Database reliability can be increased because data integrity is maintained independently from application programs. Database security can also be increased because rules and transactions, as well as database tables, are protected from unauthorized access.
Enforcing data integrity involves preserving the consistency and correctness of data by validating the contents of individual columns, verifying column values with respect to one another, validating data in one table against another table, and verifying that the database is successfully and accurately updated for each transaction. Data integrity falls into several distinct categories.
Microsoft SQL Server 6.0 enforces data integrity actively in the server using methods that fully conform to the ANSI SQL-92 standard. In addition, Microsoft SQL Server provides sophisticated data integrity extensions such as user-defined datatypes, defaults, rules, stored procedures, and triggers. This gives developers the maximum in flexibility and power. The table below (Table 2) lists the Microsoft SQL Server 6.0 data integrity options.
Table 2. Data Integrity Options
| Integrity type | Former options | Microsoft SQL Server 6.0 options |
| --- | --- | --- |
| Entity | Unique indexes | PRIMARY KEY constraint, UNIQUE constraint, IDENTITY property, unique indexes |
| Domain | Datatypes, defaults, rules | DEFAULT constraint, FOREIGN KEY constraint, CHECK constraint, NOT NULL constraint, datatypes, defaults, rules |
| Referential | Triggers | FOREIGN KEY constraint, CHECK constraint, triggers |
| User-defined | Rules, triggers, and stored procedures | All column- and table-level constraints in CREATE TABLE, WITH CHECK OPTION on views, rules, triggers, and stored procedures |
Protecting the consistency of all data in a single column prevents data entry errors. Microsoft SQL Server supports several features for enforcing single data field integrity.
Datatypes determine the kind of information that can be stored in the columns of database tables. A rich set of system-supplied datatypes is provided—including binary large object (BLOB) datatypes to support storage of multimedia objects in the database. Microsoft SQL Server also supports international languages with alternative sort orders and alternative date, money, and number formats.
Developers can also define their own datatypes to supplement the system-supplied datatypes. For example, us_state_type can be defined as a two-character datatype to store the legal abbreviations for each of the 50 states in the United States of America. The advantage of user-defined datatypes is that definitions of NULL values, defaults, and rules can be shared across many applications or tailored to specific applications. They also ensure consistency of data so tables can be joined together during transactions and queries.
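A sketch of the us_state_type example: sp_addtype registers the type, which can then be used like any system-supplied datatype (the branch_offices table is hypothetical):

```sql
-- Define a two-character, non-NULL datatype for state abbreviations
exec sp_addtype us_state_type, 'char(2)', 'NOT NULL'
go

-- Use the new datatype exactly like a system-supplied one
CREATE TABLE branch_offices
(
    office_id int NOT NULL,
    state     us_state_type
)
```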
Constraints conform to the ANSI standard for creating and altering tables. Constraints limit the possible values that users can enter into a column or columns. Multiple constraints can be associated with a single column, and a constraint can be associated with multiple columns. Constraints are defined when creating tables in SQL Enterprise Manager or with the CREATE TABLE statement. Microsoft SQL Server provides CHECK, DEFAULT, FOREIGN KEY, PRIMARY KEY, and UNIQUE constraints.
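As a sketch, four of the five constraint types can be declared directly in CREATE TABLE; the orders table and its columns are hypothetical, and the fifth type, FOREIGN KEY, appears in the referential integrity sketch later in this section:

```sql
CREATE TABLE orders
(
    ord_num  int      NOT NULL PRIMARY KEY,        -- entity integrity
    cust_id  int      NOT NULL,
    ord_date datetime NOT NULL DEFAULT getdate(),  -- value when none entered
    qty      smallint NOT NULL CHECK (qty > 0),    -- domain integrity
    po_num   char(10) NULL UNIQUE                  -- no duplicate PO numbers
)
```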
Microsoft SQL Server correctly distinguishes NULLs (non-entries) from any specific entry, including "0" for a number field or "blanks" for a character field.
Defaults allow developers to specify a value when no value is explicitly entered for a specific field. For example, if a user does not enter a value in the order_date field, Microsoft SQL Server can automatically insert the current date.
Rules specify the domain of valid values for a specific field. They can specify a set of values, a range of values, or a format. For example, the field us_state_code can be restricted to the legal abbreviation of each of the 50 states in the United States of America.
The definition of a rule can contain any expression that is valid in a SQL WHERE clause (except that it cannot refer to column or table names). It can include arithmetic operators, relational operators, LIKE, BETWEEN, and so on. The same rule can be bound to more than one column or user datatype, and applications automatically use any rules that have been defined. Rules can be changed or dropped at any time.
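A sketch of a default and a rule as named, reusable objects, bound with sp_bindefault and sp_bindrule; the state list is abbreviated, and the orders.ord_date column is the hypothetical one from the constraint sketch above:

```sql
-- A default supplying the current date when none is entered
CREATE DEFAULT order_date_dflt AS getdate()
go

-- A rule restricting a value to a set (list abbreviated here)
CREATE RULE us_state_rule AS
    @state IN ('AL', 'AK', 'AZ', 'CA', 'NY', 'WA')
go

-- Bind the default to a column and the rule to a user-defined datatype
exec sp_bindefault order_date_dflt, 'orders.ord_date'
exec sp_bindrule   us_state_rule,   'us_state_type'
```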
Microsoft SQL Server also supports features to enforce complex business rules, including referential integrity. Referential integrity is the process of preserving defined relationships between tables when you enter or delete records in those tables. Referential integrity maintains consistency between the primary and foreign keys in tables, as in "Do not delete customers with open orders." Complex business rules also involve relationships between multiple data fields—often in multiple database tables—that must be enforced across all applications and users.
Microsoft SQL Server 6.0 provides full support for declarative referential integrity (DRI) as defined by the ANSI SQL-92 standard. DRI allows the DBA to define data integrity restrictions for a table as well as the relationship between tables, both of which are enforced by Microsoft SQL Server automatically. DRI preserves the defined relationships between tables when records are inserted, updated, or deleted. DRI is based on FOREIGN KEY, PRIMARY KEY, or UNIQUE constraint relationships, and it ensures automatically that related key values are consistent.
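A sketch of DRI using the hypothetical customers and orders tables from the earlier examples; the FOREIGN KEY both blocks orders that reference a nonexistent customer and rejects deletion of any customer that still has orders:

```sql
CREATE TABLE customers
(
    cust_id int         NOT NULL PRIMARY KEY,
    name    varchar(40) NOT NULL,
    region  char(4)     NOT NULL
)
go

-- Declare the relationship on the existing orders table
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customers
    FOREIGN KEY (cust_id) REFERENCES customers (cust_id)
```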
Microsoft SQL Server 6.0 supports the WITH CHECK OPTION on views, which ensures that users query and modify only the data visible through the view. This option forces all data modification statements executed against the view to adhere to the criteria set within the SELECT statement defining the view. Combined with the appropriate security privileges, views WITH CHECK OPTION provide a powerful way to ensure that specific users update only appropriate subsets of data.
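A sketch of a view WITH CHECK OPTION over the hypothetical customers table; any INSERT or UPDATE through the view that would move a row outside the WEST region is rejected (west_staff is a hypothetical group):

```sql
CREATE VIEW west_customers AS
    SELECT cust_id, name, region
    FROM customers
    WHERE region = 'WEST'
WITH CHECK OPTION
go

-- Combined with privileges, users can be limited to this subset only
GRANT SELECT, UPDATE ON west_customers TO west_staff
```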
Stored procedures are collections of Transact-SQL or C language statements that are precompiled and stored at the server. (Microsoft SQL Server's unique extended stored procedures allow developers to create powerful compiled objects in C or C++ that can be invoked identically to normal Transact-SQL stored procedures. Because extended stored procedures are stored externally in dynamic-link libraries (DLLs), they can interact with a wide variety of external applications and data sources, such as e-mail systems or online information feeds.) Developers can code complex queries and transactions into stored procedures and then invoke them directly from client programs or other servers over the network. In addition to offering performance benefits, stored procedures can enforce organizationwide data integrity and business policies by ensuring that all updates occur in consistent and prescribed ways. Consider a funds transfer transaction in a bank: a developer could code the account credit and debit operations, together with a safety check that the source account holds at least the amount to be transferred, in a single stored procedure called transfer_funds, sketched below. The bank teller's application would invoke this stored procedure each time a customer wanted to transfer funds between accounts.
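A sketch of transfer_funds under these assumptions: a hypothetical accounts table with acct_id and balance columns, and HOLDLOCK used so the checked balance cannot change before the debit is applied:

```sql
CREATE PROCEDURE transfer_funds
    @from_acct int,
    @to_acct   int,
    @amount    money
AS
BEGIN TRANSACTION

-- Safety check: the source account must cover the transfer. HOLDLOCK
-- keeps the shared lock until the transaction ends, so the balance
-- cannot change between the check and the debit.
IF (SELECT balance FROM accounts HOLDLOCK
    WHERE acct_id = @from_acct) < @amount
BEGIN
    RAISERROR ('Insufficient funds for transfer.', 16, 1)
    ROLLBACK TRANSACTION
    RETURN
END

UPDATE accounts SET balance = balance - @amount WHERE acct_id = @from_acct
UPDATE accounts SET balance = balance + @amount WHERE acct_id = @to_acct

COMMIT TRANSACTION
go

-- The teller application simply invokes:
-- exec transfer_funds 1001, 2002, $500.00
```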
In combination with Microsoft SQL Server advanced security, stored procedures offer particularly powerful ways to enforce data integrity in mixed-application environments. For example, Microsoft SQL Server can prevent users from doing ad hoc UPDATE, INSERT, and DELETE operations on tables, yet allow these operations to occur when they are part of a predefined stored procedure. This allows the DBA to control exactly when and how data is changed for each user or group of users in the database.
Microsoft SQL Server centrally enforces complex business rules with a special type of stored procedure called a trigger. Whereas stored procedures must be explicitly called to be executed, triggers are automatically invoked, or triggered, by an attempt to insert, delete, or update a data field. By directly monitoring changes in the data, triggers implement "data-driven integrity" as opposed to "application-driven integrity."
Triggers prevent incorrect, unauthorized, or inconsistent changes to data by performing any number of actions whenever a specified change to a data object is attempted. Triggers can cascade a change throughout related tables in a database, roll back transactions, issue messages to an online user, and post alerts to the Windows NT® systemwide event log; triggers can also be nested.
A key advantage of triggers is that they're automatic: they work no matter what caused the data modifications—a clerk's entry, a report calculation, or a front-end application's computations. Triggers are stored in the Microsoft SQL Server data dictionary, eliminating the need for redundant integrity code in applications.
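A sketch of the "do not delete customers with open orders" policy from earlier, written as a trigger on the hypothetical customers table; deleted is the virtual table holding the rows the statement attempted to remove. (In practice, either this trigger or the FOREIGN KEY constraint shown earlier would enforce the rule, not both.)

```sql
CREATE TRIGGER protect_customers ON customers FOR DELETE
AS
IF EXISTS (SELECT *
           FROM orders, deleted
           WHERE orders.cust_id = deleted.cust_id)
BEGIN
    RAISERROR ('Cannot delete a customer with open orders.', 16, 1)
    ROLLBACK TRANSACTION
END
```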
Distributed data integrity maintains the consistency and accuracy of data that is replicated or partitioned across multiple servers, a key requirement for reliable distributed systems. Microsoft SQL Server 6.0 provides key features for ensuring distributed data integrity.
To guard against potential data inconsistency between multiple servers, Microsoft SQL Server performs distributed updates using a two-phase commit protocol that preserves the integrity of this type of transaction. Microsoft SQL Server was the first SQL database management system (DBMS) to implement this sophisticated capability. Using the Microsoft SQL Server programmable two-phase commit protocol, applications can update data in multiple servers within a single unit of work. If an update in any one server fails, the updates to all participating servers are automatically rolled back by the central coordinating server. This allows an organization to preserve the consistency of data across multiple servers when it is partitioned in a distributed environment.
Microsoft SQL Server's built-in data replication enables organizations to maintain system-managed copies of business-critical information across multiple server systems. By maintaining redundant copies of critical information, replication provides for high availability in case of system failure, ensuring continuous access to critical information even if a central computing facility is unavailable. Replication also can be used to maintain "warm" standby servers to support rapid disaster recovery.
Because Microsoft SQL Server replicates data under transaction control, the integrity and consistency of data is guaranteed. Microsoft SQL Server 6.0 ensures integrity for replicated data with automated synchronization and recovery capabilities. In addition, Microsoft SQL Server supports optional encryption of the data stream to provide the highest level of security between distributed locations.
In demanding distributed computing environments, a database server must ensure that the data being replicated between publishers and subscribers is accurate and consistent. Before data can be replicated effectively across servers, the servers must contain the same initial schema and data. The process of accomplishing this is called synchronization. Synchronization ensures that the table schema and the table data in the publisher and subscriber databases are identical and consistent with one another prior to enabling replication.
Unlike other database systems that require the data to be synchronized manually, Microsoft SQL Server 6.0 provides automatic synchronization. Microsoft SQL Server takes a snapshot of the table schema and data, then automatically transfers that snapshot between servers. Administrators can select from a variety of transfer methods depending on configuration.
Preserving the update integrity of replicated data across servers is a key requirement in replicated databases. Data at the subscribing server must be identical to the primary copy of the data at the publishing server. To ensure this, most organizations require that modifications to replicated data be made on the primary server (the "owner" of the data) and not on the subscribing servers (each of which receives exact copies of the publishing server's data). This policy prevents potential data loss and inconsistencies due to conflicts and lost updates.
Microsoft SQL Server 6.0 uniquely supports this key requirement with constraints for replication, an extension to ANSI SQL that allows use of CHECK constraints to protect the integrity of replicated data. This extension is used to protect horizontally partitioned tables that receive source data from another server against unauthorized data modifications to the replicated data. In this way, organizations can ensure that valid updates occur only on the server that "owns" the original copy of the data. Constraints for replication are applied to both "before" and "after" images of updated records to prevent records from being added to or deleted from the replicated range. All deletions and insertions are checked; if they fall within the replicated range, they will be rejected.
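A sketch of a constraint for replication, assuming the CHECK ... NOT FOR REPLICATION syntax and a hypothetical subscriber table that receives the WEST region from a publishing server:

```sql
-- Local users cannot insert, update, or delete rows in the replicated
-- WEST range; NOT FOR REPLICATION exempts the replication process
-- itself, so incoming replicated changes are applied normally.
CREATE TABLE sales_summary
(
    region  char(4) NOT NULL
            CHECK NOT FOR REPLICATION (region <> 'WEST'),
    ord_num int     NOT NULL,
    amount  money   NOT NULL
)
```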
Microsoft SQL Server 6.0 is the only DBMS to provide an automated constraint mechanism to enforce update integrity of replicated data; other systems either do not support this or require this logic to be coded into triggers, which can be a complex and error-prone approach.
Microsoft SQL Server 6.0 was engineered to meet the most demanding multi-user requirements for data integrity and concurrency, combining high-performance locking, configurable transaction isolation, server-enforced data integrity, and advanced distributed database capabilities.