Query Optimization Techniques: Contrasting Various Optimizer Implementations with Microsoft SQL Server

Microsoft Corporation

Created: February 1992
"Related Readings" revised: February 1994

Overview

As companies began to rely more heavily on computerized business data, it became increasingly clear that the traditional file-based methods of storing and retrieving data were both inflexible and cumbersome to maintain. Because application code for accessing the data contained hard-coded pointers to the underlying data structures, a new report could take months to produce. Even minor changes were complicated and expensive to implement. In many cases, there was simply no method available for producing useful analysis of the data. These real business needs drove the relational database revolution.

The true power of a relational database resides in its ability to break the link between data access and the underlying data itself. Using a high-level access language such as SQL (structured query language), users can access all of their corporate data dynamically without any knowledge of how the underlying data is actually stored. To maintain both system performance and throughput, the relational database system must accept a diverse variety of user input queries and convert them to a format that efficiently accesses the stored data. This is the task of the query optimizer.

This technical article presents the steps involved in the query transformation process, discusses the various methods of query optimization currently being used, and describes the query optimization techniques employed by the Microsoft® relational database management system, SQL Server.

Query Transformation

Whenever a data manipulation language (DML) such as SQL is used to submit a query to a relational database management system (RDBMS), distinct process steps are invoked to transform the original query. Each of these steps must occur before the query can be processed by the RDBMS and a result set returned. This technical article deals solely with queries sent to RDBMS for the purpose of returning results; however, these steps are also used to handle DML statements that modify data and data definition language (DDL) statements that maintain objects within the RDBMS.

Although many texts on the subject of query processing disagree about how each process is differentiated, they do agree that certain distinct process steps must occur.

The Parsing Process

The parsing process has two functions:

These component parts are stored in an internal structure such as a graph or, more typically, a query tree. (This technical article focuses on a query tree structure.) A query tree is an internal representation of the component parts of the query that can be easily manipulated by the RDBMS. After this tree has been produced, the parsing process is complete.

The Standardization Process

Unlike a strictly hierarchical system, one of the great strengths of an RDBMS is its ability to accept high-level dynamic queries from users who have no knowledge of the underlying data structures. As a result, as individual queries become more complex, the system must be able to accept and resolve a large variety of combinational statements submitted for the purpose of retrieving the same data result set.

The purpose of the standardization process is to transform these queries into a useful format for optimization. The standardization process applies a set of tree manipulation rules to the query tree produced by the parsing process. Because these rules are independent of the underlying data values, they are correct for all operations. During this process, the RDBMS rearranges the query tree into a more standardized, canonical format. In many cases, it completely removes redundant syntax clauses. This standardization of the query tree produces a structure that can be used by the RDBMS query optimizer.

The Query Optimizer

The goal of the query optimizer is to produce an efficient execution plan for processing the query represented by a standardized, canonical query tree. Although an optimizer can theoretically find the "optimal" execution plan for any query tree, an optimizer really produces an acceptably efficient execution plan. This is because the possible number of table join combinations increases combinatorially as a query becomes more complex. Without using pruning techniques or other heuristical methods to limit the number of data combinations evaluated, the time it takes the query optimizer to arrive at the best query execution plan for a complex query can easily be longer than the time required to use the least efficient plan.

Various RDBMS implementations have used differing optimization techniques to obtain efficient execution plans. This section discusses some of these techniques.

Heuristic Optimization

Heuristic optimization is a rules-based method of producing an efficient query execution plan. Because the query output of the standardization process is represented as a canonical query tree, each node of the tree maps directly to a relational algebraic expression. The function of a heuristic query optimizer is to apply relational algebraic rules of equivalence to this expression tree and transform it into a more efficient representation. Using relational algebraic equivalence rules ensures that no necessary information is lost during the transformation of the tree.

These are the major steps involved in heuristic optimization:

  1. Break conjunctive selects into cascading selects.

  2. Move selects down the query tree to reduce the number of returned "tuples." ("Tuple" rhymes with "couple." In a database table (relation), a set of related values, one for each attribute (column). A tuple is stored as a row in a relational database management system. It is the analog of a record in a nonrelational file. [Definition from Microsoft Press Computer Dictionary, 1991.])

  3. Move projects down the query tree to eliminate the return of unnecessary attributes.

  4. Combine any Cartesian product operation followed by a select operation into a single join operation.

When these steps have been accomplished, the efficiency of a query can be further improved by rearranging the remaining select and join operations so that they are accomplished with the least amount of system overhead. Heuristic optimizers, however, do not attempt this further analysis of the query.

Syntactical Optimization

Syntactical optimization relies on the user's understanding of both the underlying database schema and the distribution of the data stored within the tables. All tables are joined in the original order specified by the user query. The optimizer attempts to improve the efficiency of these joins by identifying indexes that are useful for data retrieval. This type of optimization can be extremely efficient when accessing data in a relatively static environment. Using syntactical optimization, indexes can be created and tuned to improve the efficiency of a fixed set of queries. Problems occur with syntactical optimization whenever the underlying data is fairly dynamic. Query access schemas can be degraded over time, and it is up to the user to find a more efficient method of accessing the data. Another problem is that applications using embedded SQL to query dynamically changing data often need to be recompiled to improve their data access performance. Cost-based optimization was developed to resolve these problems.

Cost-Based Optimization

To perform cost-based optimization, an optimizer needs specific information about the stored data. This information is extremely system-dependent and can include information such as file size, file structure types, available primary and secondary indexes, and attribute selectivity (the percentage of tuples expected to be retrieved for a given equality selection). Because the goal of any optimization process is to retrieve the required information as efficiently as possible, a cost-based optimizer uses its knowledge of the underlying data and storage structures to assign an estimated cost in terms of numbers of tuples returned and, more importantly, physical disk I/O for each relational operation. By evaluating various orderings of the relational operations required to produce the result set, a cost-based optimizer then arrives at an execution plan based on a combination of operational orderings and data access methods that have the lowest estimated cost in terms of system overhead.

As mentioned earlier, the realistic goal of a cost-based optimizer is not to produce the "optimal" execution plan for retrieving the required data, but is to provide a reasonable execution plan. For complex queries, the cost estimate is based on the evaluation of a subset of all possible orderings and on statistical information that estimates the selectivity of each relational operation. These cost estimates can be only as accurate as the available statistical data. Due to the overhead of keeping this information current for data that can be altered dynamically, most relational database management systems maintain this information in system tables or catalogs that must be updated manually. The database system administrator must keep this information current so that a cost-based optimizer can accurately estimate the cost of various operations.

Semantic Optimization

Although not yet an implemented optimization technique, semantic optimization is currently the focus of considerable research. Semantic optimization operates on the premise that the optimizer has a basic understanding of the actual database schema. When a query is submitted, the optimizer uses its knowledge of system constraints to simplify or to ignore a particular query if it is guaranteed to return an empty result set. This technique holds great promise for providing even more improvements to query processing efficiency in future relational database systems.

The Microsoft SQL Server Query Optimizer

The Microsoft SQL Server database engine uses a cost-based query optimizer to automatically optimize data manipulation queries that are submitted using SQL. (A data manipulation query is any query that supports the WHERE or HAVING keywords in SQL; for example, SELECT, DELETE, and UPDATE.)

This optimization is accomplished in three phases:

Query Analysis

In the query analysis phase, the SQL Server optimizer looks at each clause represented by the canonical query tree and determines whether it can be optimized. SQL Server attempts to optimize clauses that limit a scan; for example, search or join clauses. However, not all valid SQL syntax can be broken into optimizable clauses, such as clauses containing the SQL relational operator <> (not equal). Because <> is an exclusive rather than an inclusive operator, the selectivity of the clause cannot be determined before scanning the entire underlying table. When a relational query contains non-optimizable clauses, the execution plan accesses these portions of the query using table scans. If the query tree contains any optimizable SQL syntax, the optimizer performs index selection for each of these clauses.

Index Selection

For each optimizable clause, the optimizer checks the database system tables to see if there is an associated index useful for accessing the data. An index is considered useful only if a prefix of the columns contained in the index exactly matches the columns in the clause of the query. This must be an exact match, because an index is built based on the column order presented at creation time. For a clustered index, the underlying data is also sorted based on this index column order. Attempting to use only a secondary column of an index to access data would be similar to attempting to use a phone book to look up all the entries with a particular first name: the ordering would be of little use because you would still have to check every row to find all of the qualifying entries. If a useful index exists for a clause, the optimizer then attempts to determine the clause's selectivity.

In the earlier discussion on cost-based optimization, it was stated that a cost-based optimizer produces cost estimates for a clause based on statistical information. This statistical information is used to estimate a clause's selectivity (the percentage of tuples in a table that are returned for the clause). Microsoft SQL Server stores this statistical information in a specialized data distribution page associated with each index.

This statistical information is updated only at the following two times:

To provide SQL Server with accurate statistics that reflect the actual tuple distribution of a populated table, the database system administrator must keep the statistical information for the table indexes reasonably current. If no statistical information is available for the index, a heuristic based on the relational operator of the clause is used to produce an estimate of selectivity.

Information about the selectivity of the clause and the type of available index is used to calculate a cost estimate for the clause. SQL Server estimates the amount of physical disk I/O that occurs if the index is used to retrieve the result set from the table. If this estimate is lower than the physical I/O cost of scanning the entire table, an access plan that employs the index is created.

Join Selection

When index selection is complete and all clauses have an associated processing cost based on their access plan, the optimizer performs join selection. Join selection is used to find an efficient order for combining the clause access plans. To accomplish this, the optimizer compares various orderings of the clauses and then selects the join plan with the lowest estimated processing costs in terms of physical disk I/O. Because the number of clause combinations can grow combinatorially as the complexity of a query increases, the SQL Server query optimizer uses tree pruning techniques to minimize the overhead associated with these comparisons. When this join selection phase is complete, the SQL Server query optimizer provides a cost-based query execution plan that takes advantage of available indexes when they are useful and accesses the underlying data in an order that minimizes system overhead and improves performance.

Summary

This technical article has shown you the steps required for a relational database management system to process a high-level query. It has discussed the need for query optimization and has shown several different methods of achieving query optimization. Finally, it has illustrated the various phases of optimization employed by the cost-based optimizer of the Microsoft RDBMS, SQL Server. We hope this document has helped you gain a better understanding of both the query optimization process and the Microsoft cost-based query optimizer, one of the many features that clearly define SQL Server as the premier database server for the PC environment.

Related Readings

Date, C. J. An Introduction to Database Systems. Volume I. Addison/Wesley, 1990, 455–473.

Elmasri, R., and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 1989, 501–532.

Moffatt, Christopher. "Designing Client-Server Applications for Enterprise Database Connectivity." MSDN Library, Technical Articles.

Moffatt, Christopher. "Microsoft SQL Server Network Integration Architecture." MSDN Library, Technical Articles.

"Microsoft Open Data Services: Application Sourcebook." MSDN Library, Technical Articles.

Narayanan, Suriya. "Maximizing Performance Using Binary Columns and Bitwise Operations in Microsoft SQL Server for Windows NT." MSDN Library, Technical Articles.

Shelly, D. B. "Understanding the Microsoft SQL Server Optimizer." Microsoft Networking Journal, Vol. 1, No. 1, January 1991.

Schroeder, Gary. "Backup and Recovery Guidelines for Microsoft SQL Server." MSDN Library, Technical Articles.

Yao, S. B. "Optimization of Query Evaluation Algorithms." ACM TODS, Vol. 4, No. 2, June 1979.

Additional Information

To receive more information about Microsoft SQL Server or to have other technical notes faxed to you, call Microsoft Developer Services Fax Request at (206) 635-2222.