INF: Understanding the Microsoft SQL Server Optimizer

ID Number: Q71052

Version: 1.10

Operating System: OS/2

Summary:

The following information is reprinted with permission from The Cobb

Group's "Microsoft Networking Journal" (C), January 1991. For a sample

issue of the journal, call The Cobb Group at (800) 223-8720.

======================================================================

Understanding the Microsoft SQL Server Optimizer

======================================================================

One of the most important features of Microsoft's new SQL Server 1.1

database engine is its ability to automatically optimize data

manipulation queries. A data manipulation query is any type of query

that supports the WHERE or HAVING keywords in SQL (Structured Query

Language), such as SELECT, DELETE, and UPDATE. The Microsoft SQL

Server query optimizer finds the most cost-efficient query execution

plan for manipulating the data. The optimizer must also complete the

optimization process itself within an acceptable length of time. It

does this in three phases:

o Query Analysis - breaks the query down into different types of

clauses.

o Index Selection - finds which indexes are useful for each of the

different clauses, and determines whether they are cost-effective

enough to use.

o Join Selection - looks at all possible plans, estimates their

associated costs, and saves the most cost-effective plan for

execution.

To have a clear understanding of how the optimization process for SQL

Server actually works, we need to look at each of these phases in

detail.

Query Analysis

--------------

Query analysis breaks down the initial query that has been input by

looking for clauses within the query that it can optimize, such as the

search clause

column_1 = 45

or the join clause

table1.column_2 > table2.column_4

SQL Server can optimize clauses that limit a scan. Several relational

operators produce perfectly valid SQL queries but yield clauses that

cannot be optimized. In these cases, the

optimizer uses indexes for the portions of the query that it can break

into optimizable clauses, but SQL Server will execute the

non-optimizable portions using table scans.

Operators that are difficult to optimize include NOT = and OR. Because

NOT = is an exclusive rather than an inclusive relational operator, it

cannot be used to limit a scan. The entire table must be scanned to

determine which tuples qualify. The query analyzer cannot optimize any

clause that includes NOT =. If you use OR to join conditions for

columns in different tables, the optimizer cannot logically use it to

limit a scan and therefore cannot accomplish optimization.
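
For illustration, the following two queries reuse the sample table and
column names from the clauses above. The first contains an inclusive
search clause that the optimizer can consider during index selection;
the second contains a NOT = clause (written != or <> in Transact-SQL)
that forces a table scan:

   select *
   from table1
   where column_1 = 45    /* inclusive clause; an index can limit the scan */

   select *
   from table1
   where column_1 != 45   /* exclusive clause; every row must be examined  */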

Index Selection

---------------

If an input query contains any optimizable SQL syntax, the query is

broken into its component clauses, and the optimizer then performs

index selection for the optimizable clauses. Before SQL Server will

use a table index for an associated clause, the index must pass three

tests during index selection.

Is the index useful?

First, the index must be useful. This implies that there is a

description of the index in the database system tables, and that the

columns in the clause match a prefix of the columns in the index

itself. In order to better understand this, let's use the example of a

phone book. Let's assume you used SQL syntax to create an index on a

phone book table called phone_book that contained, among other

columns, last_name and first_name for every table entry:

create index name_index

on phone_book (last_name, first_name)

This index would be useful for performing a selection that covered the

entire index, i.e., a selection including both last_name and

first_name. It also would be useful for a selection that matches a

prefix of the index, i.e., a selection including just last_name, such

as looking up all of the people with the last name of "Smith". It

would not, however, be useful for a selection that did not match a

prefix of the index, such as a selection on first_name. In other

words, this index would not help you find all the people in the phone

book with the first name of "George". You would still need to start at

the beginning of the book and read each entry to determine if it met

the qualifications. The obvious point here is that by analyzing the

types of queries that are anticipated on a particular table and

providing useful indexes, a developer can help SQL Server improve

performance and efficiently optimize more ad hoc queries.
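
As a sketch of this prefix rule, consider the following selections
against phone_book. The first two match a prefix of name_index and can
use it; the third does not, so the index cannot limit the scan:

   select * from phone_book
   where last_name = "Smith" and first_name = "George"  /* full index match */

   select * from phone_book
   where last_name = "Smith"                            /* prefix match     */

   select * from phone_book
   where first_name = "George"                          /* no prefix; scan  */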

How selective is the clause?

If the optimizer finds a useful index for a query clause, index

selection can satisfy the second test: determining the selectivity of

the clause. Selectivity is an estimate of the percentage of the tuples

in a table that will be returned for the clause. This is one of the

most useful, and often underused, features of SQL Server. Each index

has an associated distribution page that stores index statistics.

These statistics can provide a good estimate of the selectivity of the

clause, but only if the information contained on the index

distribution page is accurate and up-to-date for the table.

If a developer establishes an index when creating a table -- before

any data has been entered -- there will be no associated distribution

page created for the index. Simply entering data into the table will

not cause SQL Server to create statistical data. Because no

statistical information is available, the optimizer will use

standardized selectivity values based on the type of relational

operator used in the clause. While these estimated selectivity values

will help the optimizer decide whether or not to use the associated

indexes for a clause, they are average values that may not reflect the

actual distribution of the table. In order to provide SQL Server with

accurate statistics that reflect the actual tuple distribution of a

populated table, the developer or system administrator needs to update

the statistical information for the table by executing the SQL Server

UPDATE STATISTICS command. For our phone_book example, this would be

accomplished by executing either

update statistics phone_book name_index

to update the statistics for the single index name_index, or

update statistics phone_book

to update the statistical information for every index associated with

the table phone_book. Once the developer or administrator establishes

statistics for a particular table, the optimizer will use them to

determine selectivity for clauses.

For dynamic tables where data distribution is changed through

insertions, deletions, and updates, the statistical information may

become invalid over time. The system administrator is responsible for

understanding the uses of the various objects in the database and

executing the UPDATE STATISTICS command as appropriate to keep the

statistical information current for the table. The amount and type of

activity on each object will determine how often this needs to be

done.

How much does it cost?

Index selection will use the estimated selectivity of each clause and

the type of index that is available to calculate the cost in terms of

page accesses for the clause. This is the final test in index

selection. If the estimated cost of using an available index for the

clause is less than a straight table scan of the data, the optimizer

will use the index. This is another area where analysis of the

anticipated queries on a table can allow the developer to assist the

SQL Server optimizer.

A standard non-clustered index, at the leaf level, contains a series

of pointers to the associated data pages in the table that stores the

rows. Because there is no relationship between a non-clustered index

and the order in which the data is stored in the table, the cost of

using a non-clustered index includes an additional page access for

each qualifying row. As a result of this associated page access

overhead, when the selectivity of a clause reaches roughly 10 to 15

percent of the rows in a smaller table, it is less expensive for

the optimizer to execute a table scan than to use the index to access

the data. There are, however, several index characteristics a

developer should consider that can improve the performance of indexed

data retrieval and potentially improve throughput for data

manipulation queries.
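
As a hypothetical example, a non-clustered index could be created on
an assumed area_code column of phone_book (the nonclustered keyword is
optional, since non-clustered is the default):

   create nonclustered index area_index
   on phone_book (area_code)

If a clause such as area_code = "502" qualifies more than roughly 10
to 15 percent of the rows in a small table, the optimizer may find the
extra page access per qualifying row more expensive than simply
scanning the table.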

If a non-clustered index covers a query (i.e., both the selection and

conditional column information are part of the created index), there

will not be an additional page access cost associated with the use of

the index. In our phone_book example above, the name_index index would

cover the query

select last_name

from phone_book

where first_name = "George"

and therefore would perform more efficiently for this query and

improve the data retrieval performance.

Each table may have one clustered index. Because there is a one-to-one

correspondence between a clustered index and the order in which data

is stored in the table (in fact, the leaf level of a clustered index

maps directly to the stored data pages), there is no additional page

access cost associated with this index for each qualifying row. For

tables where a clustered index is appropriate, this can again improve

data manipulation query performance. A clustered index forces the

table to store tuples in sorted order, so there is both sorting and

page splitting overhead associated with inserts, deletes, and updates

that may make this type of index inappropriate for either extremely

dynamic or extremely large tables that are subject to modification.

Expansion overhead for a clustered index can be reduced using SQL

Server's FILLFACTOR feature. Using FILLFACTOR reserves a percentage of

free space on each data page of the index to accommodate future

expansion. This is a tuning issue that the developer and system

administrator will need to address when considering the use of

clustered indexes.
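
A minimal sketch of such an index on phone_book follows; the index
name and the 70 percent fill factor are arbitrary values chosen for
illustration:

   create clustered index name_cluster
   on phone_book (last_name, first_name)
   with fillfactor = 70

Filling each page only 70 percent full at creation time leaves room
for new rows, which can postpone the page splits described above on a
table that receives regular inserts.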

Finally, if a unique index is created for the table so that no

duplicate index values can be inserted, there is a special case that

reduces processing costs. If all of the keys in the WHERE clause use

the equality relational operator and are contained in the unique

index, the optimizer knows that this query can match only one row in

the table. In this case there is a minimal fixed cost involved in

using the unique index to process the query.
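
A brief sketch of this case, assuming a hypothetical phone_number
column that uniquely identifies each entry, might look like this:

   create unique index phone_index
   on phone_book (phone_number)

   select last_name, first_name
   from phone_book
   where phone_number = "555-1234"   /* equality on the entire unique key */

Because the WHERE clause uses the equality operator on every column of
the unique index, the optimizer knows that at most one row can qualify
and assigns the query a minimal fixed cost.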

Join Selection

--------------

Once all of the clauses have been processed for index selection and

have an associated processing time, the optimizer is ready to perform

join selection. The optimizer compares all combinations of the

clauses, then selects the join plan with the lowest estimated

processing costs. Because the combinations of the clauses can grow

exponentially as the complexity of a query increases, SQL Server uses

tree pruning techniques to minimize the overhead associated with these

comparisons.
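
As an illustration, consider a two-table join between phone_book and a
hypothetical city_directory table (the zip_code and city columns are
assumed for this example):

   select p.last_name, c.city
   from phone_book p, city_directory c
   where p.last_name = "Smith"
     and p.zip_code = c.zip_code

Join selection estimates the cost of each join order (for example,
restricting phone_book to the "Smith" rows and then finding each
matching zip_code in city_directory, or the reverse) and keeps the
cheaper plan for execution.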

Using Tuning Tools

------------------

SQL Server includes several tuning tools that allow users to view this

internal optimization process. After executing the command

set showplan on

all queries sent to SQL Server will return not only the expected

results, but also a description of the query execution plan the

optimizer used to produce these results. For each step of the

execution plan, this description shows whether the optimizer decided

to use an existing index or a table scan to retrieve the results.

After SQL Server executes the command

set statistics time on

it displays the time taken to parse and compile each statement, along

with the time that it took to execute each step in the query execution

plan. These two commands used together can help users modify their

queries and index structures to further improve the performance of the

SQL Server database engine.
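
A short session sketch follows; it assumes an isql-style tool with
"go" as the batch terminator and reuses the phone_book example from
above:

   set showplan on
   go
   set statistics time on
   go
   select last_name, first_name
   from phone_book
   where last_name = "Smith"
   go

The output for the select then includes the execution plan the
optimizer chose (for example, whether it used name_index or a table
scan) along with the parse, compile, and execution times for the
statement.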

As you can see from this discussion of SQL Server optimization, there

are no hard and fast rules -- such as "when the number of rows exceeds

100" -- that you can use to determine when SQL Server will take

advantage of indexes to assist in query processing. However, by

understanding the underlying processes of the optimizer and analyzing

the types of queries being performed on the tables, you can create

specific indexes for your data tables that will help your SQL Server

optimizer improve system throughput.