ID Number: Q71052
1.10
OS/2
Summary:
The following information is reprinted with permission from The Cobb
Group's "Microsoft Networking Journal" ©, January 1991. For a sample
issue of the journal, call The Cobb Group at (800) 223-8720.
======================================================================
Understanding the Microsoft SQL Server Optimizer
======================================================================
One of the most important features of Microsoft's new SQL Server 1.1
database engine is its ability to automatically optimize data
manipulation queries. A data manipulation query is any type of query
that supports the WHERE or HAVING keywords in SQL (Structured Query
Language), such as SELECT, DELETE, and UPDATE. The Microsoft SQL
Server query optimizer finds the most cost-efficient query execution
plan for manipulating the data. The optimizer also must accomplish the
optimization process itself within an acceptable length of time. It
does this in three phases:
o Query Analysis - breaks the query down into different types of
clauses.
o Index Selection - finds which indexes are useful for each of the
different clauses, and determines whether they are cost-effective
enough to use.
o Join Selection - looks at all possible plans, estimates their
associated costs, and saves the most cost-effective plan for
execution.
To have a clear understanding of how the optimization process for SQL
Server actually works, we need to look at each of these phases in
detail.
Query Analysis
--------------
Query analysis breaks down the initial query that has been input by
looking for clauses within the query that it can optimize, such as the
search clause
column_1 = 45
or the join clause
table1.column_2 > table2.column_4
SQL Server can optimize clauses that limit a scan. Several relational
operators produce perfectly valid SQL queries that nevertheless
contain non-optimizable clauses. In these cases, the
optimizer uses indexes for the portions of the query that it can break
into optimizable clauses, but SQL Server will execute the
non-optimizable portions using table scans.
Operators that are difficult to optimize include NOT = and OR. Because
NOT = is an exclusive rather than an inclusive relational operator, it
cannot be used to limit a scan. The entire table must be scanned to
determine which tuples qualify. The query analyzer cannot optimize any
clause that includes NOT =. If you use OR to join conditions for
columns in different tables, the optimizer cannot logically use it to
limit a scan and therefore cannot accomplish optimization.
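As a sketch of a workaround for the cross-table OR case, the
non-optimizable condition can often be rewritten as a UNION so that
the optimizer can consider an index for each branch separately. The
table and column names below reuse the earlier join example; note
that UNION also removes duplicate rows, so the rewrite is equivalent
only when that is acceptable:

```sql
-- An OR joining conditions on columns in different tables
-- cannot be used to limit a scan:
select table1.column_2
from table1, table2
where table1.column_1 = 45
   or table2.column_4 > 100

-- Rewritten as a UNION, each branch can be optimized on its own:
select table1.column_2
from table1, table2
where table1.column_1 = 45
union
select table1.column_2
from table1, table2
where table2.column_4 > 100
```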
Index Selection
---------------
If an input query contains any optimizable SQL syntax, the query is
broken into its component clauses, and the optimizer then performs
index selection for each optimizable clause. Before SQL Server will
use a table index with an associated clause, the index must pass
three tests during index selection.
Is the index useful?
First, the index must be useful. This implies that there is a
description of the index in the database system tables, and that the
columns in the clause match a prefix of the columns in the index
itself. In order to better understand this, let's use the example of a
phone book. Let's assume you used SQL syntax to create an index on a
phone book table called phone_book that contained, among other
columns, last_name and first_name for every table entry:
create index name_index
on phone_book (last_name, first_name)
This index would be useful for performing a selection that covered the
entire index, i.e., a selection including both last_name and
first_name. It also would be useful for a selection that matches a
prefix of the index, i.e., a selection including just last_name, such
as looking up all of the people with the last name of "Smith". It
would not, however, be useful for a selection that did not match a
prefix of the index, such as a selection on first_name. In other
words, this index would not help you find all the people in the phone
book with the first name of "George". You would still need to start at
the beginning of the book and read each entry to determine if it met
the qualifications. The obvious point here is that by analyzing the
types of queries that are anticipated on a particular table and
providing useful indexes, a developer can help SQL Server improve
performance and efficiently optimize more ad hoc queries.
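To make the prefix rule concrete, here is a sketch of how name_index
applies to a few selections (the queries themselves are illustrative):

```sql
-- Matches the full index (last_name, first_name): index is useful
select * from phone_book
where last_name = "Smith" and first_name = "George"

-- Matches a prefix of the index (last_name alone): index is useful
select * from phone_book
where last_name = "Smith"

-- Does not match a prefix (first_name alone): name_index cannot
-- limit this scan
select * from phone_book
where first_name = "George"
```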
How selective is the clause?
If the optimizer finds a useful index for a query clause, index
selection moves on to the second test: determining the selectivity of
the clause. Selectivity is an estimate of the percentage of the
tuples in a table that will be returned for the clause. This is one
of the most useful, and most often underused, features of SQL Server.
Each index
has an associated distribution page that stores index statistics.
These statistics can provide a good estimate of the selectivity of the
clause, but only if the information contained on the index
distribution page is accurate and up-to-date for the table.
If a developer establishes an index when creating a table -- before
any data has been entered -- there will be no associated distribution
page created for the index. Simply entering data into the table will
not cause SQL Server to create statistical data. Because no
statistical information is available, the optimizer will use
standardized selectivity values based on the type of relational
operator used in the clause. While these estimated selectivity values
will help the optimizer decide whether or not to use the associated
indexes for a clause, they are average values that may not reflect the
actual distribution of the table. In order to provide SQL Server with
accurate statistics that reflect the actual tuple distribution of a
populated table, the developer or system administrator needs to update
the statistical information for the table by executing the SQL Server
UPDATE STATISTICS command. For our phone_book example, this would be
accomplished by executing either
update statistics phone_book name_index
to update the statistics for the single index name_index, or
update statistics phone_book
to update the statistical information for every index associated with
the table phone_book. Once the developer or administrator establishes
statistics for a particular table, the optimizer will use them to
determine selectivity for clauses.
For dynamic tables where data distribution is changed through
insertions, deletions, and updates, the statistical information may
become invalid over time. The system administrator is responsible for
understanding the uses of the various objects in the database and
executing the UPDATE STATISTICS command as appropriate to keep the
statistical information current for the table. The frequency and type
of usage incurred by each object will determine the frequency with
which this needs to be done.
How much does it cost?
Index selection will use the estimated selectivity of each clause and
the type of index that is available to calculate the cost in terms of
page accesses for the clause. This is the final test in terms of index
selection. If the estimated cost of using an available index for the
clause is less than a straight table scan of the data, the optimizer
will use the index. This is another area where analysis of the
anticipated queries on a table can allow the developer to assist the
SQL Server optimizer.
A standard non-clustered index, at the leaf level, contains a series
of pointers to the associated data pages in the table that stores the
rows. Because there is no relationship between a non-clustered index
and the order in which the data is stored in the table, the cost of
using a non-clustered index includes an additional page access for
each qualifying row. As a result of this associated page access
overhead, once a clause selects roughly 10 to 15 percent of the rows
in a smaller table, it is less expensive for the optimizer to execute
a table scan than to use the index to access the data. There are,
however, several index characteristics a
developer should consider that can improve the performance of indexed
data retrieval and potentially improve throughput for data
manipulation queries.
If a non-clustered index covers a query (i.e., every column
referenced in both the select list and the WHERE clause is part of
the index), there is no additional page access cost associated with
using the index. In our phone_book example above, the name_index
index would
cover the query
select last_name
from phone_book
where first_name = "George"
and therefore would perform more efficiently for this query and
improve the data retrieval performance.
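For contrast, here is a sketch of a query that name_index does not
cover; phone_number is a hypothetical column that is not part of the
index:

```sql
-- Not covered: phone_number is not in name_index, so each
-- qualifying row requires an additional access to the data pages
-- of the table
select phone_number
from phone_book
where first_name = "George"
```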
Each table may have one clustered index. Because there is a one-to-one
correspondence between a clustered index and the order in which data
is stored in the table (in fact, the leaf level of a clustered index
maps directly to the stored data pages), there is no additional page
access cost associated with this index for each qualifying row. For
tables where a clustered index is appropriate, this can again improve
data manipulation query performance. A clustered index forces the
table to store tuples in sorted order, so there is both sorting and
page splitting overhead associated with inserts, deletes, and updates
that may make this type of index inappropriate for either extremely
dynamic or extremely large tables that are subject to modification.
Expansion overhead for a clustered index can be reduced using SQL
Server's FILLFACTOR feature. Using FILLFACTOR reserves a percentage of
free space on each data page of the index to accommodate future
expansion. This is a tuning issue that the developer and system
administrator will need to address when considering the use of
clustered indexes.
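As a sketch, FILLFACTOR is specified when the index is created; the
index name and the value 70 (reserving roughly 30 percent free space
on each page) are illustrative:

```sql
create clustered index name_cluster
on phone_book (last_name, first_name)
with fillfactor = 70
```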
Finally, if a unique index is created for the table so that no
duplicate index values can be inserted, there is a special case that
reduces processing costs. If all of the keys in the WHERE clause use
the equality relational operator and are contained in the unique
index, the optimizer knows that this query can match only one row in
the table. In this case there is a minimal fixed cost involved in
using the unique index to process the query.
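A sketch of this special case, assuming a hypothetical phone_id
column that uniquely identifies each entry in phone_book:

```sql
create unique index id_index
on phone_book (phone_id)

-- Every key of the unique index appears with the equality operator,
-- so the optimizer knows at most one row can qualify
select last_name, first_name
from phone_book
where phone_id = 5551234
```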
Once all of the clauses have been processed for index selection and
have an associated processing time, the optimizer is ready to perform
join selection. The optimizer compares all combinations of the
clauses, then selects the join plan with the lowest estimated
processing costs. Because the combinations of the clauses can grow
exponentially as the complexity of a query increases, SQL Server uses
tree pruning techniques to minimize the overhead associated with these
comparisons.
Using Tuning Tools
------------------
SQL Server includes several tuning tools that allow users to view this
internal optimization process. After executing the command
set showplan on
all queries sent to SQL Server will return not only the expected
results, but also a description of the query execution plan the
optimizer used to produce these results. For each step of the
execution plan, this description shows whether the optimizer decided
to use an existing index or a table scan to retrieve the results.
After SQL Server executes the command
set statistics time on
it displays the time taken to parse and compile each statement, along
with the time that it took to execute each step in the query execution
plan. These two commands used together can help users modify their
queries and index structures to further improve the performance of the
SQL Server database engine.
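A minimal tuning session using both commands together might look like
the following; the sample query is illustrative, and `go` is the
batch terminator used by tools such as isql:

```sql
set showplan on
set statistics time on
go

-- The results come back along with the chosen execution plan and
-- the parse, compile, and execution times for each step
select last_name, first_name
from phone_book
where last_name = "Smith"
go
```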
As you can see from this discussion of SQL Server optimization, there
are no hard and fast rules -- such as "when the number of rows exceeds
100" -- that you can use to determine when SQL Server will take
advantage of indexes to assist in query processing. However, by
understanding the underlying processes of the optimizer and analyzing
the types of queries being performed on the tables, you can create
specific indexes for your data tables that will help your SQL Server
optimizer improve system throughput.