The information in this article applies to:
SUMMARYMicrosoft SQL Server version 7.0 includes some new join techniques, such as the hash join and sort-merge join, to supplement nested-loop joins. This article describes these new techniques and compares and contrasts them. MORE INFORMATION
Microsoft SQL Server version 6.5 and earlier had only one technique for
joining records: the nested loops join. SQL Server 7.0 implements hash and
sort-based algorithms. The decision by the query engine on whether to use
hash versus sort depends on relative sizes of the inputs, whether they are
sorted or not, whether the hash table will fit entirely in memory, and so
on. The optimizer will always choose the best join plan to execute your
query, and that may be any one of the three options. The difference between
hash joins and merge joins is often small, but there are certain situations
where one is better than the other. There are considerable differences
between these two join types and the nested loops join, as shown below. The
example in this article is provided only as an illustration. The table at
the end of this article is an illustration of some decisions that may
affect the join type used. It is generally recommended that you not force
join types, overriding the optimizer.
Nested Loop JoinsIn a nested loops join, one of the tables is scanned; for every row in that table, a lookup is performed on the second table and the matching rows are retrieved. We will call these tables outer and inner input. The basic algorithm is to scan the inner input once for each value of outer input, looking for a match. Depending on the presence of indexes on the inner input the lookup can run quite efficiently. Indexed nested loops join is obviously more efficient than non-indexed nested loops. Also, if the tables have a 1-1 relationship, the scan of the inner table can stop as soon as the row is found that matches the row in the outer table.Any query can be processed using this methodology including joins on inequality conditions. Nested loops example:
(loop join)Reserves has 529 pages and 100000 rows Sailors has 195 pages 40000 rows Table 'sailors'. Scan count 100000, logical reads 207596, physical reads 3, read-ahead reads 0. Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.
Hash JoinsIn a hash match, a hash table is created in memory for one of the inputs (build input) using a repeatable randomizing function and this table is searched for matching values from the other input (probe input). The hash table performs like an index. The data from the build input is stored in memory. If the build input does not fit in memory it is partitioned recursively until it fits in memory. The probe input is then hashed and used to search for matches. Unlike with nested loops, the presence or absence of indexes is not particularly a concern in this case. Hash joins are CPU-intensive in comparison to nested loops and are affected by available memory. Hash joins are better when there is a significant difference in the sizes of the tables being joined.In a hash join the build input determines the recursion depth because it can stop partitioning when the input fits in memory. A single hash join can perform both grouping and join at the same time when the grouping attribute is also the join attribute. The result of this join is not in any particular order. Inequality conditions cannot be satisfied by this type of join.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read-
ahead reads 0.Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0. Sort-Merge JoinIn a sort-merge, sorting the larger input dominates a large portion of the total cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.The algorithm first examines the tuple of one input, scans the second sorted input until a row with equal or larger join value is found. Match up tuple in R1 with all tuples in R2 with matching join value. Repeat for all tuples in R1. Merge join requires that both inputs be in memory (as opposed to a hash join, where only the smaller input is in memory). The merge join is obviously more efficient if the inputs are already sorted. Also the result of the join is sorted on the join attribute. Inequality conditions cannot be satisfied by this type of join.
Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0. Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0. By changing the size of the input such that the relative sizes of the inputs vary, (in our example reserves had only 500 rows and sailors 50,000 rows) merge join for the same query as above took much longer than hash as can be seen below: Merge join: SQL Server Execution Times: CPU time = 1081 ms, elapsed time = 2211 ms. Hash join: SQL Server Execution Times: CPU time = 181 ms, elapsed time = 479 ms. If an index exists on the join key, the data does not need to be sorted and therefore a merge join will be the best choice.
Additional query words: prodsql Comparison optimizer
Keywords : |
Last Reviewed: April 19, 1999 © 2000 Microsoft Corporation. All rights reserved. Terms of Use. |