This month, columnist Ron Talmage shows how to use WHILE instead of cursors (why? cleaner code, fewer lines, and less overhead) and how he used a WHILE loop to trim a mass update down to a manageable runtime. His article title is patterned after a famous paper from the late 70s called "The Ubiquitous B-Tree."
Try as we might to avoid it, there are times when we simply must navigate through data in a particular order. The usual way of accomplishing this in SQL Server is through a T-SQL server-side cursor. But writing cursors is fraught with danger. They can seem an easier answer than the more elegant, but harder, set-oriented SQL that should solve the problem. And because they impose a sequential access method on a set-oriented relational database, they also tend to incur a fair amount of overhead.
I'm convinced that even when we need to navigate through data, cursors generally aren't necessary. In this column, you'll see how you can use a comparison-based WHILE loop to avoid cursors. In addition, you'll see how a comparison-based WHILE loop can improve the performance of a large mass update.
Navigating a table using WHILE
There are two general approaches to processing database data: set-oriented processing using SQL and sequential processing using a cursor mechanism. The difference is that the set-oriented SQL commands perform their operations on an entire table (a set of rows) or a subset of a table that's been "restricted," typically by a WHERE clause. The point is, your SQL operation doesn't ever locate you at a particular row in the table. The actual physical order of the rows that are processed by a command is, for all practical purposes, irrelevant.
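To make that concrete, a set-oriented command touches every qualifying row in a single statement, with no notion of a "current" row. Here's a trivial sketch against the pubs sample database (the 10 percent price increase is purely illustrative):

UPDATE titles
SET price = price * 1.1
WHERE type = 'business'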
Sequential processing, on the other hand, lets you navigate a list of rows. You can open a cursor and start at the "first" row of a table, which might be the first physical row or a row with the lowest value of some key, and work your way down to the "last" row, the row with the highest value of the key.
Sometimes navigating a table sequentially is necessary, but you probably don't need a cursor to do it. If you base a WHILE loop on comparisons of values in the table, you can "navigate" the table using standard SQL commands, doing the same work as a cursor with less code. For example, the following loop steps through the rows of the pubs database's authors table on the primary key au_id:
DECLARE @au_id CHAR(11)
SET @au_id = (SELECT MIN(au_id) FROM authors)
WHILE @au_id IS NOT NULL
BEGIN
    -- Process the row with that id
    -- Step forward:
    SET @au_id = (SELECT MIN(au_id) FROM authors
                  WHERE au_id > @au_id)
END
The first SET command assigns the lowest value to @au_id by selecting the minimum value of au_id from authors. Then, as long as that variable contains data, you can process the row based on the key value stored in the variable.
To advance to the next row, the loop gets the next highest value and assigns it to @au_id. To get the next highest value, the second SET command assigns a new value to @au_id by selecting the minimum value of au_id from authors, but only the minimum among the au_ids that are greater than the current value in @au_id. The result is to simulate stepping forward row by row. If you wanted to step backward, then you'd just get the MAX() value of au_id among all of the au_ids that are smaller than the current one:
SET @au_id = (SELECT MAX(au_id) FROM authors
              WHERE au_id < @au_id)
To get to the "last" row of the table, based on the au_id key, just SELECT the MAX() of au_id from authors:
SET @au_id = (SELECT MAX(au_id) FROM authors)
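Put the two together and you have a complete backward walk. This is just a sketch combining the preceding fragments against the same authors table:

DECLARE @au_id CHAR(11)
-- Start at the "last" row
SET @au_id = (SELECT MAX(au_id) FROM authors)
WHILE @au_id IS NOT NULL
BEGIN
    -- Process the row with that id
    -- Step backward:
    SET @au_id = (SELECT MAX(au_id) FROM authors
                  WHERE au_id < @au_id)
END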
Now let's compare this technique with a cursor.
Replacing a cursor with WHILE
A common way to run a maintenance routine, say DBCC CHECKDB against a set of databases, is to write a cursor against sysdatabases in master and then EXEC the DBCC command against each database:
USE master
GO
DECLARE @db VARCHAR(100), @cmd VARCHAR(500)
DECLARE DbCursor CURSOR FOR
    SELECT name FROM sysdatabases
OPEN DbCursor
FETCH NEXT FROM DbCursor INTO @db
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @cmd = 'DBCC CHECKDB(' + RTRIM(@db) + ')'
    EXEC(@cmd)
    FETCH NEXT FROM DbCursor INTO @db
END
CLOSE DbCursor
DEALLOCATE DbCursor
The cursor version also uses a WHILE loop, in order to repeat the FETCH command and walk the table. Depending on the cursor type, it may also build a physically ordered copy of the rows just to advance through the table.
But you don't need a cursor to accomplish this task. Instead, since database names on a given server must be unique, you can use a comparison-based WHILE loop:
USE master
GO
DECLARE @db VARCHAR(100), @cmd VARCHAR(1000)
SET @db = (SELECT MIN(CATALOG_NAME)
           FROM INFORMATION_SCHEMA.SCHEMATA)
WHILE @db IS NOT NULL
BEGIN
    SET @cmd = 'DBCC CHECKDB(' + @db + ')'
    PRINT @cmd
    EXEC(@cmd)
    SET @db = (SELECT MIN(CATALOG_NAME)
               FROM INFORMATION_SCHEMA.SCHEMATA
               WHERE CATALOG_NAME > @db)
END
Not only does it take fewer lines, but the comparison-based navigation doesn't incur the overhead of the temporary tables that a cursor might require, and there's no need for a physically ordered copy of the table. (The preceding code uses an ANSI information schema view, which is preferable to directly accessing a system table.) Plus, you don't need to declare whether you're in an update mode in order to update data. You're always in full update mode, because at any time you can issue an UPDATE or DELETE command, or even an INSERT, which you can't do through a cursor.
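That last point is worth a quick illustration. Inside a comparison-based loop, the body is ordinary SQL, free to modify whatever it likes. Here's a sketch against the pubs authors table; the contract update is purely illustrative (and this particular change needs no loop at all; the point is only the freedom to modify):

DECLARE @au_id CHAR(11)
SET @au_id = (SELECT MIN(au_id) FROM authors)
WHILE @au_id IS NOT NULL
BEGIN
    -- Modify the current row directly: no update mode to declare
    UPDATE authors
    SET contract = 1
    WHERE au_id = @au_id AND state = 'CA'
    SET @au_id = (SELECT MIN(au_id) FROM authors
                  WHERE au_id > @au_id)
END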
When can a key-based WHILE loop not replace a cursor? If you must use the functionality of FETCH ABSOLUTE or RELATIVE, or if you must walk a table row by row based on non-unique values of a column, then you'll need to use a cursor.
Using WHILE to optimize a mass update
Recently I helped out with a fairly large data migration. The data loaded fast enough, but further processing on the data after the load was taking far too much time. A legacy account table with eight million rows (3GB) had username values in every row, all linked to a small users table that contained about 1,000 rows. The new design called for replacing the username with the userid from the smaller table. The customer's first mass update attempt, shown in the following code, ran for more than 12 hours before they canceled the query. (The rollback itself took several hours.) This was during a "dry run" of a planned cutover, so it was imperative to improve the query's execution time.
UPDATE a
SET a.userid = u.userid
FROM Account a, Users u
WHERE a.username = u.username
To speed the query up, the first thing we did was change the clustered index on the Account table using a new SQL Server 7.0 construct:
CREATE CLUSTERED INDEX ixc_Account
    ON Account(username) WITH DROP_EXISTING
What this does is rebuild the clustered index in a single command. Clustering on username also groups each user's account rows physically together, so each update touches a contiguous range of pages. The new clustered index reduced the query time to five hours. This was still too long for the cutover schedule. We checked all of our indexes, made sure the statistics were updated, and used the Query Analyzer's Index Analysis utility, and it still took five hours.
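For comparison, before 7.0 the same change took two separate steps; dropping a clustered index forces the table's nonclustered indexes to be rebuilt, and creating the new clustered index rebuilds them again. A sketch using the same names:

-- Pre-7.0: two steps, with nonclustered indexes rebuilt twice along the way
DROP INDEX Account.ixc_Account
CREATE CLUSTERED INDEX ixc_Account ON Account(username)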
Then we put the update into a WHILE loop based on comparing the larger table's primary key value, stepping through the large table row by row, eight million times, with each update affecting only one row. The execution time was reduced to just over three hours. Finally, we had our clue! We then refined the WHILE loop, basing it on the username from the smaller table:
DECLARE @username VARCHAR(20), @userid INT
SELECT @username = MIN(username) FROM Users
WHILE @username IS NOT NULL
BEGIN
    SELECT @userid = userid
    FROM Users
    WHERE username = @username
    -- Update one user's chunk of Account rows
    UPDATE Account
    SET userid = @userid
    WHERE username = @username
    -- Step to the next username
    SELECT @username = MIN(username)
    FROM Users
    WHERE username > @username
END
Now the query time was reduced to 43 minutes, a far more reasonable figure! We also liked removing the join. (Note that this WHILE doesn't rely on a unique key. Instead, it updates the table in 1,000 different-sized chunks, based on the username.) I've since found similar performance improvements on other mass updates of large tables. The "chunked" update in a WHILE loop seems consistently faster than a single update when the table being updated is larger than the available RAM.
I talked about this query with some major SQL Server brains after our last SQL Server user group meeting [Ron is head of the Redmond-based Pacific Northwest SQL Server Users Group, so you can probably figure out what he means by "major SQL Server brains." -Ed.], and their consensus was that the final version performed better because the WHILE loop segmented the data into chunks small enough for SQL Server to do all of its work in RAM.
Additional benefits
There are some significant benefits to using a comparison-based WHILE loop on a mass update. Since the updates are smaller, if you need to cancel the query, the rollbacks are shorter. If you add some additional logic to detect where you left off, you can cancel the query and continue it later without having to restart at the beginning. SQL Server doesn't provide a way of estimating how much time a query will take or, after it starts, how much work the query has done and how much more time it will take. We don't get progress bars for our queries!
However, it's a simple task to insert code into a comparison-based WHILE loop to display how many iterations have taken place, how much time they've taken, and how many rows have been processed. With that information, you can see how well a query is progressing, and you can reasonably estimate how long it will take to finish.
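Here's what that instrumentation might look like folded into the mass-update loop from the previous section; the table and column names are the ones from that example, and the PRINT format is just one choice among many:

DECLARE @username VARCHAR(20), @userid INT
DECLARE @start DATETIME, @chunks INT, @rows INT
SET @start = GETDATE()
SET @chunks = 0
SET @rows = 0
SELECT @username = MIN(username) FROM Users
WHILE @username IS NOT NULL
BEGIN
    SELECT @userid = userid FROM Users WHERE username = @username
    UPDATE Account
    SET userid = @userid
    WHERE username = @username
    -- @@ROWCOUNT still reflects the UPDATE at this point
    SET @rows = @rows + @@ROWCOUNT
    SET @chunks = @chunks + 1
    PRINT 'Chunk ' + CONVERT(VARCHAR, @chunks)
        + ': ' + CONVERT(VARCHAR, @rows) + ' rows in '
        + CONVERT(VARCHAR, DATEDIFF(second, @start, GETDATE()))
        + ' seconds'
    -- To make the loop restartable, you could also save @username
    -- to a work table here and seed from it on the next run
    SELECT @username = MIN(username)
    FROM Users
    WHERE username > @username
END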
Ron Talmage is senior database developer with GTE Enterprise Solutions in Kirkland, WA, and a SQL Server instructor for Data Dimensions, in Bellevue, WA. Ron is author of SQL Server Administrator's Guide from Prima Publishing and is a SQL Server MVP, an MCSD, and an MCP in SQL Server. He's also the current president of the Pacific Northwest SQL Server Users Group. RonTalmage@compuserve.com.