The Ubiquitous WHILE Loop

Ron Talmage

This month, columnist Ron Talmage shows how to use WHILE instead of cursors (why? better code, fewer lines of code, and less overhead) and how he used a WHILE to trim a mass update down to a manageable runtime. His article title is patterned after a famous paper from the late ’70s called "The Ubiquitous B-Tree."

Try as we might to avoid it, there are times when we simply must navigate through data in a particular order. The usual way of accomplishing this in SQL Server is with a T-SQL server-side cursor. But writing cursors is fraught with danger. They can seem like an easier solution to a problem that's better solved with more elegant–but harder to write–set-oriented SQL commands. And because they impose a sequential access method on a set-oriented relational database, they also tend to incur a fair amount of overhead.

I’m convinced that even when we need to navigate through data, cursors generally aren’t necessary. In this column, you’ll see how you can use a comparison-based WHILE loop to avoid cursors. In addition, you’ll see how a comparison-based WHILE loop can improve the performance of a large mass update.

Navigating a table using WHILE

There are two general approaches to processing database data: set-oriented processing using SQL and sequential processing using a cursor mechanism. The difference is that the set-oriented SQL commands perform their operations on an entire table (a set of rows) or a subset of a table that’s been "restricted"–typically by a WHERE clause. The point is, your SQL operation doesn’t ever locate you at a particular row in the table. The actual physical order of the rows that are processed by a command is, for all practical purposes, irrelevant.

Sequential processing, on the other hand, lets you navigate a list of rows. You can open a cursor and start at the "first" row of a table, which might be the first physical row or a row with the lowest value of some key, and work your way down to the "last" row, the row with the highest value of the key.

Sometimes navigating a table sequentially is necessary, but you probably don’t need a cursor to do it. If you base a WHILE loop on comparisons of values in the table, you can "navigate" the table using standard SQL commands, doing the same work as a cursor with less code. For example, the following loop steps through the rows of the pubs database’s authors table on the primary key au_id:

DECLARE @au_id CHAR(11)
SET @au_id = (SELECT MIN(au_id) FROM authors)
WHILE @au_id IS NOT NULL
  BEGIN
    -- Process the row with that id
    -- Step forward:
    SET @au_id = (SELECT MIN(au_id) FROM authors
      WHERE au_id > @au_id)
  END

The first SET command assigns the lowest value to @au_id by selecting the minimum value of au_id from authors. Then, as long as that variable contains data, you can process the row based on the key value stored in the variable.
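
For example, the processing step might fetch the rest of the row by its key and print it. Here's a minimal sketch of what could replace the "Process the row" comment–the PRINT stands in for whatever per-row work you actually need, and the extra DECLARE would go alongside the declaration of @au_id:

DECLARE @au_fname VARCHAR(20), @au_lname VARCHAR(40)
SELECT @au_fname = au_fname, @au_lname = au_lname
  FROM authors
  WHERE au_id = @au_id
PRINT @au_fname + ' ' + @au_lname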

To advance to the next row, the loop gets the next highest value and assigns it to @au_id. To get the next highest value, the second SET command assigns a new value to @au_id by selecting the minimum value of au_id from authors–but only the minimum value among the au_ids that are greater than the current value in @au_id. The result is to simulate stepping forward row by row. If you wanted to step backward, then you’d just get the MAX() value of au_id among all of the au_ids that are smaller than the current one:

SET @au_id = (SELECT MAX(au_id) FROM authors
  WHERE au_id < @au_id)

To get to the "last" row of the table, based on the au_id key, just SELECT the MAX() of au_id from authors:

SET @au_id = (SELECT MAX(au_id) FROM authors)
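
Putting those two pieces together, a complete backward walk looks just like the forward loop with MAX() substituted for MIN() and the comparison reversed–a sketch of the same pattern run in the opposite direction:

DECLARE @au_id CHAR(11)
SET @au_id = (SELECT MAX(au_id) FROM authors)
WHILE @au_id IS NOT NULL
  BEGIN
    -- Process the row with that id
    -- Step backward:
    SET @au_id = (SELECT MAX(au_id) FROM authors
      WHERE au_id < @au_id)
  END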

Now let’s compare this technique with a cursor.

Replacing a cursor with WHILE

A common way to run a maintenance routine–say, DBCC against a set of databases–is to write a cursor against sysdatabases in master and then EXEC the DBCC command against the database:

USE Master
go
DECLARE @db VARCHAR(100), @cmd VARCHAR(500)
DECLARE DbCursor CURSOR 
FOR
SELECT name FROM sysdatabases
OPEN DbCursor
FETCH NEXT FROM DbCursor INTO @db
WHILE @@FETCH_STATUS = 0
  BEGIN
    SET @cmd = 'DBCC CHECKDB(' + rtrim(@db) + ')'
    EXEC(@cmd)
    FETCH NEXT FROM DbCursor INTO @db
  END
CLOSE DbCursor
DEALLOCATE DbCursor

Notice that the cursor version also depends on a WHILE loop–in this case, to repeat the FETCH command and walk the table. In addition, the cursor creates a physically ordered copy of the rows in order to advance through the table.

But you don’t need a cursor to accomplish this task. Instead, since database names on a given server must be unique, you can use a comparison-based WHILE loop:

USE Master
go
DECLARE @db VARCHAR(100), @cmd VARCHAR(1000)
SET @db = (SELECT MIN(CATALOG_NAME) 
  FROM INFORMATION_SCHEMA.SCHEMATA)
WHILE @db IS NOT NULL
  BEGIN
    SET @cmd = 'DBCC CHECKDB(' + @db + ')'
    PRINT @cmd
    EXEC (@cmd)
    SET @db = (SELECT MIN(CATALOG_NAME) 
      FROM INFORMATION_SCHEMA.SCHEMATA
      WHERE CATALOG_NAME > @db)
  END

Not only does the comparison-based version take fewer lines, but it also avoids the overhead of the temporary tables that a cursor might require, and there's no need for a physically ordered copy of the table. (The preceding code uses an ANSI INFORMATION_SCHEMA view, which is preferable to querying a system table directly.) Plus, you don't need to declare an update mode in order to update data. You're always in full update mode, because at any point in the loop you can issue an UPDATE or DELETE command–or even an INSERT, which you can't do through a cursor.

When can a key-based WHILE loop not replace a cursor? If you must use the functionality of FETCH ABSOLUTE or RELATIVE, or if you must walk a table row by row based on non-unique values of a column, then you’ll need to use a cursor.

Using WHILE to optimize a mass update

Recently I helped out with a fairly large data migration. The data loaded fast enough, but the processing that followed the load was taking far too much time. A legacy Account table with eight million rows (about 3GB) carried a username value in every row–each linked to a small Users table of about 1,000 rows. The new design called for replacing the username with the userid from the smaller table. The customer's first mass update attempt, shown in the following code, ran for more than 12 hours before they canceled the query. (The rollback itself took several hours.) This was during a "dry run" of a planned cutover, so it was imperative to improve the query's execution time.

UPDATE Account
  SET userid = Users.userid
  FROM Account, Users
  WHERE Account.username = Users.username

To speed the query up, the first thing we did was change the clustered index on the Account table using a new SQL Server 7.0 construct:

CREATE CLUSTERED INDEX ixc_Account 
  ON Account(username) WITH DROP_EXISTING

What this does is rebuild the clustered index in a single command. The new clustered index reduced the query time to five hours. That was still too long for the cutover schedule. We checked all of our indexes, made sure the statistics were up to date, and ran the Query Analyzer's Index Analysis utility–and the update still took five hours.

Then we put the update into a WHILE loop based on comparing the larger table’s primary key value, stepping through the large table row by row, eight million times, with each update affecting only one row. The execution time was reduced to just over three hours–finally, we had our clue! We then refined the WHILE loop, basing it on the username from the smaller table:

DECLARE @username VARCHAR(20), @userid INT
-- Start with the lowest username in the small table
SET @username = (SELECT MIN(username) FROM Users)
WHILE @username IS NOT NULL
  BEGIN
    SET @userid = (SELECT userid FROM Users
      WHERE username = @username)
    -- Update one "chunk" of Account rows
    UPDATE Account
      SET userid = @userid
      WHERE username = @username
    -- Step forward to the next username
    SET @username = (SELECT MIN(username) FROM Users
      WHERE username > @username)
  END

Now the query time was reduced to 43 minutes, a far more reasonable figure! We also liked removing the join. (Note that this WHILE doesn't rely on a unique key in the table being updated. Instead, it updates Account in roughly 1,000 differently sized chunks, one per username.) I've since found similar performance improvements on other mass updates of large tables. The "chunked" update in a WHILE loop seems consistently faster than a single update when the table being updated is larger than the available RAM.

I talked about this query with some major SQL Server brains after our last SQL Server user group meeting [Ron is head of the Redmond-based Pacific Northwest SQL Server Users Group, so you can probably figure out what he means by "major SQL Server brains."–Ed.], and their consensus was that the final version performed better because the WHILE loop segmented the data into chunks small enough for SQL Server to do all of its work in RAM.

Additional benefits

There are some significant benefits to using a comparison-based WHILE loop on a mass update. Since the updates are smaller, if you need to cancel the query, the rollbacks are shorter. If you add some additional logic to detect where you left off, you can cancel the query and continue it later without having to restart at the beginning. SQL Server doesn’t provide a way of estimating how much time a query will take or, after it starts, how much work the query has done and how much more time it will take. We don’t get progress bars for our queries!

However, it’s a simple task to insert code into a comparison-based WHILE loop to display how many iterations have taken place, how much time they’ve taken, and how many rows have been processed. With that information, you can see how well a query is progressing, and you can reasonably estimate how long it will take to finish.
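
For instance, here's one way you might instrument the refined Account update loop from the previous section. This is just a sketch–the counter variables (@iterations, @rows, @start) and the PRINT format are illustrative, not part of the code we ran:

DECLARE @username VARCHAR(20), @userid INT
DECLARE @iterations INT, @rows INT, @start DATETIME
SELECT @iterations = 0, @rows = 0, @start = GETDATE()
SET @username = (SELECT MIN(username) FROM Users)
WHILE @username IS NOT NULL
  BEGIN
    SET @userid = (SELECT userid FROM Users
      WHERE username = @username)
    UPDATE Account
      SET userid = @userid
      WHERE username = @username
    -- @@ROWCOUNT still reflects the UPDATE here
    SELECT @rows = @rows + @@ROWCOUNT,
           @iterations = @iterations + 1
    PRINT 'Iteration ' + CONVERT(VARCHAR(10), @iterations)
      + ': ' + CONVERT(VARCHAR(10), @rows) + ' rows updated in '
      + CONVERT(VARCHAR(10), DATEDIFF(ss, @start, GETDATE()))
      + ' seconds (through username ' + @username + ')'
    SET @username = (SELECT MIN(username) FROM Users
      WHERE username > @username)
  END

Because the loop advances on the username key, the last value printed also tells you where you could pick up again if you have to cancel the update partway through.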

Ron Talmage is senior database developer with GTE Enterprise Solutions in Kirkland, WA, and a SQL Server instructor for Data Dimensions, in Bellevue, WA. Ron is author of SQL Server Administrator’s Guide from Prima Publishing and is a SQL Server MVP, an MCSD, and an MCP in SQL Server. He’s also the current president of the Pacific Northwest SQL Server Users Group. RonTalmage@compuserve.com.