Prevent "Stale Data" Problems

R. L. Parker

In this article, R.L. Parker shows how to solve the "stale" data problem, but he spares you the intimidating jargon that sometimes accompanies articles on transaction processing in distributed systems.

Clients in so-called "n-tier" applications are typically "disconnected" from the data source. That is, the middle-tier component doesn't hold a database connection open on behalf of the client. The component opens a connection (or reuses a pooled connection), gets the data, closes the connection, and then returns the data to the client. Then, when the client wants some more data, or to update some data that it previously fetched, the component opens a new connection to fulfill the client request. In these cases, the application and component must cooperate in order to detect attempts to update stale data.

Some application programmers ignore this problem. This is especially true in applications where there's a low probability that more than one user will try to update the same record in a given time period. The attitude seems to be "Well, it probably won't happen, so let's not waste a lot of time coding for it." Of course, this attitude is dangerous in light of Murphy's law: If it can happen, it will happen.

In this article, I'll illustrate what I mean by stale data and present a relatively simple technique that you can employ to detect an attempt to update stale data. The technique uses the SQL Server timestamp datatype.

The scenario
Suppose two users (let's be incredibly creative and call them UserA and UserB) are updating the same set of records. Consider the following sequence of events: UserA gets a list of candidate records. In our example, it's a lookup list of car manufacturers (see Figure 1). Now UserB gets the same list (see Figure 2).

UserA modifies a record. In this case, UserA changes the spelling of the first record from "BMX" to "BMW" (see Figure 3).

After UserA pushes the Save button, the record is updated. UserB's data is stale because he still has the original data. Now, suppose UserB wants to modify some information in the same record that UserA updated. UserB intends to change "BMX" to "BMY." I'll refer to this as the "stale update" situation (see Figure 4).

What should happen when UserB pushes the Save button? Well, the answer is going to depend on the application requirements, but we can probably agree that the application shouldn't just ignore the problem. If the application ignores the possibility of the stale data problem, UserA is justifiably going to complain that there's something wrong with the application. The change from "BMX" to "BMW" got lost! At the very least, it would be polite for our application to notify UserB that he's attempting to update stale data (see Figure 5). Then UserB would be in a position to make an informed decision about what to do next: Update anyway because he's smarter than UserA? Cancel the operation because UserA knows something he doesn't? Go get UserA and talk the whole thing over?

After UserB sees this error message, he can re-fetch the data and re-apply his changes (if appropriate). In addition to the simple error message shown in Figure 5, a more sophisticated application might show UserB the current record side-by-side with the image he's trying to save. The point is that the application requirements should address the possibility and state what's to be done if the stale update situation arises.

The solution
The data source locking mechanisms are no help to us in this scenario. That's because while locking can prevent UserB from reading data that is being changed by UserA, and it can prevent UserA and UserB from changing data at the same time, our "stale data" scenario fits neither of those descriptions.

Luckily, we're using SQL Server (6.5 or 7.0). Any SQL Server table can have zero or one column defined as type timestamp. Despite its name, values of this type don't have anything to do with time, other than the fact that, as the record is updated, the timestamp column's value increases monotonically over time.

SQL Server automatically changes the value in the timestamp column when a row is inserted or updated. The value of the timestamp can't be updated explicitly. Given these two facts about tables with timestamp columns, we can introduce a technique that ensures that an application can always detect the stale update situation.

By the way, ADO disconnected recordsets provide another possible solution to the problem, but in many applications, I prefer not to use them. That's because they introduce a level of overhead and complexity that isn't necessary to meet the other application requirements. I use ADO, but only on the component side; the component returns result sets to the client in a variant that comes directly from the GetRows method on an ADO Recordset object. Microsoft's Bill Vaughn, author of Hitchhiker's Guide to Visual Basic and SQL Server, calls this the "firehose" technique -- it's very fast and very lightweight.

To implement the timestamp solution, we need to do two things: Add support in the database and modify our component code to take advantage of the timestamp columns.

In the database, we need to identify candidate tables (not every table in the database will need a timestamp column -- only the ones that are possible targets of stale updates) and add the timestamp column to the selected table(s) (see Figure 6).
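
The DDL for this step is short. Here's a sketch -- the table and column definitions are assumptions for illustration, matching the Table1/ts names used in the code later in this article:

```sql
-- Add a timestamp column to an existing table
-- (table and column names assumed for illustration).
ALTER TABLE Table1 ADD ts timestamp

-- Or include it when the table is created:
CREATE TABLE Table1 (
   ID     int IDENTITY PRIMARY KEY,
   sValue varchar(50) NOT NULL,
   ts     timestamp   -- maintained automatically by SQL Server
)
```

Remember that a table can have at most one timestamp column, and you never supply a value for it yourself.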

In the component, we have to code some support for detecting the stale update situation:

1. First, we need to change the code that returns the list. The client code needs to get (and remember) the value of the timestamp column for each row. For example, here's part of our VB "GetX" function:

  Public Function GetRowsTable1() As Variant
  'returns the following columns:
  'Table1.ID, Table1.sValue, Table1.ts
  Dim oData As CData
  Dim sSQL As String
  Set oData = New CData
  'return the PK, the data,
  'and the timestamp column (ts)
  sSQL = "SELECT t.ID, t.sValue, t.ts"
  sSQL = sSQL & " FROM Table1 t"
  sSQL = sSQL & " ORDER BY t.sValue"
  GetRowsTable1 = oData.ExecuteSQL(ConnectString, _
     sSQL)
  End Function


2. Next, we need to add a parameter to the UpdateX method on the component's interface. The application will use this parameter to pass us the timestamp value for the row we're going to try to update.

Public Function UpdateRowTable1(ByVal lTable1ID _
   As Long, ByVal sValue As String, ts As Variant)


Note that the timestamp parameter (ts) is declared as a variant. It's actually an 8-byte binary value that SQL Server generated when the row was inserted or updated.

3. Finally, we need to change the UPDATE statement. This is where the component actually detects that the stale update situation has happened.

   Dim oData As CData
   Dim sSQL As String
   'use parameterized SQL to avoid problems
   'with apostrophes in sValue
   Dim vParams As Variant
   Dim v2dResults As Variant
   Dim lRowCount As Long
   ReDim vParams(0 To 2, 0 To 0)
   Set oData = New CData
   sSQL = "set nocount on;"
   sSQL = sSQL & " UPDATE Table1"
   sSQL = sSQL & " SET sValue = ?"
   sSQL = sSQL & " WHERE ID = " & CStr(lTable1ID)
   sSQL = sSQL & " AND ts = " & MakeTS(ts)
   sSQL = sSQL & ";"
   sSQL = sSQL & " SELECT @@ROWCOUNT"
   vParams(0, 0) = sValue
   vParams(1, 0) = DLParamType.dlParamInput
   vParams(2, 0) = vbString
   v2dResults = oData.ExecuteSQL(sConnection:= _
      ConnectString, sSQL:=sSQL, vaParameters:=vParams)
   lRowCount = v2dResults(0, 0)
   If lRowCount = 0 Then
      Err.Raise CComponentError.ccDataChanged, , _
         "The data has been changed by another " & _
         "process since you fetched it."
   End If


Note that the WHERE clause in our UPDATE statement includes a criterion involving the timestamp value that was passed in to UpdateX. We'll only update the row if the timestamp value that the client passed in is the same as the timestamp value currently in the specified row in the database. By selecting @@ROWCOUNT after we attempt the update, we can determine whether or not the criteria were met and the update actually occurred. If not, we raise the error ccDataChanged. Note that this approach would also raise an error if the record we're trying to update had been deleted by another process. If this is a possibility, we need to refine the error message associated with the ccDataChanged error.
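
If the deleted-row case matters in your application, the component can follow the conditional UPDATE with an existence check to tell the two situations apart. A T-SQL sketch (parameter names assumed for illustration):

```sql
-- Sketch: distinguish "changed by someone else"
-- from "deleted by someone else" after the
-- timestamp-qualified UPDATE finds no row.
UPDATE Table1
   SET sValue = @sValue
 WHERE ID = @ID
   AND ts = @ts

IF @@ROWCOUNT = 0
BEGIN
   IF EXISTS (SELECT 1 FROM Table1 WHERE ID = @ID)
      RAISERROR ('The data has been changed by another process.', 16, 1)
   ELSE
      RAISERROR ('The record has been deleted by another process.', 16, 1)
END
```

The component can then map each message (or error state) to its own distinct error code for the client.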

MakeTS is a utility function that changes the 8-byte binary value into a string representation that we can use in our UPDATE statement.
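
The article doesn't list MakeTS itself, but a minimal sketch might look like this, assuming the timestamp arrives as the byte array that ADO returns for a binary column:

```vb
Private Function MakeTS(ByVal ts As Variant) As String
   'Convert the 8-byte binary timestamp into a hex
   'literal (such as 0x00000000000001A4) that can be
   'embedded directly in the T-SQL UPDATE statement.
   Dim i As Long
   Dim sHex As String
   For i = LBound(ts) To UBound(ts)
      'Hex$ drops leading zeros, so pad each byte to two digits
      sHex = sHex & Right$("0" & Hex$(ts(i)), 2)
   Next i
   MakeTS = "0x" & sHex
End Function
```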

Conclusion
Our application now detects the stale update situation if it arises. If it does, the application notifies the user, allowing the user to determine the appropriate course of action.

Due to the simplicity and low overhead of the timestamp technique, I think it's ideal for those applications or parts of applications that have a possibility, but not a probability, of encountering the stale data update situation.


R. L. Parker is a Microsoft Certified Solution Developer and Master Technical Lead at DB Basics, Inc., in Raleigh, NC, who specializes in mentoring and custom development of mission-critical database applications. rlp@dbbasics.com.


Sidebar: Variants vs. Disconnected Recordsets
In the accompanying article, I said that I usually return query results to the client code in a VB variant instead of an ADO disconnected recordset because of the comparative overhead and complexity of ADO. I welcomed the opportunity to expand on that statement a bit, because I admit that I'd been operating on assumptions -- I'd never actually seen any data comparing the two techniques.

Some background
On the component side, I usually use the GetRows method of an ADO Recordset. This method stuffs the query results, as a two-dimensional array, into a VB variant. This technique uses a forward-only, read-only, single-row "cursor." Bill Vaughn calls these "firehose cursors" in his book, Hitchhiker's Guide to Visual Basic and SQL Server. The advantages? The fetch itself is very fast, and the database doesn't need to hold any locks against the fetched data, improving database concurrency. The client code gets the raw data from the component and caches the result set if it so desires. Of course, the client is on its own if it needs to update any of the fetched data.
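
As a sketch of the firehose fetch (connection string and table names are assumptions for illustration), the component-side code amounts to little more than:

```vb
Dim cn As ADODB.Connection
Dim rs As ADODB.Recordset
Dim vData As Variant

Set cn = New ADODB.Connection
cn.Open ConnectString
'The default cursor is forward-only, read-only -- the "firehose"
Set rs = cn.Execute("SELECT ID, sValue FROM Table1")
If Not rs.EOF Then
   'GetRows returns a 2-D variant array indexed (column, row)
   vData = rs.GetRows
End If
rs.Close
cn.Close
```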

But ADO has another technique that can be used to accomplish the same goals -- a cursorless fetch and client-side data caching. The technique is called "disconnected recordsets." The main difference is that, instead of raw data in a two-dimensional array, the component returns a recordset that's an object with a lot of useful properties and methods that the client code can take advantage of. One such method is UpdateBatch, which can be used to update the entire disconnected recordset when the client code returns it to the component. This method raises an error if it detects the "stale data" situation.
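
For comparison, a minimal disconnected-recordset sketch (again, names assumed) looks like this:

```vb
Dim cn As ADODB.Connection
Dim rs As ADODB.Recordset

Set cn = New ADODB.Connection
cn.Open ConnectString
Set rs = New ADODB.Recordset
rs.CursorLocation = adUseClient      'required for disconnecting
rs.Open "SELECT ID, sValue FROM Table1", cn, _
   adOpenStatic, adLockBatchOptimistic
Set rs.ActiveConnection = Nothing    'disconnect from the database
cn.Close

'...the client edits rs while disconnected...

cn.Open ConnectString
Set rs.ActiveConnection = cn
rs.UpdateBatch                       'raises an error on stale data
```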

The question is, do the advantages of ADO recordsets outweigh their presumed performance disadvantages?

The performance test set-up
In order to test the various performance questions, I created a component (an ActiveX EXE) that fetches data from an Access database using ADO. The table that I fetch from has an AutoNumber column, two 32-character string columns, and two long number columns. The various tests were done on the table when it contained 100, 1,000, 10,000, and 20,000 rows.

The component has various methods on its interface, including ones that fetch the data using each of the two techniques and ones that update a single row or all rows through each technique.


The first performance question has to do with fetching data. How long does it take to fetch and display a recordset using each of the two techniques? I broke the question down into two parts: 1) How long does it take to fetch the data; and 2) How long does it take to navigate through the result set in order to display it? To test the latter, I load a column from each row into a standard ListBox (see Figure 1a).

The second performance question has to do with updating data through either the VB variant or the recordset. What if the application usually has to update only a small number of rows? What if it usually has to update the entire recordset?

The tests were implemented in the following environment: compiled VB 5.0, Access 97, Windows NT Workstation 4.0 with SP3, and ADO 2.0.

The performance results
Table 1 shows the performance results for fetching data, then navigating and displaying using both techniques. As you can see, as the number of rows in the result set gets larger, the ADO Recordset actually provides better fetch performance, but the Variant technique always shows better navigation performance.

Table 1. Performance results for fetching data, then navigating and displaying using both techniques.
Rows in table   Fetch (secs)            Navigate/display (secs)
                Variant    Recordset    Variant    Recordset
100             0.320      0.490        0.030      0.060
1,000           0.701      0.680        0.331      0.570
10,000          6.179      4.066        4.567      6.510
20,000          15.913     7.401        13.198     18.928


Table 2 shows the performance results for updating data. The Variant technique shows generally better performance.

Table 2. Performance results for updating data.
Rows in table   Update single row (secs)   Update all rows (secs)
                Variant    Recordset       Variant    Recordset
100             0.130      0.481           1.643      2.143
1,000           0.121      0.241           14.651     20.540
10,000          0.120      1.161           143.236    210.022


Conclusion
As is often the case, there are tradeoffs to consider as you make a decision on which technique to use. The answer is going to depend on the requirements of the particular application you're developing. The Variant technique provides generally better performance (why would you be fetching 10,000 rows back to the client side, anyway?), but the ADO Recordset technique provides some functionality that's absent if you're using the Variant technique. It's good to have both tools in your toolbox so that you can pick the appropriate one when you need it.