Shridhar Balasubramaniam
As columnist Andrew Zanevsky observes in his comments on this article, "characteristic functions were a clever way of working around T-SQL deficiencies in old versions of SQL Server"the CASE function and cheap storage lessen the appeal of "bitwhacking." Nevertheless, Shridhar Balasubramaniams "real-world" story of working with mainframe data makes for very interesting reading.
I was recently involved with a loan-servicing application for a mortgage firm. The data resided in a Microsoft SQL Server 6.5 database, and a Windows-based front-end client application was used for updating loan information. On a daily basis, the changed customer information in SQL Server had to be uploaded into another application, Mortgage Service Application (MSA), which was a 3270 Net Soft Emulation Program. The information on MSA needed to be updated so that other applications accessing the MSA would look at the most recent data.
Since there were no direct links between these two systems, we decided to produce a daily file out of SQL Server that showed the changed values for each loan. Views were used to produce the result set. Using bcp, the output of this view was written to a flat file, which was uploaded into MSA.
Loan information existed in three mutually exclusive tables: LOAN, JUNIOR_LIENS, and SENIOR_LIENS. These tables had some common fields such as customer details and other fields that were specific to the type of loan. We needed to monitor 25 columns spread across the three tables. Our first approach was to define a table with loan number as the key and flags that represent the individual columns of the tables. Whenever any column value in the table was changed, the relevant flag would be updated. However, if new columns ever needed to be monitored, or (far more unlikely) we ever discarded monitoring certain columns, wed have to alter the table.
In view of this, we decided to use bitmapped columns. First, we created a lookup table that defined values for changes made to columns in all three tables. Next, we created a table (LOAN_CHANGES) that had two columns: loan number and a bitmapped column that represents all changes made to the loan. From a design standpoint, this allowed more flexibility, as we could now add or drop columns by creating or modifying entries in the lookup table.
In both methods, triggers and views (or stored procedures) had to be changed to allow for the adding or dropping of columns. The LOAN_CHANGES table, however, didnt need to be modified with the bitmapped method.
In this article, Ill describe our three-stage approach: data definition, data capture, and data retrieval. Although the real application has three tables and 25 columns, Ill restrict my discussion to just one table (LOAN) and six columns in this table. The data capture and data retrieval sections illustrate the main steps rather than actual code. Ill also show how to do the final report using encoded characteristic functions.
Data definition
First we define a lookup table for changes associated with each table. The Change_Bit value is stored as a power of 2 (in the range 0-30). Since there are three tables and the columns could overlap, we add the table name to uniquely identify each change. For example, Home1_Phone_Number is applicable to LOAN, JUNIOR_LIENS, and SENIOR_LIENS. Table 1 shows sample values.
Table 1: . Entries in lookup table CHANGE_CODES.
Table_Name |
Column_Name |
Change_Bit |
Active |
LOAN |
Mailing_Address |
1 Power (2,0) |
Y |
LOAN |
Mailing_City_State |
2 Power (2,1) |
Y |
LOAN |
Mailing_Zip |
4 Power (2,2) |
Y |
LOAN |
Business1_Phone_Number |
8 Power (2,3) |
Y |
LOAN |
Business1_Ext |
16 Power (2,4) |
Y |
LOAN |
Home1_Phone_Number |
32 Power (2,5) |
Y |
CREATE TABLE CHANGE_CODES
(Table_Name varchar (30) NOT NULL, -- Primary Key1
Column_Name varchar (30) NOT NULL, -- Primary Key2
Change_Bit int NOT NULL,
Active char (1) NOT NULL)
Next, we define a table that stores all of the changes for a loan. This table is truncated prior to the start of every business day. The Total_Change_Bit column represents the total bit value for all of the changes associated with a given loan.
CREATE TABLE LOAN_CHANGES
(Loan_Number varchar (7) NOT NULL, -- Primary Key
Total_Change_Bit int NOT NULL)
The "LOAN" table stores loan details such as principal_balance, reappraisal_amount, interest_rate, and so forth. However, were only interested in those columns that can be changed.
CREATE TABLE LOAN
(Loan_Number varchar(7) NOT NULL, -- Primary Key
Mailing_Address varchar (40) NULL,
Mailing_City_State varchar (30) NULL,
Mailing_Zip varchar (10) NULL,
Business1_Phone_Number varchar (14) NULL,
Business1_Ext varchar (5) NULL,
Home1_Phone_Number varchar (14) NULL)
Data capture
The customer information is updated via a Windows-based front-end application. The Update Trigger for LOAN captures changes made to the individual columns, one loan at a time. The main steps are shown here:
-- Declare and initialize local variables that
-- store the total and original bit values
-- associated with a loan.
declare @Total_Change_Bit int, @Original_Change_Bit
int
select @Total_Change_Bit = 0, @Original_Change_Bit = 0
-- Get the current total bit value.
select @Original_Change_Bit = IsNull
(lc.Total_Change_Bit,0)
from LOAN_CHANGES lc, inserted i
where lc.Loan_Number = i.Loan_Number
-- If we're checking, for example, Mailing_Address,
-- our script would look like this:
IF UPDATE(Mailing_Address)
select @Total_Change_Bit = IsNull (@Total_Change_Bit
+ cc.Change_Bit ,0) from CHANGE_CODES cc
where cc.Table_Name = "LOAN"
and cc.Column_Name = "Mailing_Address"
and cc.Active = "Y"
and @Original_Change_Bit & cc.Change_Bit = 0
In the preceding script, we store the total bit value that represents all changes associated with the loan. When calculating the Total_Change_Bit for a loan, we add the bit value for the changed column only once. For example, if Mailing_Address is already changed once for a loan, then we dont add this value again. For checking column changes, we use the bitwise operator &. The bitwise operator & (AND) can be used on columns with a datatype of int, smallint, tinyint, or bit. All of the bitwise operators translate integer parameters into binary representations before evaluating them. For example, consider the following:
A = @Original_Change_Bit = 7 (00000111)
B = CHANGE_CODES.Change_Bit = 1 (00000001)
(A & B) = (000000001)
The result is 1 where both A and B are 1. Clearly, we must repeat the preceding steps for all columns, changing the Column_Name each time. After weve obtained the total bit value for all changes associated with a loan, we perform an INSERT or UPDATE on the LOAN_CHANGES table:
IF @Total_Change_Bit != 0
begin
if exists (select * from LOAN_CHANGES lc ,inserted i
where lc.Loan_number = i.Loan_Number)
update LOAN_CHANGES
set lc.Total_Change_Bit =
lc.Total_Change_Bit + @Total_Change_Bit
from LOAN_CHANGES lc ,inserted i
where lc.Loan_number = i.Loan_Number
else
insert LOAN_CHANGES(Loan_Number,Total_Change_Bit)
select i.Loan_Number,@Total_Change_Bit from inserted i
end
Table 2 shows the changed columns for Loan_Numbers and the Total_Change_Bit value.
Table 2: Updates on "LOAN" table and the Total_Change_Bit value.
Loan_Number |
Changed Columns |
Total_Change_Bit |
1000000 |
Mailing_Address, Mailing_City_State, Mailing_Zip |
7 (1+2+4) |
2000000 |
Business1_Phone_Number, Business1_Ext |
24 (8 + 16) |
3000000 |
Home1_Phone_Number |
32 |
Data retrieval
Now that we have data in our tables, lets see how we can arrive at the final report. For the examples mentioned in the previous section, Listing 1 shows the final report (formatting adjusted to save space).
Listing 1. Report table that shows loan_number and values for changed columns.
Loan_Number Mailing_Address Mailing_City_State
Mailing_Zip Business1_Phone_Number Business1_Ext
Home1_Phone_Number
-----------------------------------------------------------------------------------------
1000000 2775 Mesa Verde East,X212 Costa Mesa, CA 92626
Null Null Null
2000000 Null Null Null (714)541-9737 4496 Null
3000000 Null Null Null Null Null (714)437-5420
The report shows loan_number and values in columns that have changed. A Null indicates no column change. To create the report, we had to get the loan info and columns that have changed for these loans.
1. In the code available in the accompanying Download file, the SubString function gets the contents of a string (or column) given a start position and length. The length is the actual size as shown in the data definition section. If the start position is 1, the SubString function returns the entire string. However, if the start position is 0, it returns NULL. Therefore, we should try to get the changes for a loan by defining computed columns (1 or 0) for columns in the "LOAN" table. We defined these computed columns in a view vw_LOAN_changed_columns. We also used the charindex function for string comparison. It returns 1 where theres a match and 0 otherwise. On grouping by loan_number, we can aggregate these columns by adding the individual bits. These columns would result in 1 or 0.
If we execute this view (that is, select * from vw_LOAN_changed_columns), the output will be as shown in Table 3.
Table 3: . Output of the view vw_LOAN_changed_columns. The changed column(s) are 1; unchanged columns are 0.
Loan_Number |
Mailing_Address_Bit |
Mailing_City_State_Bit |
Mailing_Zip_Bit |
Business1_Phone_Number_Bit |
Business1_Ext_Bit |
Home1_Phone_Number_Bit |
1000000 |
1 |
1 |
1 |
0 |
0 |
0 |
2000000 |
0 |
0 |
0 |
1 |
1 |
0 |
3000000 |
0 |
0 |
0 |
0 |
0 |
1 |
2. We know what columns have changed for a loan. Create the final report by referencing the "LOAN" table.
In the final report, we have values only for columns that have changed. Nulls indicate no column changes. In the previous section, I discussed the SubString function. If the start position is 1, the SubString function returns the entire string. However, if the start position is 0, it returns NULL. The start position (1 or 0) is defined in view vw_LOAN_changed_columns.
create view vw_LOAN_report
as
select
l.Loan_Number, "Mailing_Address"
= SubString (l.Mailing_Address,v.Mailing_Address_Bit,
40),"Mailing_City_State" =SubString
(l.Mailing_City_State,v.Mailing_City_State_Bit,30),
"Mailing_Zip" = SubString (l.Mailing_Zip,
v.Mailing_Zip_Bit,10),"Business1_Phone_Number"
= SubString (l.Business1_Phone_Number,
v.Business1_Phone_Number_Bit,14),
"Business1_Ext" = SubString (l.Business1_Ext,
v.Business1_Ext_Bit,5),"Home1_Phone_Number"
= SubString (l.Home1_Phone_Number,
v.Home1_Phone_Number_Bit,14)
from loan l, vw_LOAN_changed_columns v
where l.Loan_Number = v.Loan_Number
The preceding view could also be defined using the matter-of-fact CASE expression. Instead of an expression like this:
SubString (l.Mailing_Address,v.Mailing_Address_Bit,40)
one could use this:
Mailing_Address =
CASE
WHEN v.Mailing_Address_Bit =1
THEN l.Mailing_Address
ELSE Null
END
The CASE expression serves two purposes: improving readability and not having to supply the column length as is required with the SubString function.
3. Alternatively, we can use encoded characteristic functions to form an expression for the start position in the SubString function. Consider the expression 1abs (sign (1- expression)). The two T-SQL functions used are Sign()which returns +1, -1, or 0, depending on the value of the expressionand the abs() function, which returns the absolute value of an expression, returning either a 0 or a positive number. If the inner expression = 0, the expression 1abs (sign (1- expression)) would return 0. The SubString function would return a NULL, since 0 isnt a valid start position. This is what we want, because the value in the column hasnt changed.
If the inner expression = 1, the expression 1abs (sign (1- expression)) would return 1. The SubString function would return the data in the column. We use the charindex function as the inner expression. It returns 1 or 0, depending on whether theres a match or not. Code for this view definition is available in the accompanying Download file.
Conclusion
"Bitwhacking" leads to more efficient space storage and improved processing efficiencies. From a design perspective, bitmap columns provide you with the ability to store many pieces of information in a single column and still use relational operators. In many cases, bitmap columns eliminate the need for an intermediate table to represent the many-to-many relationships. On the downside, bitwhacking isnt exactly "user-friendly," so wherever possible, views and stored procedures must be used to hide the complexity.
The report table could also have been created using subqueries on each column. However, when the number of columns is large and/or the table is long, subqueries arent efficient. In fact, it sometimes gives unresolvable runtime errors. If youre using stored procedures, its best to create intermediate (temporary) tables with bit values that indicate changes. Views can also be used. Subsequently, these temporary tables (or views) could be joined with the main table.
In Sybase System 11, we can use encoded characteristic functions in conjunction with the SubString function to get the values from the actual tables. Microsoft SQL Server 6.x makes life easier by providing a CASE expression.
Final notes
Columnist Andrew Zanevsky was kind enough to review this article (as was Steve Stallings, a former colleague of mine at Cotelligent Group). Here are Andrews comments:
"CASE is, in all respects, better than characteristic functions. It shows the same or even better performance. Its tremendously easier to program and support. Ive been advising all of my clients to use CASE instead of characteristic functions ever since CASE became available. Characteristic functions were a clever way of working around T-SQL deficiencies in old versions of SQL Server. Many other computer languages and tools always had an immediate IF built-in function. For example, dBase/Clipper/FoxPro have IIF(), Oracle PL/SQL supports DECODE, even Excel has an IF() function in its expressions.
"Encrypting logic with SIGN, ABS, and CHARINDEX was done only because T-SQL lacked a more convenient function. Now you have CASEso forget characteristic functions. Theyre usually hard to explain to others and a pain to support. Sybase didnt support CASE until version 11.5.
"Id use the syscolumns table instead of your proprietary CHANGE_CODES. The only difference is that you wouldnt have an equivalent of your Active column. colid values can be used to compute Change_Bit values for each column as power(2,colid). This would simplify your maintenance."
(Note: "Encoded characteristic functions" is a registered trademark of D. Rozenshtein, A. Abramovich, and E. Birger, who published a paper describing it in the now-defunct SQL Forum, 2(2), March-April 1993.)
Download
SHRIDHAR.SQLShridhar Balasubramaniam works in southern California as a senior information systems consultant with Cotelligent Group, formerly American Digital Technologies, Inc. Hes a SQL Server MCP and Certified PowerBuilder Developer Associate with more than five years of experience in client/server systems. Hes been working with SQL Server for almost four years. However, client/server is only his secondary passion. His primary interests are music, sports, and meeting fun-loving people. sbalan@ameriquest.net.