David Willson
Microsoft Developer Network
March 1999
Summary: Discusses the process and benefits of normalizing the Duwamish Books sample database for use with a Web application. (6 printed pages)
Moving Duwamish Books to the Web in Phase 4 required us to reexamine our database. Many business systems begin as spreadsheets or, as in the case of Duwamish Books, simple databases. When there are relatively few records or infrequent use, these systems are easy to maintain. However, when the speed of business picks up, maintenance and use of the data become increasingly difficult. Normalization is a good way to simplify data entities.
In the case of Duwamish Books, the basic database structure has been inherited from Phase 1, when it was a simple Microsoft Access database meant to support a single mom-and-pop operation. With the business moving to the Web in Phase 4, we realized that the database needed to be normalized.
Before entering into a discussion of normalization, let's clarify a few concepts and terms:
One of the primary goals of normalizing a database is to organize data entities into simpler, more stable data structures. To do this, identify the columns that describe the entity and move the columns that do not uniquely identify the entity to a new table.
For example, consider a table named Books that has the following six columns: PKId (a primary key or unique identifier), Title, AuthorName, CopiesAtStore1, CopiesAtStore2, and CopiesAtStore3. Only the Title and AuthorName columns describe the key (a book). These two columns are the core components of a book entity. The last three columns describe inventory and stores. Ideally, two new lookup tables should be created to store these three columns—a Stores lookup table to identify stores and an Inventory lookup table to track copies. It would then be necessary to create two foreign key data relationships in the new Inventory lookup table—one to the primary key of the Stores lookup table and the other to the primary key of the normalized Books table.
Normalization also involves reducing anomalies or duplications by decomposing a table into multiple tables. If it is necessary to maintain consistency in the database, decomposing the offending table into two or more tables provides a means to enforce this consistency.
Consider the preceding example. The Books table now consists of three columns—PKId, Title, and AuthorName. If we were looking for a particular book, we might want to produce a resultset of books written by a particular author. If there are variations in the spelling of the author's name, we will not get the complete resultset that we expect. The following recordset selection of books authored by John Steinbeck indicates this problem. Notice how spelling variations could cause a problem when searching for all book titles of a specific author.
PKId AuthorName Title
-------- ---------------------- ----------------------
55 Steinbeck East of Eden
56 Steinbeck, John Cannery Row
57 J. Steinbeck Grapes of Wrath
58 John Steinbeck Of Mice and Men
It could be time-consuming to fix data problems such as this. We can prevent this problem by decomposing the Books table even further. If we create an Authors table, we can maintain the spellings of the author names separate from the remaining columns of the Books table and enforce consistency in the spelling of the author's name. A foreign key must be appended in the Books table to refer to the appropriate record in the Authors table. See this Recordset selection of the normalized Books table:
PKId AuthorId Title
-------- -------- ----------------------
55 14 East of Eden
56 14 Cannery Row
57 14 Grapes of Wrath
58 14 Of Mice and Men
59 14 Tortilla Flat
Here is what the Recordset selection of the Authors lookup looks like:
PKId AuthorName
-------- ----------------------
14 John Steinbeck
Normalization can be taken to an extreme. In systems where it is necessary to conserve as much space as possible, duplicate data should be avoided. So, in the preceding example, our database may contain two different books with the same title. We could decompose the Title column into its own table, Titles, allowing the database to hold that title just once. Of course, a foreign key must be installed in the Books table to refer to the appropriate record in the Titles table.
Normalizing a database is not a trivial matter. The more normalized the database becomes, the more cumbersome and slow the joins are when we query the database. Therefore, we must be very careful when we index columns. You must define foreign key dependencies. To insure that all tables are modified correctly, Insert, Update, and Delete procedures must incorporate transaction processing. Therefore, it is necessary to balance the gain of simplified maintainable structures against possible complicated and unintuitive procedure code.
Phase 3 of Duwamish Books is a legacy system. In addition to enhancing the system for Phase 4, we reviewed and normalized the entity relationships in the database model.
Our goal in normalizing the Duwamish Books database was to simplify the database and to provide tables that would be easier to maintain with larger data volumes and increased activity. In the course of normalizing, we decomposed many tables, as the following sections describe.
Five tables in the Phase 3 database model (Contacts, Customers, Employees, Orders and Stores) were each overloaded with a repeating set of address information: Address1, Address2, City, State, ZipCode, and PhoneNumber.
To simplify the columns in these five tables, and to centralize the address information in the database for easier maintenance, we removed the address columns from each of the five affected tables and incorporated them into the new Address table. For each of these five tables, the address columns were replaced by a foreign key to the primary key in the Address table.
In reviewing the relationship between an author and an item, we determined a need for a many-to-many relationship because one author may be associated with many books or a single book may have many authors. In the Phase 3 model, each item was allowed only one author. To create a many-to-many relationship a joining table, ItemAuthor, was created. The joining table contains two foreign keys—one foreign key to the primary key in the Items table, the other to the primary key in the Authors table.
In the Phase 3 model of the Duwamish database the Orders and Sales tables were maintained as separate entities. Both tables host a common set of columns, which define the sale or order's uniqueness. Also, both tables are used to describe the movement of inventory with respect to a specific customer. For these reasons we consolidated all the duplicate columns from both tables into one Orders table, effectively making a sale a type of an order.
Once we consolidated the Sales and Orders tables, it became obvious that the SaleDetails and OrderDetails tables, with identical column names and functionality, also should be consolidated. We named this consolidated table TransactionDetails.
To simplify the Orders table, we decomposed the sales-specific columns, such as PaymentType and CreditCardNumber to another table, which we then called Sales. The new Sales table has a one-to-one relationship with the Orders table and a foreign key reference to the primary key of the Orders table.
The Items table can contain more than just books. Someday our store may sell other items, stationery for instance. Of course, stationery does not have a Publisher, PublicationYear, or an ISBN, so we have isolated these columns from the Items table to a new table called Books. The remaining columns of the normalized Items table are attributes of all items, whether they are books or stationery.
The Phase 3 installation script contained instructions to build tables that we will not be using (in some cases, not even related to Duwamish), redundant indexes, and numerous default objects. To prepare the database for optimization, we needed to review the installation script and to identify what needs to remain in the database and what should be maintained elsewhere.
We used the Upsizing Wizard to migrate the Phase 2 Access database to SQL Server, creating the installation script for the Phase 3 system. However, the script the wizard created was not optimal. It installs triggers instead of foreign key constraints. Also, numerous objects of type default were created in the installation script rather than explicitly declaring a default in the table declaration. Plus, there were redundant indexes imposed on the system.
The Upsizing Wizard doesn't know which entities are not part of the system. If there are any test tables or procedures present before the wizard is started, they will be incorporated into the script.
Both the OrderType and the OrderStatus tables were static in the Phase 3 database—this data never changed. By moving the validation of these settings to the Business Logic Layer (BLL) component of the Duwamish system, we were able to eliminate two of the foreign key constraints on the Orders table.
Because tax computation is a uniquely complicated operation beyond the scope of our system, we elected to remove the TaxRate table from the database. The accompanying foreign key constraint in the Items table referencing the primary key in the TaxRate table was also removed.
Instead of storing tax computation information, we use a tool that simulates the calculation of tax based on price, product type, and the municipalities of both the customer and the recipient. The result of the tax calculation is stored in the Sales table at the time of a sale.
From time to time, disconnected Recordset objects are generated by the BLL, which can be edited by an AddNew or Update method in a client program. When we create a disconnected ADO recordset, all columns in the selection clause must be defined in one of the sysobjects tables on the server. (In the case of a temporary table selection, the sysobjects table is in the tempdb database.) A column becomes defined in the system tables when the table that contains a column in the selection has been defined by a CREATE TABLE statement. Recordsets containing unidentified column data types will yield the expected output, but a type-mismatch error will be generated when the AddNew or Update method is called on that recordset.
Some anticipated workarounds for this might be to alias the column selection as follows:
CONVERT(INT,0) AS 'Quantity'
Or:
COUNT(*) AS 'MyCount'
However, it is not necessary to install a permanent table in the system for this reason. A reasonable solution to this problem is to create a temporary table prior to making the column selections. Create a join to the temporary table during the selection, and then drop the table after the selection.
The ThisOrderTrans table was created for this reason. Once we identified the procedures making selections of empty columns and replaced references to ThisOrderTrans with a temporary table, we removed the ThisOrderTrans table from the database.
To accommodate name and address styles that vary from one locale to the next, we converted instances of the FirstName and LastName columns into a single Name column. The Address1, Address2, City, State, and ZipCode address structure works well for U.S. addresses but not for all foreign addresses.
Normalizing the Duwamish database helped us create a relational database that more accurately reflects the true business relationship of the data entities. We were able to do this by isolating the columns that most uniquely define an instance of that entity, moving other columns to one or more lookup tables, and applying the proper data relationships.
We removed tables that do not need to be referenced by other objects in the database. We removed unnecessary triggers and redundant indexes that were being applied by the installation script. Elimination of unnecessary objects and simplification of the normalized tables help to make the database easier to understand and less time-consuming to maintain.
Comments? We welcome your feedback at duwamish@microsoft.com.