Normalization

Normalization is the process of eliminating duplication of data from within your database. Duplicate data can cause your database to grow too large. More important, having the same data in two places can create significant problems when the data get out of synch with one another. This is known as database corruption, and is not only difficult to prevent, but is even more difficult to correct.

If the customer table lists customer John Doe as living at 1 Main Street and this information is duplicated in the club membership table, your database will be larger than it could be. Furthermore, if you update his address in the customer table, but fail to update his record in the club membership table, then you have two different addresses for the same customer.

Normalization aims to reduce this duplication. Database theorists have defined several normal forms for relational databases. These are a series of increasingly restrictive constraints on the layout of the database tables.

For example, the first normal form, which is the least restrictive, requires that each attribute only store one value (as opposed to a list of values). The second normal form builds on first normal form and adds the constraint that each row in the database be uniquely identifiable. This is typically accomplished by using an identity column — a unique ID for each row. If you have no other value on which to index you may make the identity the primary key, but these two concepts are quite different. The identity ensures that each field is unique; the primary key is the principal sorting mechanism.

The third normal form, called 3NF, is the most popular compromise between too little normalization and too much. In 3NF, you essentially remove all redundant information in your tables.

Denormalization

There are normal forms beyond the third one, but they put such difficult constraints on the database that few developers go beyond this point. In fact, even the third normal form can cause significant performance penalties. For example, your customer's name should not appear both in the customer table and in the club table. This is duplication of non-key data and a violation of third normal form. What you are supposed to do is to get the customer's record from the club table, look up his ID, and then look him up in the customer table. This is a lot of overhead, however, and you almost always need the customer's name when you are looking up his record. Many developers will intentionally violate this rule — they will deliberately denormalize their tables to recover the performance penalty exacted by rigorous normalization.

DBAs typically start by designing their tables to a given normal form (say, the third normal form) and then they will begin a denormalization process in order to improve the performance of the system. It is not unusual to start with a completely normalized database and then to denormalize extensively when the system runs too slowly. The database that eventually emerges is denormalized where necessary for performance reasons. This has proven to be a better methodology than starting with a denormalized database and adding normalization, or, worse, ignoring the whole issue altogether.