Normalizing the Data

The task of a database designer is to structure the data in a way that eliminates unnecessary duplication and provides a rapid search path to all necessary information. The process of refining tables, keys, columns, and relationships to create an efficient database is called normalization. Normalizing is not just for relational files: it's also a common design activity for indexed files.

Normalization is a complex process with many specific rules and different levels intensity. In its full definition, normalization is the process of discarding repeating groups, minimizing redundancy, eliminating composite keys for partial dependency, and separating nonkey attributes. In simple terms, the rules for normalization can be summed up in a single phrase: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." Said another way, each table should describe only one type of entity (such as a person, place, customer order, or product item).

Some of the benefits of normalization are:

Data integrity (because there is no redundant, neglected data).
Optimized queries (because normalized tables produce rapid, efficient joins).
Faster index creation and sorting (because the tables have fewer columns).
Faster UPDATE performance (because there are fewer indexes per table).
Improved concurrency resolution (because table locks will affect less data).

You can normalize most simple databases by following a simple rule of thumb: tables that contain repeated information should be divided into separate tables to eliminate the duplication.

For example, let's say your new application is for a bookseller who must track information about each book, including the following data.

Author name
Author address
Author phone
Title
ISBN (International Scientific Book Number)
Year published
Publisher name
Publisher address
Publisher phone

You could simply create a single table with a field for each of the data items listed. Looking closely at the data, however, it's obvious that such a table would contain many redundancies. For example, many authors have written more than one book, so the publisher information for each book title would be repeated many times. If you put all of these fields into a single table there would be many confusing and duplicate entries.

Using the principles of normalization, you might break the data into four groups: Authors, AuthorsTitles, Titles, and Publishers, as shown in the following table.

Authors table	AuthorsTitles table	Titles table	Publishers table
au_id (key)	au_id (foreign key)	ti_isbn (key)	pu_id (key)
au_name	ti_isbn (foreign key)	ti_title	pu_name
au_address		ti_yearpublished	pu_address
au_phone		pu_id (foreign key)	pu_phone

The keys provide a means of establishing table relationships. For example, the AuthorsTitles table creates a one-to-many relationship between the Authors and Titles tables (an author may write many titles). Using the AuthorsTitles table, you can query for every book number an author wrote (using au_id), and also determine which author wrote a certain book (using ti_isbn).

It's worth noting that instead of creating an AuthorsTitles table, an alternative approach would be to add the au_id attribute into the Titles table. This option is neither right nor wrong: it's simply one of those design tradeoffs where you must evaluate your application's expected query types, possible multiuser concurrency issues, and the possible performance problem of three indexes on one table.