The task of a database designer is to structure the data in a way that eliminates unnecessary duplication and provides a rapid search path to all necessary information. The process of refining tables, keys, columns, and relationships to create an efficient database is called normalization. Normalizing is not just for relational files: it's also a common design activity for indexed files.
Normalization is a complex process with many specific rules and different levels intensity. In its full definition, normalization is the process of discarding repeating groups, minimizing redundancy, eliminating composite keys for partial dependency, and separating nonkey attributes. In simple terms, the rules for normalization can be summed up in a single phrase: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." Said another way, each table should describe only one type of entity (such as a person, place, customer order, or product item).
Some of the benefits of normalization are:
You can normalize most simple databases by following a simple rule of thumb: tables that contain repeated information should be divided into separate tables to eliminate the duplication.
For example, let's say your new application is for a bookseller who must track information about each book, including the following data.
You could simply create a single table with a field for each of the data items listed. Looking closely at the data, however, it's obvious that such a table would contain many redundancies. For example, many authors have written more than one book, so the publisher information for each book title would be repeated many times. If you put all of these fields into a single table there would be many confusing and duplicate entries.
Using the principles of normalization, you might break the data into four groups: Authors, AuthorsTitles, Titles, and Publishers, as shown in the following table.
Authors table | AuthorsTitles table | Titles table | Publishers table |
au_id (key) | au_id (foreign key) | ti_isbn (key) | pu_id (key) |
au_name | ti_isbn (foreign key) | ti_title | pu_name |
au_address | ti_yearpublished | pu_address | |
au_phone | pu_id (foreign key) | pu_phone |
The keys provide a means of establishing table relationships. For example, the AuthorsTitles table creates a one-to-many relationship between the Authors and Titles tables (an author may write many titles). Using the AuthorsTitles table, you can query for every book number an author wrote (using au_id), and also determine which author wrote a certain book (using ti_isbn).
It's worth noting that instead of creating an AuthorsTitles table, an alternative approach would be to add the au_id attribute into the Titles table. This option is neither right nor wrong: it's simply one of those design tradeoffs where you must evaluate your application's expected query types, possible multiuser concurrency issues, and the possible performance problem of three indexes on one table.