Normalization

The logical design of the database, including the tables and the relationships between them, is the core of an optimized relational database. A good logical database design can lay the foundation for optimal database and application performance. A poor logical database design can impair the performance of the entire system.

Normalizing a logical database design involves using formal methods to separate the data into multiple, related tables. A greater number of narrow tables (with fewer columns) is characteristic of a normalized database. A few wide tables (with more columns) is characteristic of an unnormalized database.

Reasonable normalization often improves performance. When useful indexes are available, the Microsoft® SQL Server™ query optimizer is good at selecting fast, efficient joins between tables.

Some of the benefits of normalization include:

Faster sorting and index creation.
A larger number of clustered indexes. For more information, see Clustered Indexes.
Narrower and more compact indexes.
Fewer indexes per table, which improves the performance of INSERT, UPDATE, and DELETE statements.
Fewer NULL values and less opportunity for inconsistency, which increase database compactness.

As normalization increases, so do the number and complexity of joins required to retrieve data. Too many complex relational joins between too many tables can hinder performance. In a reasonably normalized database, few regularly executed queries should need joins involving more than four tables.

Sometimes the logical database design is already fixed and total redesign is not feasible. Even then, however, it might be possible to normalize a large table selectively into several smaller tables. If the database is accessed through stored procedures, this schema change could take place without affecting applications. If not, it might be possible to create a view that hides the schema change from the applications.
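
The view technique described above can be sketched as follows. This is a minimal illustration, not a SQL Server implementation: it uses Python's built-in sqlite3 module as a stand-in engine, and the customer, customer_core, and customer_address names are hypothetical. A formerly wide table is split in two, and a view with the old name and column list lets an existing query run unchanged.

```python
import sqlite3

# SQLite stands in for SQL Server here; all table, column, and view
# names are hypothetical and chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The narrow tables that replace a hypothetical wide 'customer' table.
    CREATE TABLE customer_core (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_address (
        customer_id INTEGER PRIMARY KEY REFERENCES customer_core(customer_id),
        city        TEXT NOT NULL
    );

    -- The view keeps the old table's name and column list.
    CREATE VIEW customer AS
        SELECT c.customer_id, c.name, a.city
        FROM customer_core AS c
        LEFT JOIN customer_address AS a ON a.customer_id = c.customer_id;

    INSERT INTO customer_core VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO customer_address VALUES (1, 'London');
""")

# An existing application query, unchanged by the split.
rows = conn.execute(
    "SELECT customer_id, name, city FROM customer ORDER BY customer_id"
).fetchall()
print(rows)  # [(1, 'Ada', 'London'), (2, 'Grace', None)]
```

The sketch covers reads only. Writes through a view are subject to additional restrictions that vary by database engine, so applications that modify the data may still need changes.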

In relational database design theory, normalization rules identify certain attributes that must be present or absent in a well-designed database. While a complete discussion of normalization rules goes well beyond the scope of this topic, there are a few rules that can help you achieve a sound database design:

A table should have an identifier.
The fundamental rule of database design theory is that each table should have a unique row identifier, a column or set of columns that can be used to distinguish any single record from every other record in the table. Each table should have an ID column, and no two records can share the same ID value. The column or columns serving as the unique row identifier for a table are the primary key of the table.
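
As a minimal sketch of this rule, the following declares a primary key and shows the engine rejecting a second row with the same identifier. It uses Python's sqlite3 module as a stand-in for SQL Server. The pub_id and pub_name columns follow the pubs naming, but the table is simplified and the sample values are invented.

```python
import sqlite3

# SQLite stands in for SQL Server; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE publishers (
        pub_id   TEXT PRIMARY KEY,   -- the unique row identifier
        pub_name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO publishers VALUES ('0001', 'Example Press')")

# A second row with the same pub_id violates the primary key.
try:
    conn.execute("INSERT INTO publishers VALUES ('0001', 'Another Press')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM publishers").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```
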
A table should store only data for a single type of entity.
Attempting to store too much information in a table can prevent the efficient and reliable management of the table’s data. In the pubs database, for example, the titles and publishers information is stored in two separate tables. While it is possible to have columns for both the book and its publisher’s information in the titles table, this design leads to several problems. The publisher information must be added and stored redundantly for each book that the publisher publishes, which uses extra storage space in the database. If the address for the publisher changes, the change must be made for each book. When the last book for a publisher is removed from the titles table, the information for that publisher is lost.
The pubs database stores the information for books and publishers in the titles and publishers tables. The publisher information must be entered only once and linked to each book. When the publisher information is changed, it must be changed in only one place, and the publisher information will be there even if the publisher has no books in the database.
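
The two-table design can be sketched as follows, again with Python's sqlite3 module standing in for SQL Server. The tables are reduced to a few columns and the rows are invented sample data, so this is an illustration of the idea rather than the actual pubs schema.

```python
import sqlite3

# SQLite stands in for SQL Server; columns are a reduced, illustrative
# subset and the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publishers (
        pub_id   TEXT PRIMARY KEY,
        pub_name TEXT NOT NULL,
        city     TEXT
    );
    CREATE TABLE titles (
        title_id TEXT PRIMARY KEY,
        title    TEXT NOT NULL,
        pub_id   TEXT REFERENCES publishers(pub_id)
    );
    INSERT INTO publishers VALUES ('P1', 'Example Press', 'Boston');
    INSERT INTO titles VALUES ('T1', 'First Book', 'P1'),
                              ('T2', 'Second Book', 'P1');
""")

# A publisher change touches one row, however many books reference it.
cur = conn.execute("UPDATE publishers SET city = 'Chicago' WHERE pub_id = 'P1'")
rows_updated = cur.rowcount

# Every book sees the new value through the join.
cities = conn.execute("""
    SELECT t.title, p.city
    FROM titles AS t
    JOIN publishers AS p ON p.pub_id = t.pub_id
    ORDER BY t.title_id
""").fetchall()

# Removing every book leaves the publisher row in place.
conn.execute("DELETE FROM titles")
publishers_left = conn.execute("SELECT COUNT(*) FROM publishers").fetchone()[0]

print(rows_updated)     # 1
print(cities)           # [('First Book', 'Chicago'), ('Second Book', 'Chicago')]
print(publishers_left)  # 1
```
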
A table should avoid nullable columns.
Tables can have columns defined to allow null values. A null value indicates that there is no value. While it can be useful to allow null values in isolated cases, it is best to use them sparingly because they require special handling that increases the complexity of data operations. If you have a table with several nullable columns and several of the rows have null values in the columns, you should consider placing these columns in another table linked to the primary table. Storing the data in two separate tables allows the primary table to be simple in design but able to accommodate the occasional need for storing this information.
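
One way to apply this advice is sketched below, using Python's sqlite3 module as a stand-in for SQL Server. The employee and employee_parking tables and their columns are hypothetical. The rarely supplied values move to a second table that holds a row only when there is something to store, so neither table needs a nullable column.

```python
import sqlite3

# SQLite stands in for SQL Server; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    -- Rarely supplied details; a row exists only when there is data.
    CREATE TABLE employee_parking (
        emp_id        INTEGER PRIMARY KEY REFERENCES employee(emp_id),
        space_number  TEXT NOT NULL,
        license_plate TEXT NOT NULL
    );
    INSERT INTO employee VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO employee_parking VALUES (2, 'B-14', 'XYZ 123');
""")

# An inner join returns only the employees that have the optional data.
with_parking = conn.execute("""
    SELECT e.name, p.space_number
    FROM employee AS e
    JOIN employee_parking AS p ON p.emp_id = e.emp_id
""").fetchall()

# An outer join returns every employee; missing details appear as NULL
# in the result set only, not in storage.
all_employees = conn.execute("""
    SELECT e.name, p.space_number
    FROM employee AS e
    LEFT JOIN employee_parking AS p ON p.emp_id = e.emp_id
    ORDER BY e.emp_id
""").fetchall()

print(with_parking)   # [('Grace', 'B-14')]
print(all_employees)  # [('Ada', None), ('Grace', 'B-14'), ('Edsger', None)]
```
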
A table should not have repeating values or columns.
The table for an item in the database should not contain a list of values for a specific piece of information. For example, a book in the pubs database could be coauthored. If there is a column in the titles table for the name of the author, this presents a problem. One solution is to store the names of both authors in the column, but this makes it difficult to show a list of the individual authors. Another solution is to change the structure of the table to add another column for the name of the second author, but this accommodates only two authors. Yet another column must be added if a book has three authors.
If you find that you need to store a list of values in a single column, or if you have multiple columns for a single piece of data (au_lname1, au_lname2, and so on), you should consider placing the duplicated data in another table with a link back to the primary table. The pubs database has a table for book information and another table that stores just the ID values for the books and the IDs of the books’ authors. This design allows any number of authors for a book without modifying the definition of the table and allocates no unused storage space for books with a single author.
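
The linking-table design can be sketched as follows, with Python's sqlite3 module standing in for SQL Server. The table and column names follow the pubs naming (titles, authors, titleauthor, au_id, title_id, au_lname), but the tables are reduced to the columns needed here and the rows are invented sample data.

```python
import sqlite3

# SQLite stands in for SQL Server; tables are reduced to a few columns
# and the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles  (title_id TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE authors (au_id TEXT PRIMARY KEY, au_lname TEXT NOT NULL);

    -- One row per (author, book) pair; no au_lname1, au_lname2 columns.
    CREATE TABLE titleauthor (
        au_id    TEXT NOT NULL REFERENCES authors(au_id),
        title_id TEXT NOT NULL REFERENCES titles(title_id),
        PRIMARY KEY (au_id, title_id)
    );

    INSERT INTO titles VALUES ('T1', 'Solo Book'), ('T2', 'Shared Book');
    INSERT INTO authors VALUES ('A1', 'Ringer'), ('A2', 'Green'), ('A3', 'Carson');
    INSERT INTO titleauthor VALUES ('A1', 'T1'),
                                   ('A1', 'T2'), ('A2', 'T2'), ('A3', 'T2');
""")

def authors_of(title_id):
    """Return the last names of a book's authors, sorted alphabetically."""
    rows = conn.execute("""
        SELECT a.au_lname
        FROM titleauthor AS ta
        JOIN authors AS a ON a.au_id = ta.au_id
        WHERE ta.title_id = ?
        ORDER BY a.au_lname
    """, (title_id,)).fetchall()
    return [r[0] for r in rows]

print(authors_of('T1'))  # ['Ringer']
print(authors_of('T2'))  # ['Carson', 'Green', 'Ringer']
```

Adding a fourth author to a book is a single INSERT into titleauthor; no table definition changes.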