Database Normalization Requirements for English Query Applications

Microsoft English Query applications work best with normalized databases. In general, it is easiest to create English Query applications against normalized databases. In addition, the resulting applications are more flexible and powerful than those developed against nonnormalized databases.

This topic describes normalization rules as they pertain to English Query applications. It describes problematic database structures that break these rules and how to solve these problems by creating views in Microsoft® SQL Server™, which can be used in English Query just like any other table.

Rule 1: There should be a one-to-one relationship between the instances of an entity and the rows of the table.

For every table that represents an entity, each and every row in that table should represent one and only one instance of that entity. Conversely, each and every instance of that entity should be represented by one and only one row in the table.

Here is a situation in which this rule is not met:

Table:    Employees
Fields:    Emp_id, Emp_name, Status, Position, Salary
Keys:    Emp_id, Status

This table stores information about employees. It contains their names, positions, and salaries. But sometimes employees move around from position to position, and when they do, their salaries change. So for some employees, this table also stores information about what their position and salary are projected to be. If the value of the field Status is “C,” then the row contains the current information for the employee. If the value is “P,” then it contains the projected information. So, an individual employee may appear twice in this table.

Because an employee can appear twice, you cannot use this table to represent the entity employees. If you were to associate the employees entity with this table, even simple requests, such as “Count the employees,” would give the wrong answer. The solution to this problem is to create a view in the database that contains a single row for each employee and to tell English Query about this view.

Here is what the view would look like:

create view Emps as select Emp_id, Emp_name, Position, Salary from employees where status = 'C'

You now have a view that contains exactly one row per employee. The entity employees can now be represented by this view.

Rule 2: A field should have the same meaning in each row of the table.

Let’s look at the Employees table again:

Table:    Employees
Fields:    Emp_id, Emp_name, Status, Position, Salary
Keys:    Emp_id, Status

This table violates the second rule as well. Remember, the fields Position and Salary contain current information if the value of the field Status is “C” and projected information if the value of Status is “P.” The field Position really represents two distinct entities: current position and projected position, depending on the value of Status. This is true for the field Salary as well.

Notice that you have already extracted the current salary and current position when you created the view Emps. Next, you do the same for the projected salary and projected position:

create view Projected_info as select Emp_id, Position Projected_position, Salary, Projected_salary from employees where status = 'P'

Now, you can tell English Query that the entity projected position is represented by the Projected_position field in the view Projected_info, and similarly, Projected_salary.

Rule 3: Each table should represent at most one entity.

Often, if an attribute means one thing for one subset of the rows in the table and something else for some other subset, those subsets of rows actually represent two different (but related) entities. For example, consider a table that represents various productions, either movies or TV programs:

Table:    Productions
Fields:    Prod_id, Title, Type, Show_date, Mpaa_rating, Network
Keys:    Prod_id

If the value of the field Type is “M,” then the production is a motion picture. If the value is “T,” then it is a television show. For motion pictures, the field Show_date contains the date it was released; for television shows, it contains the date the show was broadcast. The field Mpaa_rating, is the rating of a movie (G, PG, PG-13, and so forth.), but is meaningless for television shows. Likewise, the field Network is the network a television show appeared on, but this field is meaningless for motion pictures.

This structure makes authoring an English Query application difficult in several ways. First, television shows and motion pictures are distinct entities. You can tell this by the fact that they participate in different relationships. Second, the field Show_date has a different interpretation for a movie versus a TV show. Finally, the fields Mpaa_rating and Network are meaningless for certain rows in the table Productions.

To define TV shows and movies as distinct entities, you would need to create views in SQL Server:

create view Tv_shows as

select Prod_id, Title, Show_date, Broadcast_date, Network

from Productions

where type = 'T'

create view Movies as

select Prod_id, Title, Show_date, Release_date, Mpaa_rating

from Productions

where type = 'M'

You could then create the entity television show, which is represented by the view Tv_shows, and the entity movie, which is represented by the view Movies.

Note that a superficially similar table would not have the same problem. For example, consider a table of employees that contains a Name, a Salary, and a Type field. Type indicates what type of job that person has (for example, engineer, salesperson, receptionist). Because engineers, salespeople, and receptionists have the same kind of information (names and salaries), there is no need to create a view for each of these job types. Instead, you use the Type field to define subsets of people. Thus, a user could ask, "How many engineers are there?" and get the right answer.

However, if you have the table Productions use the field Type to define movies and TV shows as subsets of productions, instead of creating a view for each, then users could ask nonsensical questions, such as “What is The X-Files rated?” and would get incorrect answers to seemingly valid questions like “When was Star Wars broadcast?”

Rule 4: Multiple instances of an entity should be represented by multiple rows in a table.

When there is a many-to-one relationship between two (or more) entities, it should be represented in the database as a many-to-one join between two (or more) tables. There are two ways this rule is typically violated: by splitting data among multiple columns and by splitting data among multiple tables. The solutions to both problems involve creating union views.

Rule 4a: Multiple instances of an entity should not be represented as multiple columns.

Consider the following database of people and their pets. The designer of this database didn’t anticipate anyone owning more than three pets, and hard-coded three pet IDs in the People table:

Table:    People
Fields:    Person_id, Person_name, Pet_id1, Pet_id2, Pet_id3
Keys:    Person_id

Table:    Pets
Fields:    Pet_id, Pet_name
Keys:    Pet_id

You can create the entity person, which is represented by the table People, and you can create the entity pets, represented by the table Pets. Unfortunately, when you try to create a relationship between people and their pets, you must specify a join path between the People table and the Pets table. You could arbitrarily choose one of the ID fields to join on, but this would (in the end) force the creation of three relationships that would force users to ask questions in awkward ways (for example, “Show the people and their first pets and second pets and third pets”).

Because there is nothing that distinguishes these three pet IDs from one another, a better database design would have a many-to-many join table containing all pet IDs in a single column. To achieve this, create the following union view:

create view Pet_owners as

select Person_id, Pet_id1 as Pet_id

from People

where Pet_id1 is not null

union

select Person_id, Pet_id2 as Pet_id

from People

where Pet_id2 is not null

union

select Person_id, Pet_id3 as Pet_id

from People

where Pet_id3 is not null

Because this view contains a direct relationship between pets and their owners, it can now be used as a join table.

Rule 4b: Multiple instances of an entity should not be represented as multiple tables.

In this financial database, the high-volume budget table is divided into yearly archive tables to avoid having a single table become too large to manage:

Table:    Branches
Fields:    Branch_id, Branch_name
Keys:    Branch_id

Table:    Budgets_1995
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

Table:    Budgets_1996
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

Table:    Budgets_1997
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

To create the entity budget and the entity actual_expense, use a union view to create a single database object to represent each one:

create view Budgets as

select Branch_id, 1995 as year, Budget, Actual

from Budgets_1995

union

select Branch_id, 1996 as year, Budget, Actual

from Budgets_1996

union

select Branch_id, 1997 as year, Budget, Actual

from Budgets_1997

The fields Budget and Actual in the view Budgets can now represent the entities budget and actual_expense.

Rule 5: Joins should be based only on primary- and foreign-key equality.

Sometimes, a relationship between two entities is represented in the database with a nonstandard join. Because English Query only recognizes joins based on equality of primary and foreign keys, any nonstandard join must be translated, by using a view, into a standard join.

For example, this database contains information about people’s heights, as well as a description of the “ideal weight” for people who fall into a certain height range:

Table:    People
Fields:    Person_id, Height, Weight
Keys:    Person_id

Table:    Ideal_weights
Fields:    Min_height, Max_height, Ideal_weight
Keys:    Min_height

Because “ideal weight” depends on height, it is easy to learn the ideal weight for any given person. However, there is no way for English Query to make the connection between the table People and the table Ideal_weights because there is no primary- and foreign-key join between the two tables. You must make this join explicit, by creating a view that contains the ideal weight of each person.

create view Ideal_weights as

select Person_id, Ideal_weight

from People, Ideal_weights

where Height >= Min_height

and Height <= Max_height

This view can then be used as the join table for the relationship between the entity person and the entity ideal weight (which is represented by the field Ideal_weight).

For more information about designing normalized databases, see one of the books on the subject of database design, including: Introduction to Database Systems by C. J. Date, and Database Processing: Fundamentals, Design and Implementation by David Kroenke.