Mapping from Objects to Tables

Each data member must be preserved during the process of transformation from objects to tables. This includes all "simple" attributes (native types, such as

int

or char
), attributes aggregated in sub-objects, and pointers to associated objects. Note that this also includes data members from all of your ancestors in the static object model.

One-to-One Mapping vs Use of Blobs

Mapping each attribute to a database table column can dramatically slow the performance of the system. In this one-to-one mapping model, each record is built by setting the value of the column corresponding to a particular attribute to the value of that attribute. For many attributes, the value must be transformed to match the requirements of the database. For example, a date may be stored differently in the database than it is in your object.

An alternative method is to store the entire object as a binary stream of bytes, known as a blob. Key values are stored as attributes in columns, but the bulk of the data is undifferentiated. The database cannot manipulate or sort on that data, but by using the keys, it can retrieve the blob, which can then be reconstituted in the application.

The decision of which fields to put in the blob is critical. This decision changes as the physical design of the database is modified during the development and maintenance cycles. Indexes can only be put on columns, not on a part of a blob — because the blob is just binary bits to the database. This forces us to decide once and for all time which fields to key on. The decision to add new keys requires a change to the definition of the table as well as a redefinition of the blob — an expensive maintenance decision.

This makes the design more viscous— it is harder to change as requirements shift. For this reason, many developers don't use a blob; they just make columns for their data and accept the trade off in performance. You can "throw money" at performance issues by buying bigger and faster machines, but it is more difficult (and often more expensive) to try to solve maintenance issues with cash.

Object Identifiers

In C++, we use pointers to name one object in the description of another object. This is done, for example, to capture associations. In mapping to database tables, this naming function is provided by object identifiers (OIDs). An OID is typically generated by the persistence framework as each object is saved for the first time to the database.

An OID is typically implemented using a database type, such as a

long

or a char[N]
, where N
is between 6 and 12. OIDs should be unique, but how unique? Arguably, they should be unique across the system, but ought they be unique across time? That is, can deleted OIDs be reused? Further, should they be unique across instances of the program (can two processes use the same OID to refer to different objects?) Should they be unique across all machines on the network? What if the network is the Internet?

An OID that is absolutely unique is very hard to create. In theory, it would be unique across the world and across time. Microsoft offers such OIDs, called Globally Unique Identifiers (GUIDs). These are used to create unique identification for classes, interfaces and any other object. Microsoft GUIDs are 128-bit unique numbers generated by a complex algorithm, which is statistically unlikely to create duplicated numbers any time soon.

This algorithm, however, could end up creating the year 3400 problem. While I'm told you can use Microsoft's GuidGen.exe to create enough unique identifiers to name every atom in the universe, there is a known problem in their design. Each GUID includes 60 bits for the timestamp, representing the count of 100-nanosecond intervals since 00:00:00.00 on October 15, 1582. This clock will rollover in the year 3400. Watch for books on the Y3.4K problem, appearing in your bookstores early in the fourth millenium.

Mapping C++ Data Elements to the RDB

Each data member that is mapped to a column in a database table is represented in C++ as a primitive data type, a user-defined type or a pointer or reference to an object. Each of these must be mapped to a type that is native to the database.

Typically, the RDB will support various integer and floating formats, but they may not be the same as the ones supported by the C++ compiler you are using. The database type may be best implemented using a user-defined class in C++. Database vendor-specific code needs to be written to support these translations. Translators can be implemented by following the Strategy Pattern.

The Strategy (or Policy) Pattern (Design Patterns, Gamma et al.) is used when you may have a number of implementations to accomplish a single goal. In this case, a number of vendor-specific implementations must support the goal of translating between the C++ objects and the database itself.

The Strategy Pattern separates the translation work from the object that is doing the translation. The class responsible for translation has a

Translator

object. Translator
is an abstract base class from which you can derive concrete classes to encapsulate the specific translation algorithms for the different databases you might use. You can then plug in the appropriate derived translator object into any class that needs to interface with the database.

Fixed Length vs Variable Length Strings

Character strings typically come in fixed-length and variable-length formats in databases, and you may find yourself using both types. Fixed length strings are fast, but they can waste space. Variable length strings don't waste space, but they bring some overhead that can slow performance. The usual heuristic is to use fixed length strings for strings which are within narrow size limits, and then to use variable length strings for memos, notes and other fields of unpredictable length.

NULL Fields

Databases typically support the notion of a

NULL

field. This is a field whose value is unknown. Note that there is a difference between an empty string and a NULL
character string — one says that we have a value and it is of zero length, the other says we have no value at all.

Representing Inheritance

The diagram below defines class

and two subclasses B
and C
:

The next diagram shows the memory layout for these classes:

Note that the attributes defined for class

, in this case A.a1
and A.a2
, are part of any object of class B
or C
. Thus, the question is this — how do we define our tables for the objects of our base class and our derived classes? It turns out that there are three schemes that tradeoff disk space and database performance. These options are:

Put the attributes defined in class A
into one table, those defined directly in class B
in another table and those defined directly in class C
in a third table
Put those attributes defined directly in class B
in one table, those defined directly in class C
in a second table, and fold the attributes defined in class A
into both tables
Put all the attributes defined in class A
, class B
and class C
into one big table, and leave NULL
the attributes that are not meaningful for this type of subclass — this is known as the flat table approach

The first approach corresponds to one class mapping into one or more tables. The diagram below shows the layout of the tables for classes

, B
, and C
under this approach. Note that the OID of the base class is captured as a column (or foreign key) in each of the tables that represent the derived classes. A join must be then done to recreate the object from the tables. This approach uses less space than the flat table approach because there are no extraneous NULL
fields — the fields of each table map to each attribute of the corresponding object. The approach, however, runs slower than the other two due to the joins.

Table A	Table B	Table C
`a1`	`b1`	`c1`
`a2`	`A.OID`	`c2`
`OID`	`OID`	`c3`
Date of Creation	Date of Creation	`A.OID`
// other attributes	// other attributes	`OID`
		Date of creation
		// other attributes

The layout of the tables for the classes

, B
and C
under the second approach is shown below. In this case, one class maps into one table. Note that the attributes defined directly in the base class are captured as columns in each of the tables that represent the derived classes. No join is required to recreate the object from the tables. This approach uses less space than the flat table approach, and runs faster than the earlier approach, because there isn't a join.

Table B	Table C
`a1`	`a1`
`a2`	`a2`
`b1`	`c1`
`OID`	`c2`
Date of Creation	`c3`
// other attributes	`OID`
	Date of Creation
	// other attributes

The final diagram shows the layout of the table for classes

, B
, and C
under the flat table approach — that is, more than one object maps into just one table. Note that all of the attributes defined for all of the classes are represented in the table definition for the derived classes. No join is required to recreate the object from the tables. This approach uses more space than the other two approaches but may run faster because they are always loaded. On the other hand, it may run slower, as the tables are larger in this case, and may require more disk I/O. There will be many extraneous NULL
fields. For example, an object A
stored in this table would have to leave NULL
all the fields for the attributes b1
, c1
, c2
and c3
.

Flat Table

a1

a2

b1

c1

c2

c3

OID

Date of Creation

// other attributes

Representing Relationships

How we represent the relationship between two objects depends on the multiplicity of the relationship. That is, one-to-one relationships are represented differently from one-to-many or many-to-many. You find this multiplicity in the object model and must implement it in the database.

There are two ways of mapping the relationships between the objects into the database. In the first approach, a relationship between two objects is mapped into the database by using the OID of one object as the value of a column in the table that represents the other object. In the second approach, a separate table is used to capture the relationship information. The decision on which approach to use in a specific case is based on:

The multiplicity of the relationship
Whether it is a unidirectional or bidirectional relationship

One-to-One Relationships

If we were representing a one-to-one bidirectional relationship, then the table definition for each object should include the OID of the other object as a "pointer" to the identity of that object. For example, suppose you have a

transaction

object and a receipt
object. Each transaction
has one receipt
and each receipt
represents a single transaction
. If you modeled the receipt
as one record and the transaction
as another record, you would put the ID of the receipt
into the transaction
record and the ID of the transaction
into the receipt
record.

On the other hand, if it is a one-to-one unidirectional relationship then the object that is being pointed to has the OID of the object doing the pointing. This seems backwards. What is going on here is that we have to search the table defining the pointed-to objects, in order to find the one that has the OID of the pointing object in it. Thus, the rule is that the pointer goes "the other way" in the table.

Assume, for example, that the

receipt

object needs to find the transaction
, but there is no reason for the transaction
to find the receipt
. In this case, you have a one-to-one, unidirectional relationship (from receipt
to transaction
). You might expect that the receipt
would have the ID of the transaction
, but in fact it goes the other way — that is, the database representation of the object being pointed to (the transaction
) has the OID of the object doing the pointing (the receipt
). Thus, you can ask the database to find the transaction
with the OID of the receipt
.

One-to-Many Relationships

This rule about unidirectional relationships makes much more sense when you are looking at a one-to-many relationship. For example, assume you have a

VideoTitle

which has many actual VideoTape
s. The VideoTitle
points to each of the VideoTape
objects (and, for whatever reason, you model that as a one-way relationship — the VideoTape
s

never need point back to the VideoTitle
). In this case, it makes sense for the object being pointed to (VideoTape
) to have the OID of the object doing the pointing (VideoTitle
). Now you could query the database for every tape that has this title in its target field.11

Let me give you another example from VideoMagix. Imagine a club that some customers might belong to, the most frequent borrowers, perhaps. We could model this using a one-to-many unidirectional relationship between the club and the customers. To keep things simple, let's assume that a customer can belong to one and only one club. The record for each

Customer

object would keep the ClubID
as an entry in its record, whereas if it were the other way around, then the record for the club would have to store many CustomerID
s.

If it is a many-to-many bidirectional relationship, then putting the OID of either object as the value of a column in the other object will not work. In this case, a separate table, known as a join table, is used.

Many

Customer

s will rent each VideoTape
, and each Customer
will rent many VideoTape
s. It wouldn't work to store every CustomerID
in every VideoTape
, nor does it work to store the ID of every VideoTape
in every Customer
(who knows how many values we'll need). It is therefore much more efficient to create a join table, which joins Customers
to VideoTapes
, where each entry corresponds to a single CustomerID
and a single VideoTape
.

Representing Collections

There are two reasons that collections are generated from the database. The first way is in the representation of associations that are not one-to-one. In this case, the object on at least one end of the association needs a collection to represent the set of objects on the other end of the association.

The second way that collections are used is as a representation of the results of a query. Queries typically result in a collection of objects, or a collection of strings that describe these objects. Each of these sets of resultsshould be represented in the return value of the query function with an instance of a collection class.

In the MFC, queries produce

RecordSet

s. The RecordSet
is an iterator over the set of records that satisfy the query. The actual collection of the records is hidden behind the interface to the RecordSet
; you use the RecordSet
to iterate over each record in turn.

If you are writing your own collections to manage the results of queries, you'll want to store the results in a collection and provide an iterator for that collection. The typical implementation will use an STL collection and an STL forward iterator. Member functions will also be needed, if only to append to the collection, add by position, delete from the front or back, delete by position and get the length etc. These collections can typically be implemented as template classes, as all of the objects that satisfy a query are of the same type or are all descended from a common base type.