Serialization

Serialization is the process of taking objects and converting their state information into a form that can be stored or transported.

Serialization involves replacing pointers with object references or IDs, and converting in-memory binary data into a stream of bytes. How the receiving object interprets this stream is determined entirely by an agreed-upon protocol. The receiving object may simply store the stream "as is", or it may convert the stream into entries in a database.
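To make the pointer-to-ID idea concrete, here is a minimal sketch. The names Node, IdTable and describe are hypothetical and exist only for this illustration, as does the convention that ID 0 means "no object". It shows one possible scheme, not a prescribed one.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical node type: a primitive value plus a raw pointer to a peer.
struct Node {
    int value;
    const Node* next;  // an address is meaningless outside this process
};

// Hypothetical helper: hands out a small integer ID the first time an
// address is seen, and the same ID on every later request.
class IdTable {
public:
    int idFor(const Node* p) {
        if (p == nullptr) return 0;  // assumption: 0 is reserved for "no object"
        std::map<const Node*, int>::iterator it = ids_.find(p);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(ids_.size()) + 1;
        ids_[p] = id;
        return id;
    }
private:
    std::map<const Node*, int> ids_;
};

// Produces "<own id> <value> <id of next>"; no address reaches the stream.
std::string describe(const Node& n, IdTable& table) {
    int self = table.idFor(&n);
    int next = table.idFor(n.next);
    std::ostringstream out;
    out << self << ' ' << n.value << ' ' << next;
    return out.str();
}

// Usage: two linked nodes; the text contains IDs, never addresses.
void demo() {
    Node b = {20, nullptr};
    Node a = {10, &b};
    IdTable table;
    assert(describe(a, table) == "1 10 2");
    assert(describe(b, table) == "2 20 0");
}
```

A reader on the other side would keep the inverse table (ID to newly created object) and patch pointers once every object has been rebuilt.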

Other uses of serialization include the marshalling process used by inter-process communication mechanisms (such as CORBA or DCE) and formatting data for communication over low-level network technologies such as TCP/IP sockets.

When persisting to a relational database (RDB), the receiving object is your persistence mechanism: an object that converts the stream of data into relational table entries. When the application retrieves objects, the persistence mechanism reads from the database and returns each object to the application as a stream of bytes. The application then "reconstitutes" the object from the stream. With proper encapsulation, the application only ever produces and consumes streams, and is thus indifferent to the structure of the database.

Application classes should have no knowledge of the external representation format. As the interpretation of the stream changes, the application classes should be unaffected.

One way to enforce this separation of concerns is presented in the Serializer design pattern in PLOP 3 (Pattern Languages of Program Design 3, edited by Martin et al., ISBN 0-201-31011-2, Addison-Wesley). This pattern allows your application to read from and write to a variety of storage mechanisms using a uniform mechanism. The pattern makes use of a reader and a writer. The writer writes the object out to a serial stream and the reader uses the serial stream to recreate the object(s). A different reader/writer pair can be used for each persistent store that needs to be supported, but nothing else in your application need change.

To support this capability, an abstract base class called Serializable is defined. Any class that needs to be serialized must inherit from Serializable. This interface defines two methods: a readFrom method that takes a reader as a parameter, and a writeTo method that takes a writer as a parameter.

The reader and writer are implemented as classes which typically overload the operators << and >>. These will be overloaded to take each of the primitive data types (e.g. int, long, short and so forth). Here is an excerpt from a typical version of such a class:

class Writer
    {
    public:
        virtual Writer& operator<<(int);           // primitives are written by value
        virtual Writer& operator<<(long);
        virtual Writer& operator<<(short);
        virtual Writer& operator<<(const char*);   // the writer never modifies the string

        //...
    };

class Reader
    {
    public:
        virtual Reader& operator>>(int&);
        virtual Reader& operator>>(long&);
        virtual Reader& operator>>(short&);
        virtual Reader& operator>>(char*&);

        //...
    };

class Serializable
    {
    public:
        virtual void readFrom(Reader&) = 0;
        virtual void writeTo(Writer&) = 0;
    };
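As an illustration of how one reader/writer pair might be implemented, here is a minimal sketch of a text-based pair and a round trip through an in-memory stream. TextWriter, TextReader, Label and roundTrip are hypothetical names; std::string stands in for char* to avoid manual memory management; and the one-value-per-line format is an assumption that would break on strings containing newlines. The interfaces are reduced to two overloads to keep the sketch short.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

class Writer {
public:
    virtual ~Writer() {}
    virtual Writer& operator<<(int) = 0;
    virtual Writer& operator<<(const std::string&) = 0;
};

class Reader {
public:
    virtual ~Reader() {}
    virtual Reader& operator>>(int&) = 0;
    virtual Reader& operator>>(std::string&) = 0;
};

class Serializable {
public:
    virtual ~Serializable() {}
    virtual void readFrom(Reader&) = 0;
    virtual void writeTo(Writer&) = 0;
};

// Hypothetical concrete pair: one value per line, as text.
class TextWriter : public Writer {
public:
    explicit TextWriter(std::ostream& out) : out_(out) {}
    Writer& operator<<(int v) override { out_ << v << '\n'; return *this; }
    Writer& operator<<(const std::string& s) override { out_ << s << '\n'; return *this; }
private:
    std::ostream& out_;
};

class TextReader : public Reader {
public:
    explicit TextReader(std::istream& in) : in_(in) {}
    Reader& operator>>(int& v) override {
        std::string line;
        std::getline(in_, line);
        assert(!line.empty());  // an int is never written as an empty line
        v = std::stoi(line);
        return *this;
    }
    Reader& operator>>(std::string& s) override { std::getline(in_, s); return *this; }
private:
    std::istream& in_;
};

// Hypothetical application class: it sees only Reader and Writer.
class Label : public Serializable {
public:
    Label() : id_(0) {}
    Label(int id, const std::string& text) : id_(id), text_(text) {}
    void writeTo(Writer& w) override { w << id_ << text_; }
    void readFrom(Reader& r) override { r >> id_ >> text_; }  // same order as writeTo
    int id() const { return id_; }
    const std::string& text() const { return text_; }
private:
    int id_;
    std::string text_;
};

// Writes a Label to an in-memory stream and rebuilds a copy from it.
Label roundTrip(Label original) {
    std::stringstream store;
    TextWriter writer(store);
    original.writeTo(writer);
    TextReader reader(store);
    Label copy;
    copy.readFrom(reader);
    return copy;
}
```

Supporting a different store would mean adding another Writer/Reader pair; Label itself would not change.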

This pattern shows you how to read and write all the built-in primitive types such as int and long, but raises the question of how you read and write complex user-defined types. The answer is that each object delegates responsibility for reading and writing any non-primitive data member to the member itself. That is, user-defined types write out their primitive members and tell their user-defined members to write themselves. Each one in turn recursively delegates this responsibility. Ultimately, every user-defined type is composed of primitive types, so every object can be written and read.

Let's take a simple, if fanciful, example. Assume that you have the following classes:

class Employee
{
public:
   //... 
private:
   Address myAddress;
   int myAge;
};

class Address
{
public:
   //...
private:
   Street myStreet;
   char* myCity;
   char* myState;
};

class Street
{
public:
   //...
private:
   int myNumber;
   char* StreetName;
};

When you tell an Employee object to persist itself, it will write the variable myAge (an integer) and will tell the variable myAddress to write itself. The Address object will write out myCity and myState but will delegate to myStreet, a Street object, the responsibility for writing itself out. myStreet has only primitive member variables (an int and a char*) and so can write itself out without any further delegation.
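The delegation just described could look roughly like the following sketch. Several things here are assumptions made for brevity: the Writer is a non-virtual stand-in that appends each value to a text buffer followed by '|', std::string replaces char*, the constructors are invented, and the Serializable base class is omitted. The classes are also declared innermost first so that the sketch compiles.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in writer: appends each value to a buffer, followed by '|',
// so the order of writes is easy to inspect.
class Writer {
public:
    Writer& operator<<(int v) { buf_ << v << '|'; return *this; }
    Writer& operator<<(const std::string& s) { buf_ << s << '|'; return *this; }
    std::string str() const { return buf_.str(); }
private:
    std::ostringstream buf_;
};

class Street {
public:
    Street(int number, const std::string& name) : myNumber(number), StreetName(name) {}
    // Only primitive members: nothing left to delegate.
    void writeTo(Writer& w) { w << myNumber << StreetName; }
private:
    int myNumber;
    std::string StreetName;
};

class Address {
public:
    Address(const Street& street, const std::string& city, const std::string& state)
        : myStreet(street), myCity(city), myState(state) {}
    void writeTo(Writer& w) {
        w << myCity << myState;  // own primitive members
        myStreet.writeTo(w);     // the Street writes itself
    }
private:
    Street myStreet;
    std::string myCity;
    std::string myState;
};

class Employee {
public:
    Employee(const Address& address, int age) : myAddress(address), myAge(age) {}
    void writeTo(Writer& w) {
        w << myAge;              // own primitive member
        myAddress.writeTo(w);    // the Address writes itself
    }
private:
    Address myAddress;
    int myAge;
};

// Usage: the stream holds age, city, state, street number, street name.
void demo() {
    Employee e(Address(Street(12, "Elm Street"), "Boston", "MA"), 41);
    Writer w;
    e.writeTo(w);
    assert(w.str() == "41|Boston|MA|12|Elm Street|");
}
```

Employee never mentions Street: if Street gained a new member, only Street::writeTo would need to change.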

Thus, serializing an object is done by following this algorithm:

1. Ask your parents in the inheritance hierarchy to serialize themselves
2. Serialize your primitive data members
3. Serialize your object pointers (discussed below)
4. Ask your sub-objects (those user-defined objects of which you are composed) to serialize themselves

This last step, asking your member objects to serialize themselves, allows the implementation of serialization to be fully encapsulated and localized in the object that knows best how to do it. This means that, as you add new classes, the Serializable base class need not change. Classes that contain a new type do not need to change either; the responsibility for serializing that new type is encapsulated in the new type itself.
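The four steps of the algorithm can be seen together in a single writeTo method. The sketch below is only illustrative: Person, Manager and Office are hypothetical classes, the Writer is a simplified stand-in that separates values with '|', and the pointer is written as an integer ID carried by each Person (with 0 meaning a null pointer), which is just one assumed scheme for step 3.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in writer: appends each value to a buffer, followed by '|'.
class Writer {
public:
    Writer& operator<<(int v) { buf_ << v << '|'; return *this; }
    Writer& operator<<(const std::string& s) { buf_ << s << '|'; return *this; }
    std::string str() const { return buf_.str(); }
private:
    std::ostringstream buf_;
};

class Office {
public:
    explicit Office(int room) : myRoom(room) {}
    void writeTo(Writer& w) { w << myRoom; }
private:
    int myRoom;
};

class Person {
public:
    Person(int id, const std::string& name) : myId(id), myName(name) {
        assert(id > 0);  // assumption: 0 is reserved to mean "null pointer"
    }
    virtual ~Person() {}
    int id() const { return myId; }
    virtual void writeTo(Writer& w) { w << myId << myName; }
private:
    int myId;
    std::string myName;
};

class Manager : public Person {
public:
    Manager(int id, const std::string& name, int level,
            const Person* assistant, const Office& office)
        : Person(id, name), myLevel(level), myAssistant(assistant), myOffice(office) {}

    void writeTo(Writer& w) override {
        Person::writeTo(w);                          // 1. parent serializes itself
        w << myLevel;                                // 2. primitive data members
        w << (myAssistant ? myAssistant->id() : 0);  // 3. pointer, written as an ID
        myOffice.writeTo(w);                         // 4. sub-object serializes itself
    }
private:
    int myLevel;
    const Person* myAssistant;  // not owned; may be null
    Office myOffice;
};

// Usage: the output follows the four steps in order.
void demo() {
    Person pat(2, "Pat");
    Manager max(1, "Max", 3, &pat, Office(404));
    Writer w;
    max.writeTo(w);
    assert(w.str() == "1|Max|3|2|404|");
}
```

A matching readFrom would consume the values in exactly the same order, again starting with the parent class.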