Designing a Data Access Layer

Duwamish Books, Phase 2

Robert Coleridge
Microsoft Developer Network

July 1998

Summary: Discusses costs, benefits, and design issues with creating a data access layer. (8 printed pages)

Introduction

A well-developed application, no matter how large or small, should be written with maintenance in mind. One of the best approaches to maintainability is to use a layered paradigm when designing the application. This approach dictates that the application be designed in at least three layers: a presentation layer, a business logic layer, and a data access layer (DAL). These layers can be coded within the application, or they can be separate components outside of the application. For Microsoft® Visual Basic® programmers, writing a layer into the application involves writing the layer as its own module. For Microsoft Visual C++® programmers, this means writing a layer as its own class. This separation allows for a more easily maintained code base and also ensures that when the time comes to turn the code into a component, the effort required will be minimal. Separating this functionality into a component also makes the code far easier to debug, extend, and reuse.
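
As a minimal sketch, a DAL written as its own Visual Basic module might look something like the following; the routine, table, and connection-string names here are hypothetical.

' DataAccess.bas -- a hypothetical module-level DAL.
' Nothing outside this module touches the database directly.
Public Function GetCustomerName(ByVal lCustomerId As Long) As String
    Dim cn As ADODB.Connection
    Dim rs As ADODB.Recordset

    Set cn = New ADODB.Connection
    cn.Open g_sConnectString    ' hypothetical connection string, kept in one place

    Set rs = cn.Execute("SELECT Name FROM Customers WHERE PKId = " & lCustomerId)
    GetCustomerName = rs.Fields("Name").Value

    rs.Close
    cn.Close
End Function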

This article examines some of the design issues involved in writing a DAL, including additional issues when writing the layer as a separate COM component, with particular reference to the MSDN Duwamish Books sample.

In Phase 1 of the Duwamish Books sample, the DAL is a code module, but Phases 2 to 4 use a COM component. For more details on the phases of Duwamish Books, see "An Introduction to the Duwamish Books Sample."

Costs

Besides maintainability and code reuse, there are distinct benefits in creating a data access layer. As with every approach, these benefits come with their respective costs.

More Thought in Design

A DAL is a great idea, but it involves more design work. When the code is incorporated throughout the application, the data access can be written "on the fly," as it is needed, without much forethought. However, when data access is moved into its own layer, design decisions must be made before the code is implemented so that the DAL covers all of the data access requirements of the particular application or system. This is even more critical if the DAL becomes a generic COM object that is used by more than one application. This forethought can be perceived as a disadvantage by some who think that the term RAD stands for Rapid Application Design, when in fact it stands for Rapid Application Development. Attempting to substitute RAD for design usually means the term stands for Rapid Application Disaster.

An additional increase in design time is required when creating a data access component (such as the Duwamish DBDAL). An effort must be made to ensure that the data access layer covers all areas of functionality required by the calling application. It would defeat the purpose of a data access component if it covered only 75 percent of the data access requirements and forced the developer to code specific data access functionality within a routine or application. Once again, this requires designing ahead of time, but the advantages of this extensible functionality outweigh the disadvantages of writing data-specific code for each routine.

New API

Once the DAL has been designed with generic functionality in mind, you must code up the new API/method signatures. This requires some forethought as to which parameters are best to pass in, what order makes the most sense, and so forth.
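
For instance, a first cut at the DAL's public signatures might read as follows; the names and parameter choices are purely illustrative.

' Hypothetical DAL signatures. Each one records a design
' decision: what the caller passes in, in what order, and
' what comes back.
Public Function GetRecords(ByVal sQuery As String) As ADODB.Recordset
    ' ... open a connection, run the query, return the results
End Function

Public Sub UpdateRecords(ByRef rs As ADODB.Recordset)
    ' ... reconnect the recordset and write its changes back
End Sub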

Hidden Internals

If you create your DAL as a component, other developers will not be able to see the "gory details" of the native data access mechanisms underlying the DAL, which makes the component harder for them to inspect and troubleshoot. On the other hand, this opacity provides greater security. See the related section below under "Benefits."

Benefits

Duwamish Books

Let's look at the benefits (which, in my opinion, justify the costs) of a data access layer. Once Duwamish Books hit Phase 2, where we componentized the DAL, the code not only got smaller, but also became much easier to debug because debugging the DAL was done once. This separation of data access from the applications also gave us more flexibility in the distribution of the applications.

Increase in Maintainability

One of the greatest advantages of using a data access layer is the increase in maintainability. Too many programs have data access code duplicated in multiple places or, even worse, woven throughout the application. Separating the DAL within the application makes the code much more maintainable.

Creating a data access component provides even more benefits. Separating the DAL from the application not only creates more maintainable code but also far more extensible and reusable code. If and when the underlying data access code changes, this separation alleviates the headache of recompiling and testing each application that uses it. The only code that needs full testing is the data access component itself.

Decrease in Overall Code Size

A separate data access component also results in reduced code size for any dependent application. If the DAL code is embedded in each application, each application carries redundant code. Once the data access code is moved out of the application, the application shrinks, which also reduces the bandwidth needed to distribute it.

Minimizes Number of APIs

By using a DAL, you reduce the number of application programming interfaces (APIs) that developers have to learn. They are faced only with an API set that is relevant to the application or suite of applications, and can focus their time on developing an application rather than on learning a number of APIs.

Increase in Security

One of the advantages of using a data access component is that it hides the data implementation from developers and/or end users. There is no direct access to the data. Assuming the DAL was designed correctly, data manipulation is limited to ways that are controlled by the application.

Often the owner of data does not want others to see the details of the data structures, organization, and so on. By using a data access component, not only are the data structures hidden, but also the access techniques. By hiding these internals, the data and the code are protected.

Placing Your Order

There are many ways to request data from a data source, such as programmatic extraction, third-party tools, dynamic requests, and external stored procedures. Each method has its place and advocates. After a brief description of the first two, I would like to focus on the last two: dynamic requests via SQL statements and external procedures via SQL stored procedures.

Programmatic Extraction

Programmatic extraction allows each program to use its own methods to extract data. This tends to be used with proprietary data structures. This is usually the most expensive proposition in terms of resources, but in some cases it may be the most efficient because the owner of the data knows the structure of that data.

Third-Party Tools

This is by far the most popular method used today. Rather than have each developer write their own data extraction routines, third-party vendors have written such tools as Open Database Connectivity (ODBC), DBLib, OLE DB, and ActiveX® Data Objects (ADO).

These types of tools provide the developer with data access techniques that are optimized for the data source or, in the case of ODBC, provide a generic data access strategy. While these tools provide excellent access to data, they still restrict the developer to a certain access paradigm. By wrapping a DAL around these tools, the developer can learn a fixed API set and not worry when the underlying strategy changes or is replaced.

Dynamic Requests

By dynamic requests, I do not mean the type of request that an end user might type in, but rather one that is generated programmatically. The request can be generated as a result of programmatic logic or it can be selected from a list of predetermined requests. Either way, the exact nature of the request is not known beforehand and is generated or created at run time.

One of the best languages for this type of request is Structured Query Language (SQL). With its rich language syntax and its near-universal dialect, SQL is a prime candidate for generating dynamic requests in a DAL. Also, most third-party tools and libraries work well with SQL.

However, a dynamic request is not the most efficient or secure method for requesting data. The security problem can be seen when you need to send SQL statements such as:

UPDATE Accounts
SET Balance = Balance + 100.00
WHERE Account = 12345678;

This statement is very easy to understand, but a security breach may occur when you have to supply sensitive information, such as an account number, over a network. Also, the entire statement is more than 60 characters long; any corruption in transmission may cause an error or even cause the statement to point to another account.

These types of requests can also affect performance, because not only is more information sent over the network wire, but also, in the case of many SQL engines, the statement must be validated and compiled each time the engine receives it. If your application or system is going to send massive numbers of these queries, the repeated validation/compilation cycles will slow down everyone who needs to access the data.
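
As a sketch of what this looks like from the DAL's side, a routine that builds the UPDATE statement above at run time might be written as follows (the routine name is hypothetical):

' A hypothetical DAL routine that builds the UPDATE statement
' above at run time. The whole statement travels over the wire
' and is validated and compiled by the SQL engine on every call.
Public Sub AddBalance(ByVal cn As ADODB.Connection, _
                      ByVal lAccount As Long, ByVal curAmount As Currency)
    Dim sSql As String
    sSql = "UPDATE Accounts" & _
           " SET Balance = Balance + " & curAmount & _
           " WHERE Account = " & lAccount
    cn.Execute sSql, , adExecuteNoRecords
End Sub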

Stored Procedures

A better method than using dynamic SQL statements is to predetermine what SQL statements need to be executed and store them in the SQL engine. This allows the SQL engine to prevalidate and precompile the statements. All you need to do to execute them is pass in the appropriate arguments. In this case, the code for updating an account balance can be as follows:

EXECUTE spAddBalance 12345678, 100.00

Although the stored procedure name spAddBalance does give away what the procedure does, it could just as easily have been written as:

EXECUTE spB123 12345678, 100.00

The name of the stored procedure doesn't matter as long as the developer knows what it does. With stored procedures, only the parameters have to be passed in, the SQL statements are prevalidated and precompiled, and the amount of data sent over the network is smaller.
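
With ADO, for example, the DAL might invoke the stored procedure through a Command object along these lines; the parameter names and types are assumptions:

' Calling the stored procedure through an ADO Command object.
' Only the two argument values travel over the network.
Dim cmd As ADODB.Command
Set cmd = New ADODB.Command

Set cmd.ActiveConnection = cn    ' cn is an open ADODB.Connection
cmd.CommandText = "spAddBalance"
cmd.CommandType = adCmdStoredProc

cmd.Parameters.Append cmd.CreateParameter("@Account", adInteger, adParamInput, , 12345678)
cmd.Parameters.Append cmd.CreateParameter("@Amount", adCurrency, adParamInput, , 100)

cmd.Execute , , adExecuteNoRecords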

Afterthought

The issues raised by the above discussions of dynamic requests and stored procedures would be minimized if the DAL were running on the same machine as the SQL engine (or at least on the same side of the network) and all communication to the DAL were through some other component. This would allow only the DAL to generate dynamic requests or invoke stored procedures and thus alleviate the security and transmission issues.

Duwamish Books

Because Duwamish is as much a migration scenario as a sample, we used the dynamic SQL method in the early phases and migrated to using stored procedures in the later phases. The transition from dynamic SQL to stored procedures coincided with the shift from Microsoft Access to Microsoft SQL Server™. As such, the SQL statements used for the later phases of Duwamish are optimized for SQL Server. There are two major reasons why Duwamish made the shift from Access to SQL Server: first, Access does not have the ability to use stored procedures and, second, Access is not designed to handle thousands of concurrent users.

Picking up Your Order

There are several ways to retrieve requested data, such as files, arrays, collections, and ADO Recordset objects.

Files

Using a file to retrieve information from somewhere else may sound redundant, because the original data is probably already in a file. However, this mechanism is useful if you only need a part of the file, or need the retrieved data to persist even after the computer is shut down.

Most of the time, however, you'll need the data to be more readily available with a much quicker access time. This is where the following mechanisms come in because they all use memory to hold the data.

Arrays

Using an array to hold and transfer data is probably one of the most efficient mechanisms in terms of sheer memory footprint. The data is bundled up in one tightly packed entity and can be transmitted as one big chunk of memory, or binary large object (BLOB). Although this mechanism is memory-efficient, it is the most difficult to use effectively because you must write all of the handling code. An array is simply a chunk of memory that you can index into in order to manipulate an entry; it has no intrinsic means of searching, deleting, extending, or adding to itself, so you must code routines for those tasks yourself. There are implementations of smart arrays that have these abilities, but then they are no longer simply arrays but more like collections or objects.
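
For example, ADO's GetRows method will copy a recordset's data into a two-dimensional Variant array, after which all of the handling code is up to you. A sketch, assuming field 0 holds the account number:

' ADO's GetRows method copies a recordset into a two-dimensional
' Variant array laid out as (field, row).
Dim avData As Variant
Dim lRow As Long

avData = rs.GetRows()    ' rs is an open ADODB.Recordset

' The array has no search of its own; you write the loop.
For lRow = 0 To UBound(avData, 2)
    If avData(0, lRow) = 12345678 Then    ' assume field 0 is the account number
        avData(1, lRow) = avData(1, lRow) + 100    ' update in place
    End If
Next lRow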

Collections

A COM collection is a very effective way to send data around because it has the ease of use of an array without the limitations. Adding to, updating, and deleting from a collection are all easy. Although iterating through a collection is not quite as simple as indexing into an array, a COM collection is designed for iteration. One of the most frequently used features of a collection is the ability to find a specific element by an associated key.
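
A short sketch of a Visual Basic Collection at work; the keys and items are illustrative:

' A Visual Basic Collection: add, key-based lookup, delete,
' and iteration are all built in.
Dim colNames As New Collection
Dim vItem As Variant

colNames.Add "Robert", Key:="12345678"    ' add with an associated key
colNames.Add "Duwamish", Key:="87654321"

Debug.Print colNames("12345678")    ' look up by key
colNames.Remove "87654321"          ' delete by key

For Each vItem In colNames          ' iterate through the collection
    Debug.Print vItem
Next vItem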

ADO Recordsets

An ADO Recordset object can be thought of as a very smart array with the abilities of a collection (and much more). Not only can a recordset be used to hold data, but it also can be searched through, added to, deleted from, updated, and so on. If a recordset is connected to a data source, any changes made to the recordset can be sent back to the data source with a simple command. If the recordset is not connected to any data source (this is called a disconnected recordset), the recordset can be transmitted over the network and manipulated in a location other than its originating code.

Duwamish Books

Because Duwamish uses ADO as its underlying native data access tool and ADO has an excellent object (Recordset) for passing data, we decided to use the ADO Recordset as our data-passing mechanism. Not only does this provide us with a fairly stable data object (ADO should be around for a while), but we also did not have to code all of the handlers for the various methods.

Staying in Touch

There are many advantages to disconnected recordsets, but there are also times when a connected recordset is the best choice. What's the difference between the two?

Duwamish Books

The Duwamish Books sample uses both connected and disconnected recordsets. When a group of records is required for display purposes only, the data is retrieved in a disconnected recordset. When the data is required to be connected, or a record's update is required before any other code can continue, connected recordsets are used. For example, when a sale order is being added to the database, a sales header is created in a connected manner, the fields updated, and the record written back to the database. This allows the next piece of code to immediately get the sales header record's primary key (PKId) field to use with the associated sales detail records. Once the sales detail records are added to the database, the connection is finally broken. Without the sales header connected, the PKId field would not be filled in and therefore would not be available for use with the sales detail records. Both of these operations are surrounded by a transaction, so if either fails, the operations can be rolled back.
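
A sketch of that flow follows; apart from PKId, the table and field names are assumptions:

' Hypothetical sketch: add a connected sales header inside a
' transaction, then read back its PKId for the detail records.
Dim rsHeader As New ADODB.Recordset
Dim lHeaderKey As Long

cn.BeginTrans    ' cn is an open ADODB.Connection

rsHeader.Open "SELECT * FROM SalesHeaders WHERE 1 = 0", cn, _
              adOpenKeyset, adLockOptimistic
rsHeader.AddNew
rsHeader.Fields("CustomerId").Value = lCustomerId    ' assumed input
rsHeader.Update    ' write the header back immediately

lHeaderKey = rsHeader.Fields("PKId").Value    ' key is now available

' ... add the sales detail records using lHeaderKey ...

cn.CommitTrans    ' or cn.RollbackTrans if anything failed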

Connected

A connected recordset is a recordset object that maintains a connection to the data source. This connection can be maintained during the life of the recordset, or it can be disconnected at any time by changing the active connection. If the connection is broken, any updates to the recordset will not be transmitted to the data source until a connection is reestablished.

The advantage of a connected recordset is that the recordset has instant (what a relative term!) access to the data. How "instant" depends on how the recordset is connected to the data source. If the recordset is created to maintain a dynamic set of data, changes to the original data are reflected in the recordset. If the recordset is created as a static snapshot of the data, changes to the original data are not reflected in the recordset and are only picked up when the recordset attempts an update.

The disadvantage of a connected recordset is that the connection to the data source travels with the recordset, no matter how far down a network it goes. This loads the data server with the overhead of maintaining the connection and increases the resource usage of the Recordset object.

Disconnected

A disconnected recordset is simply a recordset that has been populated with data from the data source and then disconnected from that data source. It doesn't have the overhead of the connection to the data source. However, any updates to the data are not reflected in the original data source until the recordset is reconnected to the data source and an update method invoked.
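
With ADO, for example, a recordset can be disconnected and later reconnected along these lines:

' A client-side cursor and batch locking make the disconnect possible.
Dim rs As New ADODB.Recordset

rs.CursorLocation = adUseClient
rs.Open "SELECT * FROM Items", cn, adOpenStatic, adLockBatchOptimistic

Set rs.ActiveConnection = Nothing    ' disconnect; cn can now be closed

' ... pass the recordset around and edit it offline ...

Set rs.ActiveConnection = cn    ' reconnect later
rs.UpdateBatch                  ' push the offline changes to the data source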

An All-or-Nothing Proposition

Often a change to the data requires more than one alteration. Transacted recordsets ensure that all of the changes take place or, if necessary, that they are all reversed.

Nontransacted

A nontransacted recordset is one in which the changes to the data are made permanent without any chance of programmatically saying, "Oops, that didn't work, so undo my last change!" Essentially, there is no "undo" command.

Transacted

On the other hand, a transacted recordset allows you to control whether your changes are made permanent. You can begin a transaction, make some changes, and then either commit all of those changes or "undo" them. Once the decision is made, the code can close off the transaction.
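
With ADO, the transaction is controlled through the Connection object. A minimal sketch, where the success flag is hypothetical:

' Transaction control lives on the ADO Connection object.
cn.BeginTrans    ' start the transaction

' ... make several dependent changes through the connection ...

If bAllChangesSucceeded Then    ' hypothetical success flag
    cn.CommitTrans      ' make the changes permanent
Else
    cn.RollbackTrans    ' "Oops, that didn't work -- undo!"
End If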

Duwamish Books

Duwamish Books uses both transacted and nontransacted recordsets. In the case of such atomic changes as adding a record to the inventory, it uses nontransacted recordsets. When recordsets are dependent on another recordset, it uses transacted recordsets.

The addition of sales details after the addition of a sales header is a classic case of a transacted recordset. The sales header must be added to the database first to obtain a PKId, and then the details are added. If, by chance, the details violate database referential integrity rules, the sales header addition can be rolled back without affecting the rest of the database.

Frequent Flyer Points

Maintaining a connection to a data source can be expensive in terms of resources. If you need to make frequent calls to the data source, however, it can be better to maintain a connection until all of the calls are finished.

Duwamish Books

Several parts of Duwamish Books maintain a cached connection while they do multiple round trips to the database to gather multiple recordsets. Once these multiple recordsets are gathered, the cached connection is dropped.

Cached Connections

Using a cached connection is very useful when you need to make frequent trips to the data source, because opening and closing a connection every time you want to touch the data can be costly in terms of time, memory, and data transmission. For example, cached connections are useful when using transacted recordsets, because the connection will be needed later to either commit or roll back the changes.
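
A sketch of a cached connection making several round trips; the table names and connection string are illustrative:

' Open the connection once, reuse it for several round trips,
' then drop it.
Dim cn As New ADODB.Connection
Dim rsItems As ADODB.Recordset
Dim rsCustomers As ADODB.Recordset

cn.Open g_sConnectString    ' hypothetical connection string

Set rsItems = cn.Execute("SELECT * FROM Items")            ' trip 1
Set rsCustomers = cn.Execute("SELECT * FROM Customers")    ' trip 2

cn.Close    ' all trips done; release the connection
Set cn = Nothing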

Noncached Connections

What is meant by noncached? It means simply that a connection to the data source is not maintained over the life of a recordset. A noncached environment is much easier to work with and, when trips to the data source are infrequent, can be faster overall.

Conclusion

The design issues described in this article are the more salient points that we ran across when writing the Duwamish Books data access layer—we struggled with them and wish to pass what we learned on to you.