OLAP Services: Managing Slowly Changing Dimensions

Amir Netz
Microsoft Corporation

April 1, 1999

Summary: This article outlines several main types of slowly changing dimensions. After showing how an online analytical processing (OLAP) server operates when reading dimension members, the article then offers ways to handle these and other real-world scenarios using Microsoft® SQL Server™ OLAP Services. (17 printed pages)

Contents

Introduction
The Types of Slowly Changing Dimensions
How the OLAP Server Handles Dimension Changes
Managing Type One Slowly Changing Dimensions
Managing Type Two Slowly Changing Dimensions
Changes in the Hierarchy for High-Level Members
Conclusion
Finding More Information

Introduction

Although gradual changes to database information are a sign of a healthy data warehouse, organizations face a challenge when maintaining a cube, or multidimensional representation of detail and summary data, over time. For example, a sales representative might transfer to a new branch office. If his old sales figures move to the new branch with him, a report from his original office will suddenly show historically poor performance—to the dismay of the branch manager. Other common changes, called slowly changing dimensions, may include the addition of a new product, a department reorganization, or changing characteristics of a property (changing the salt percentage in a soup listing, for example).

One of the common problems in maintaining cubes is the fact that many dimensions change over time. Mostly, these changes are slow. Situations where a dimension is completely changed overnight are rare. Drastic changes are usually caused by design changes or, less often, by a complete reorganization of the relationships among members in the dimension. However, occasional small changes are normal in business. Examples of these changes include:

In a data warehouse, dimensions that reflect these types of incremental changes are known as slowly changing dimensions. Slowly changing dimensions are a natural phenomenon that occurs in almost every healthy data warehouse. However, this phenomenon poses significant challenges to cube maintenance.

To better explain some of the problems, here are some examples. A sales rep, Bob, is transferred from one branch office to another. What should happen to all of his existing sales transactions? Should these transactions move with him to the new branch? If the transactions are moved, a retroactive change of history will occur because all of the historical sales will be associated with the new branch. It is quite certain that the manager of the original branch will not take it lightly that the established performance of his branch is suddenly degraded.

Suppose you decide to leave the transactions in the original branch. Who should own them? The sales rep does not exist in that branch anymore. What about the transferred sales rep? Will all of his old sales history be lost?

What if you want to rewrite history? For example, assume that a product category is reclassified. If the old transactions stay in the original category, after a while it would look as if the demand for the original category had dropped and the demand for the newly assigned category had grown. If demand patterns have not really changed, the data would then be misleading. In this case, all the existing transactions should move from the original category to the new one. But how should this be accomplished? Rewriting old transactions is problematic because the aggregations must be updated as well.

What about the case where a dimension member is deleted? For example, a business unit was disbanded during a reorganization and no longer exists. What should happen to all of the transactions of the former business unit?

These are all valid questions and real-world common problems. This article provides solutions and ways to handle these scenarios using Microsoft SQL Server OLAP Services. It assumes that the reader has a basic understanding of the concepts of OLAP in general, OLAP Services in particular, and multidimensional expressions (MDX).

The Types of Slowly Changing Dimensions

In The Data Warehouse Toolkit, Ralph Kimball identifies three types of slowly changing dimensions. When Kimball categorizes his types, he uses the maintenance technique as the type identifier. However, because techniques are often subjected to changes, improvements, and variations, it is better to classify the types by their behavior instead. This section presents alternative definitions of Kimball's dimension types based on their most important and distinguishing characteristic: reflection of historical events.

Type One—Rewriting History

Type One dimensions are those in which you want to rewrite history. A property of a member has changed and the change should be retroactively reflected in history as if the new property value were the original value. For example, take the case of the classification of a product that was incorrectly entered as Class A into the system and transactions were already filed on the product. When the classification is corrected to Class B, all of the historic transactions from Class A should reflect the new classification. In a simple normalized system, all you have to do is overwrite the values in the row of the product in the products dimension table. This is the technique Kimball suggests in his book.

This technique is far from sufficient in a modern data warehouse; managing this situation requires complex management of the transactions. The main problem with Type One dimensions lies in the maintenance of the aggregations. If the product moves from Class A to Class B, the appropriate adjustments need to be reflected in all of the classes' aggregations. This poses a formidable challenge in the data warehouse environment.

Type Two—Keeping Track of History

Type Two dimensions are those in which you want to keep a record of the old dimension data. For example, when an employee is moved from branch to branch, all of the old transactions should remain in the old branch, and only the new transactions should relate to the new branch. Kimball suggests creating a new record for the employee in the employee dimension table in this scenario. This suggestion will become the basis for the maintenance techniques suggested in this article. However, as with Type One dimensions, performing this first step is just the beginning of the story.

In using this technique, the main problem is the fragmentation of the data. The employee now has two members associated with the transaction history in the dimension table, thus causing fragmentation of the data. When browsing the transaction history, the employee will find transactions associated with his or her track record in two places in the hierarchy. This is a tough problem and this article will suggest some techniques to deal with it.

Type Three—Keeping Track of Versions in Parallel

Kimball strongly recommends not using this type of dimension. Microsoft agrees with him; therefore this article does not provide redefinition of or solutions for Type Three dimensions.

How the OLAP Server Handles Dimension Changes

Before delving into the solutions for these problems, it is important to understand how dimension members are maintained in the OLAP server. When a dimension is processed, all of the members are read from the dimension tables. When the members are read, two major operations happen:

This code is the basic building block of the entire OLAP system. From this point, all references to the dimension members and all storage and other representation of OLAP data are performed or acted upon using this code. The code is not arbitrary. It has a special internal structure from which all of the relationships between the member and its ancestor members can be derived. To some degree, the code is constructed like the path of a file in the operating system. Because of this similarity, the code is called the path of the member. For example, the path of Seattle in the geographic dimension might be "Whole World\USA\Washington\Seattle." In multidimensional expressions (MDX), the path is expressed as:

[Whole World].[USA].[Washington].[Seattle] 

Although this form of the path is readable, it is not how the path is actually stored. In the OLAP server, the path is encoded and stored in a compact set of bits that are not readable. However, one essential principle stays the same: the path code not only represents the member but also contains all of the necessary information about its ancestors (in this example, state, country, and world). This principle is the core of the basic architecture of the OLAP server and allows for the efficient data storage and fast query processing of the OLAP server.

Why is the path system so important when dealing with changing dimensions? The answer is simple: if the path contains the information about all of the hierarchical relationships of the member to its ancestor, there is no way to move a member in the hierarchy. If a member changes position in the hierarchy, then by the definition of the path identifier, it must be assigned a new path. If the member has a different path, it is no longer the same member.

Do not be too alarmed; the most popular way to update dimension members (also known in Kimball's book as Type Two Changing Dimensions) conforms well to this principle.

Incremental Updates

The way the OLAP server maintains dimension members affects incremental updates. Dimension members are always read with all of their ancestors. The OLAP server needs the ancestors in order to encode the path. If the path is already present in the dimension tree, this means that the dimension member is a known one, and the member is skipped. If the path does not exist in the tree, a new member is created and the path is properly encoded. What is being searched for is the path, not the member itself. For example, if Seattle were to move to Oregon (due to a big landslide), Seattle, Oregon would be considered a new member unrelated to Seattle, Washington. This principle allows two cities with the same name located in different states to coexist. A real-world example is Portland, a city in both Oregon and Maine. Because the new path is considered a new member, nothing happens to the old instance of Seattle in Washington. Thus, it can be deduced that:

Incremental updates of the dimensions allow the data in the cube to remain intact. The other form of dimension update, known as processing, reassigns new codes to all of the dimension members, thus causing all of the cube's data to become obsolete. After a dimension has been reprocessed, all cubes that use the updated dimension lose the cube data and must be reprocessed.

Managing Type One Slowly Changing Dimensions

There are some cases where a change in position of a member in the dimension hierarchy requires adjustment of all of its related transactions to the member in the new location. A typical example can be seen in a product hierarchy: a product may have been originally classified in Category A, but after further review, the product is moved to Category B. As explained in the previous section, the only way to move a member is to create a new member in the new position in the hierarchy. When this is done, nothing changes in the underlying fact table. All existing transactions related to the product remain under Category A and only new transactions roll up to Category B. Over time, the stream of transactions that roll up to Category A decrease and the number of transactions that roll up to Category B increase. Such an approach might cause a misconception that the demand for Category A is falling and the demand for Category B is on the rise. However, purchase patterns of the customers may not have changed at all. The customer may still be buying the exact same quantities of each product.

Because the objective in Type One dimensions is to rewrite history, you need to readjust all of the historic transactions to portray a product that has consistently resided in a single category, Category B, from the beginning. In a simple RDBMS environment this would have been easy; change the dimension table entry for the product and then reassign it to a different category. From that point on, all of the product's transactions are associated with the new category. Ironically, in a sophisticated and a well-designed data warehousing environment, the situation is far more complex. A modern data warehouse contains aggregations. These aggregations are maintained in either a cube (multidimensional OLAP, or MOLAP) or relational tables (relational OLAP, or ROLAP).

In this example, some of the aggregations may already maintain the summary values for Category A. Changing the dimension table will not affect this summary unless the aggregations are recalculated. However, recalculating all of the transactions would require reaggregation of all the transactions in the fact table.

There are several ways to maintain the aggregates:

The Simple Solution: Processing

A simple, yet effective, solution is to process the dimensions and cubes. As discussed before, processing rebuilds the whole dimension structure, which voids all of the old cube data. So, in addition to reprocessing the dimensions, a process of all of the cubes must follow. Applying this approach is easy. With a single mouse-click, you can Process the Database in the OLAP Manager. This causes all dimensions and cubes to be processed.

This approach eliminates the need to deal with changes in the hierarchies, rewriting history, deleting members, and so on. There is no history to deal with. No old members need to be deleted, and no hierarchies are changed. Each time a database is processed, it is as if the dimensions and cubes were processed for the first time.

When the database is processed, all changes to the dimension are written to the dimension table, replacing the existing values in the dimension record. For example, if a sales rep transfers from Branch A to Branch B, you only need to update the Branch field in the sales rep record.

This approach has two important drawbacks:

This approach is often applicable to small businesses where the volume of data is limited and can be read easily overnight (usually fewer than 10 million transactions accumulated in the last several years). It also may be used in businesses where it is not important to keep track of historical changes.

The main advantage of this approach lies in its simplicity. No database administration expertise is needed and no complicated logic or complex update procedures are required.

The Slow Solution: Virtual Dimensions

Virtual dimensions are pure Type One dimensions. Virtual dimensions are based on member properties. The values of the member properties can be modified by the incremental processing of the dimensions without ever causing a change in the member path. Because a member property always holds the last value written to it, and because virtual dimensions do not include aggregations, virtual dimensions comply fully with the spirit of the Type One dimensions Kimball describes.

However, virtual dimensions have several important drawbacks:

If the cardinality of the source and virtual dimensions is not a problem, using a virtual dimension is good solution for a Type One dimension. However, virtual dimensions will always be slower than regular dimensions.

The Sophisticated Solution: Correction Transactions

In many cases, especially in an enterprise data warehouse, the first two solutions are unacceptable. Reprocessing tens or hundreds of millions of rows to accommodate a single change in a dimension hierarchy is extremely expensive. With large dimensions, the performance of virtual dimensions may become too slow. Another solution must be adopted.

The proposed solution is the accounting system approach. In accounting systems, history is never rewritten. It does not mean that accounting systems do not contain incorrect transactions. Actually, there are many occasions where the wrong account is debited or credited. When this happens, the transaction is never modified or deleted. Instead, correction transactions are issued.

If a credit is wrongly issued to Account A when it should have been issued to Account B, two transactions will be formed. A voiding transaction debits Account A for the sum and a reassignment transaction credits Account B. The end result shows a rewritten history, but none of the historic transactions are actually modified. The only changes are the addition of more transactions.

The same approach can be used in the data warehouse. OLAP Services provides excellent tools for incremental updates of the data warehouse tables and cubes. However, these tools work on the premise that historical transactions are not modified and all increments are in the form of new transactions. This is exactly what the accounting system approach gives us.

When a product is moved from Category A to Category B, the following actions occur:

  1. As in the Type Two approach (described later in this article), a new member will be created in the dimension table. This member (under Category B) will reside alongside the old instance of the member (under Category A). An incremental update of the dimension should then take place.

  2. All of the transactions of the product should be collected. You can use one of following approaches:
  3. After the transactions are collected, they should be stored in a temporary table in the relational database. For each of those transactions, two new transactions must be generated:
  4. After all of the correction transactions have been generated, an incremental update should occur. Usually an incremental update consists of two parts:

The correction transaction approach has two known limitations:

One way to ease the filtering of the deleted members through the use of client tools would be to create a generic calculated member that will return NULL for the cell value if a member is deleted. Then, a regular NON EMPTY clause in the MDX query on the client side will omit all of the deleted members from the axes. A technique that exploits this behavior is described later in this article.

Managing Type Two Slowly Changing Dimensions

Type Two slowly changing dimensions keep track of history. Whenever changes occur in the dimension, either a change in the position in the hierarchy or a change in a property value, the existing transactions should continue to be associated with the old values (that is, those that existed before the change). Only new transactions will be associated with the new form of the member.

With Type Two slowly changing dimensions, historic information about the member must be maintained. When a member changes, the record in the dimension table is not updated. Instead, a new record is created for the dimension member. Of course, a new key must be assigned to the new member in the dimension table. From that point on, all transactions related to that member use the new member key.

This approach appears to be the perfect solution to the transferred sales rep scenario described earlier in this article. When the sales rep moves from one branch to another, a new record is created for the sales rep in the dimension table with a new key. From that point on, the sales rep in the new branch is treated as a sales rep different from the sales rep in the old branch. All of the existing transactions are still related to the original branch. Maintaining the dimension on the OLAP server is simple in this case. All that is necessary is to perform an incremental update on the dimension (which now includes a new record for the sales rep).

Something is wrong with this scenario. The two instances of the sales rep are treated as two different people, but this is not the case. Suppose the sales rep, Bob, wants to review personal annual sales. When he drills down under the [new] branch, he sees only the sales made after he reported to the new branch. Sales made during the time he reported to the original branch are not visible. His transaction history is fragmented. He must drill down to the original branch and look for the sales recorded by his original sales rep member. He must then manually consolidate the numbers from both members. While he might be willing to go through this tedious process, it may be still more complicated for his manager. When preparing the sales rep's annual review, the manager might not know that he reported to other branches during the year and might have the impression that the rep's track record is poor because only one fragment of his sales history is visible. Is there a way to keep track of history and still correctly consolidate numbers for a changing member?

A completely different scenario might occur when a property of the member changes. Suppose the sales rep marries and you want to track the performance of married sales reps versus single sales reps using the Marital Status dimension. All of the sales prior to marriage should be counted as sales by a single sales rep and the sales after the marriage should be counted as sales by a married rep. Updating the marital status of the sales rep in the dimension table is not the answer. This will cause a loss of the historic sales that were made while the sales rep was single. Creating a new member is impossible because the sales rep did not change position in the hierarchy.

What can be done? Here are two solutions to these problems:

Consolidation Using Calculated Members

This approach addresses the case where the dimension hierarchy has changed and you want to keep track of history. As with the correction transactions of Type One dimensions, the first stage is to add a member to the dimension in the new location. As discussed earlier in this article, this will immediately solve the problem of not losing track of the historical associations. However, there is still the problem of data fragmentation. There are many techniques to solve this problem. All of these solutions use a calculated member to derive the consolidated numbers for the member from the fragmented history.

Consolidation for an Individual Member

The most basic way to deal with the fragmentation problem is to consolidate by creating calculated members that aggregate the fragments. Such a calculated member may look like the following example:

[Consolidated Jane] = Aggregate({[old branch].[Jane], [new branch].[Jane]})

For each member that moves, create a dedicated calculated member to represent the consolidated fragments.

This approach may be suitable for dimensions where changes are rare. Creating the consolidation members does not require any automated infrastructure and the management can be done manually.

For most real-world cases, this technique has several major drawbacks:

In most cases, a more global method for managing the fragments will be required. The methods described in the next section suggest such techniques.

Consolidation for Any Member

Suppose that the cube has only one measure, [Sales]. Simply performing the incremental updates on the dimensions and cubes will cause [Sales] to contain the fragmented measures. In order to consolidate all of the fragments, use a calculated member [Consolidated Sales] defined in this pseudo-MDX:

Aggregate( {all instances of the current sales rep} , [Sales] )

This means that if this sales rep, Jane, moves through three branches during her career, the set will include the three members that represent her (each reporting to a different branch). The main problem is how to find out what this set contains. One simple way would be to filter the level and find all of the sets of members that have the same property, for example:

Aggregate( Generate ( {[Organization].CurrentMember} as S1, 
Filter([Organization].Members, 
S1.Current.Members(1).Name = [Organization].CurrentMember.Name )), [Sales])

This expression aggregates all members in the [Organization] dimension that have the same name as the current member of that dimension. Because this technique is similar to a self join in relational databases, it is necessary to use the alias for the set.

Using the name of the member as an identifier for a property is usually not a good idea. Names are usually not unique enough. Two different employees with the same name may work in the same company. Consider using a member property that uniquely identifies that employee. The following MDX expression uses "ID" as the unique identifier example:

Aggregate( Generate ( {[Organization].CurrentMember} as S1, 
Filter([Organization].Members, 
S1.Current.Members(1).Properties("ID") =
 [Organization].CurrentMember.Properties("ID") )), [Sales])

This expression will work, and work correctly. However, the computations may become very expensive. For each member, the OLAP server will have to scan the whole dimension to look for the matching related members. In cases where a dimension is changed fairly rarely, a member will have no other matching members. There would be no point in filtering the dimension.

One simple improvement is to add another property to the member that will tell you whether the whole dimension scan is necessary. Name this property "Status" and assume that the possible values are the following:

This new property can improve the efficiency of the expression:

iif( [Organization].CurrentMember.Properties("Status") = "Original", [Sales],
iif( [Organization].CurrentMember.Properties("Status") = "Moved", Null(), 
Aggregate( Generate ( {[Organization].CurrentMember} as S1, 
Filter([Organization].Members, 
S1.Current.Members(1).Properties("ID") =
 [Organization].CurrentMember.Properties("ID") )), [Sales])))

With this expression, the expensive calculation will occur only on the members that actually move on the hierarchy.

You can also use a different property convention to create a much more efficient expression. For example, a property called "Related Members" can contain the following values:

Using this property, the expression can look like the following example:

iif( [Organization].CurrentMember.Properties("Related Members") = "$Me", [Sales],
iif( [Organization].CurrentMember.Properties("Related Members") = "$None", Null(), 
Aggregate(StrToSet([Organization].CurrentMember.Properties("Related Members")), [Sales])))

This expression evaluates the set of related members directly from the member property. There is no need to scan the dimension for related members. This is far more efficient than the previous techniques. The cost is in the complexity of the data preparation: You must maintain this property for each dimension member.

Generalizing the Calculations: The Consolidation Dimension

All of the expressions demonstrated earlier in this article aggregate the fragmented data of the [Sales] measure. If you have several measures in the cube, you need to have a dedicated calculated measure for each of those measures. This can be cumbersome to create and maintain.

An alternate technique is to create a new dimension for the purpose of generalizing the problem. The dimension will have a single level that contains a single member: [Fragmented]. This [Fragmented] dimension adds no overhead to the storage and complexity of the cube because it contains a single level with a single member. Add a calculated member, [Consolidated Organization], to that dimension:

iif( [Organization].CurrentMember.Properties("Related Members") = "$Me", [Fragmented],
iif( [Organization].CurrentMember.Properties("Related Members") = "$None", Null(), 
Aggregate(StrToSet([Organization].CurrentMember.Properties("Related Members")), [Fragmented])))

This expression is almost identical to the previous expression except that the member being aggregated is now the [Fragmented] member on the new dimension instead of the [Sales] measure. This single expression will now work equally well on all of the measures. All the user has to do is slice by the [Consolidated Organization] member. This technique is also useful in the cases where there are multiple fragmented dimensions that need consolidation. Suppose the customer dimension is of the same type. All you have to do is to add another single calculated member, [Consolidated Customer]:

iif( [Customer].CurrentMember.Properties("Related Members") = "$Me", [Fragmented],
iif( [Customer].CurrentMember.Properties("Related Members") = "$None", Null(), 
Aggregate(StrToSet([Customer].CurrentMember.Properties("Related Members")), [Fragmented])))

Without the Consolidation dimension, you would have to create additional calculated members for every measure.

Maintenance Using Property Changes

It is common to define more than one cube dimension from a single dimension table. For example, from the product dimension, you might have the classic Department-Category-Brand-Product-SKU dimension, as well as other dimensions, such as Size, Color, Weight, and Package. All of these dimensions describe properties of the products. After physical dimensions (as opposed to virtual dimensions, which constitute an easier problem) are defined on these properties, managing the changes to these dimensions poses new challenges.

One way to manage these changes is to create a new row in the dimension table for the changed member. With this approach, it is necessary to create the new member upon a change in the values of any one of its properties. This involves more than the creation of new members when the hierarchical position changes. It also involves the creation of new members if the position of the product in the hierarchy stays the same but the color of the product (a value of one of its properties) changes. Because there is no guarantee that the member will change its position in the hierarchy, the new member record must come up with a new name for the member.

For example, if Product X in Category A is Red and the color changes to Pink, a new member must be created. Because there is already a Product X in Category A, the new member must have a different path: for example, Product X (v2) in Category A. A naming scheme must be devised to handle this change. After the naming problem is resolved, the fragmentation problem still exists, and you can use the previously discussed techniques to overcome it.

A more convenient and intuitive technique, and one that does not cause fragmentation, is the use of minidimensions. Kimball's solution for the minidimensions serves as the basis for a technique to manage Type Two property-based dimensions. This technique, which Kimball describes in The Data Warehouse Toolkit, calls for the creation of a separate dimension table for the properties of the members (called the minidimensions table) and extension of the fact table to maintain a key to the minidimension table directly. When a property value changes, it can be updated directly in the dimension table. However, because the properties that are actually used for queries are derived from the minidimension tables and no retroactive changes apply to existing transactions, it is still possible to relate the old transactions to old property values and the new transactions to the new property values.

A relational query generator must be aware that the property values in the main dimension table must not be used for any GROUP BY queries. Instead, the GROUP BY must apply to the minidimension table. A well-designed OLAP cube will take this into account, and such problems should not occur when using OLAP Services.

Changes in the Hierarchy for High-Level Members

Thus far, this article discussed how to manage changes to leaf-level members (members of the lowest level in the hierarchy). However, changes to the hierarchy do not take place only on the leaf level. Instead, a category may move from one department to another, a branch may move to another district, and so on.

Sometimes you may want to handle these changes as Type One changes. On other occasions you may want to handle these changes as Type Two. In both cases, the same principles apply. When a high-level member moves in the hierarchy, new member instances must be created for the member and all of its descendants through the last level (inclusive), for example:

All
         Department 1
            Category A
               Product X
               Product Y
         Department 2
            Category B
               Product W
               Product Z

If Category A is moved to Department 2, the updated dimension tree should look like the following example:

All
         Department 1
            Category A
               Product X
               Product Y
         Department 2
            Category B
               Product W
               Product Z
            Category A
               Product X
               Product Y

A whole subtree is added under Department 2.

Both the Type Two and Type One techniques can be applied to the newly generated members. The Type Two technique will be applied to all of the newly created members (both the category and the products). The intelligent use of member properties is the key to efficient management of this situation.

For Type One dimensions, the accounting system is naturally and seamlessly expanded to deal with the newly created members. Only leaf-level members will generate correction transactions. As far as the accounting system is concerned, there is no awareness that a high-level member has changed position. What the accounting system sees is that several leaf-level members have changed position. The accounting system will treat them as if they had changed position outside of the context of the high-level member changes.

Conclusion

Although they pose some difficulties for data warehouses, slowly changing dimensions are a fact of life for virtually all businesses. This article outlines some of the scenarios that require you to work with slowly changing dimensions and offers some suggestions on how to handle them smoothly in your data warehouse. OLAP Services provides MDX and calculated members, PivotTable Service, and virtual dimensions to help you plan for and accommodate change in your OLAP data.

Finding More Information

For more information about MDX, see OLAP Services online Help.

For more information about slowly changing dimensions, see Ralph Kimball's book The Data Warehouse Toolkit, ISBN: 0471153370.