Algorithms

Hash It Out

Three VBA Hash Table Implementations

By Rod Stephens

Databases like Access typically use trees to store index information. This lets the database locate items very quickly. For databases stored on a hard disk, or other slow storage device, speed is critical. Reading data from a disk takes a relatively large amount of time, so the database must find the data as quickly as possible. Trees make it easy to find ordered data, but under some circumstances, you can locate items even faster using a hash table.

A hash table is a data structure that allows you to store and retrieve items based on a key. This is similar to the functionality provided by the Collection object in VBA. You can add items to a Collection, optionally specifying a key. Later, you can use the keys to locate particular items.

While creating a hash table yourself is a bit more work, a customized data structure can give you more flexibility and better performance than Collections.

The Basics

Suppose you want to store and quickly locate a few dozen items that have unique integer keys between 0 and 99. You could build an array with index bounds from 0 to 99. Then, you could store an item with key K in position K in the array. To locate an item with key K, you would look in position K. The following code fragment shows this simple scheme:

Dim Values(0 To 99) As Variant

' Insert an item with key 13 in the array.
Values(13) = "This is a really simple hashing strategy"

' See if a value with key 27 is present.
If IsEmpty(Values(27)) Then
  MsgBox "Key 27 is not present"
Else
  MsgBox "Values(27) = " & Values(27)
End If

This method is very simple. It's also extremely fast. With just a single array access, you can tell if the item is present and find its value.

Unfortunately, in the real world, key values don't always map into such a conveniently small range of values. For example, suppose you want to be able to locate employees using their Social Security numbers. There are roughly one billion possible Social Security numbers of the form 123-45-6789. To use the previous hashing strategy, you would need to allocate a one-billion-entry array. If each employee's information occupied 1KB of storage, the array would take up one terabyte (one million megabytes) of memory. Chances are good you don't have that much memory or disk space available on your computer. Even if you had that much storage, more than 99 percent of it would always be unused, unless you had more than 10 million employees.

In cases like this, where the number of possible key values is huge, you need to map the key values into a relatively small hash table. For example, if you have around 80 employees, you might create an array with 100 entries indexed from 0 to 99. Then, you could calculate an employee's Social Security number Mod 100 and map the employee's information to that entry. For example, an employee with Social Security number 1234-56-7890 would map to position 90 in the array. Using this scheme, you would only need 100 employee entries taking up a total of 100KB of memory. If you have 80 employees, only 20 percent of that space is unused, so you aren't wasting a lot of space.

Note that this array contains only 100 entries, but there are one billion possible key values. In that case, 10 million possible key values will map to each array entry. For example, all Social Security numbers that end with 99 will map to position 99 in the array.

That means that sometimes you may try to put a record in the table and find that it is occupied by another record. This is called a collision. To resolve this problem, you need a collision resolution policy. This is a rule that tells where an entry goes when the place it belongs to is already occupied. Usually, the policy is to remap the record to another position in the table. If that position is also in use, the record is remapped again and again until an empty position is found. The sequence of positions that are examined searching for an empty position is called the item's probe sequence.

To summarize, a hashing scheme requires three things:

A data structure called a hash table that holds the entries.

A hashing function that maps keys to entries in the hash table.

A collision resolution policy that tells where to place an item when its natural position in the table is already occupied.

Linear Probing

One of the more popular types of hashing is called open addressing. Here, the value of the key is used to calculate an offset in memory where the data should be placed. The previous examples used open addressing to map a key value to a position in an array.

There are several varieties of open addressing schemes that differ in their collision resolution policies. The simplest is called linear probing. If an item's position is occupied, the program simply looks at the next position. If that position is also occupied, the program looks at the next position. The program continues looking through the array until it finds an empty position, or it has searched the entire array. If the program finds an empty position, it inserts the item there. Otherwise, the hash table is full, and there is no room for another item.

The following code shows a user-defined type and constants used to manage simple hash tables:

' Define a UDT for hashing examples.
Public Type HashType
  Value As Variant
  Key As Long
End Type

' Constants for hash table operations.
Public Const hashInsertOk = 1
Public Const hashInsertTableFull = 2
Public Const hashInsertDuplicateKey = 3

Listing Two, beginning on page X, shows the VBA source code for a hash table class that uses open addressing. After creating a LinearHashTable object, the program should call the SizeTable method to allocate the hash table array. It can then use the AddItem function to insert items into the table. It can use the FindItem function to locate items in the hash table. This function searches the hash table and returns the value of the item if the key is present in the table. If the key isn't present, the function returns the Variant value Empty. (This and the other examples in this article are available for download; see end of article for details.)

The Linear UserForm, shown in FIGURE 1, demonstrates open addressing with linear probing. Enter a value in the # Values text box and click the Create Values button to make the program insert random values in the hash table. To keep the program simple, each item's value and key are the same, and are numbers between 0 and 999.

FIGURE 1: The Linear UserForm demonstrates linear probing.

Enter a value in the Value text box and click the Get button to make the program locate the item in the hash table. In FIGURE 1, the program has just located the item 862.

Enter a value and click the Set button to make the program insert the item in the table.

Quadratic Probing

Linear probing is fast and simple, but it suffers from an annoying primary clustering effect. When you add many items to the hash table, they tend to cluster together. The hash table shown in FIGURE 1 is only half full, but many of the items lie next to others. That makes inserting and finding items slower than it would be if the items were more evenly spaced. For example, if you try to insert the value 109 in this table, its probe sequence will collide with the items 179, 479, 281, 382, and 283 before it finds an empty position.

If the values were spaced exactly evenly, there would be an empty spot next to every used position. Then the program could find an empty position for any new item in, at most, two tries.

The reason clusters form is that there is a slightly higher probability of a new item being inserted next to an existing item than of it landing somewhere else. For example, suppose a hash table with N entries contains one item. When you insert a new item into the table, there is a 1/N chance that it will land in any particular position. However, if the item maps to the same position as the item that is already in the table, it will be inserted next to that item and it will form a small cluster. If the item maps to one of the positions next to the previous item, it will also start a small cluster. There is a 3/N chance that the new item will land next to the previous item. As the table fills, chances are good that large clusters will form.

The reason clusters form is that any item that maps to part of a cluster gets added to the end of the cluster. You can prevent that behavior using quadratic probing. In this method, the indexes in an item's probe sequence are given by the equation (K + P2) Mod N where N is the size of the table, K is the key, and P = 0, 1, 2, ... The program follows this probe sequence until it finds an empty position, or until it has checked N positions. At that point the program gives up and refuses to insert the new item into the table.

FIGURE 2 shows two probe sequences for an item being inserted into the first position in a hash table. The linear probe sequence examines a few adjacent positions, then inserts the new item next to the others, extending the cluster. The quadratic probe sequence, on the other hand, jumps through the array skipping entries, and inserts the new item in a position not adjacent to the cluster.

FIGURE 2: Quadratic probing reduces primary clustering.

The way quadratic probing skips through the hash table makes clusters less likely to form and makes them grow more slowly. That means a program that uses quadratic probing can insert and find items more quickly on the average than one that uses linear probing.

The code for the QuadraticHashTable class (shown in Listing Three beginning on page XX) is very similar to the LinearHashTable code. The only differences are in the AddItem and FindItem routines.

The Quadratic UserForm shown in FIGURE 3 is also similar to the previous example, except it uses quadratic probing instead of linear probing. Although the same values were inserted in FIGURES 1 and 3, the clusters in FIGURE 3 are smaller.

FIGURE 3: The Quadratic UserForm demonstrates quadratic probing.

Pseudo-random Probing

Quadratic probing reduces primary clustering, but it still has some problems. Items that initially map to the same position in the array follow the same probe sequence. For example, if the hash table contains 100 entries, the values 100, 200, 300, and so on, all follow the same probe sequence. If the program inserts many items that map to the same position, they will form a secondary cluster spread throughout the array. The effect isn't as noticeable as that of primary clustering, but still reduces performance.

One way to eliminate secondary clustering is to use a pseudo-random probe sequence. With this technique, the locations in an item's probe sequence are given by (K + Rand(P)) Mod N, where N is the size of the table, K is the key, P = 0, 1, 2,... and Rand(P) is the Pth number in a sequence of random numbers. The sequence of numbers depends on the key's value, so different values will have different probe sequences, even if they initially map to the same position.

In VBA, you can use the random number function Rnd to produce sequences of pseudo-random numbers. To initialize Rnd, first call Rnd with the parameter -1. Then, call Randomize with the value as a parameter. For example, the program uses the following code to initialize the random number generator for the value new_key:

Rnd -1
Randomize new_key

When the program later needs to locate this item in the hash table, it repeats these steps to initialize the random number generator for the same new_key value. Then Rnd will produce the same sequence of numbers it produced when the item was added to the table. This is important: If it produced a new sequence of random numbers, the program wouldn't be able to follow the same probe sequence it used to insert the item, so it would be unable to find the item.

The RandomHashTable class is similar to the previous hash table classes (see Listing Four, beginning on page XXX). Again, the only differences are in the AddItem and FindItem routines.

Conclusion

While quadratic and pseudo-random probing often give better performance than linear probing, they also have some drawbacks. Neither of them is guaranteed to visit every item in the hash table. For example, suppose you have a hash table with only eight entries. To insert item 4 into the table using quadratic probing, the program would follow this probe sequence:

4 + 02 = 4

4 + 12 = 5

4 + 22 = 8 = 0 mod 8

4 + 32 = 13 = 5 mod 8

4 + 42 = 20 = 4 mod 8

4 + 52 = 29 = 5 mod 8

4 + 62 = 40 = 0 mod 8

etc.

After eight probes, the sequence has only visited positions 0, 4, and 5. If all the table's entries are full except for entries 6 and 7, the program cannot insert the item in the table even though there is space available. Similarly, there is no way to know if or when a pseudo-random probe sequence will visit all the entries in the table.

Even so, quadratic and pseudo-random probing usually provide good performance when the table isn't too full. Of course, when the table holds a lot of empty entries, linear probing does quite well, too.

Using these hash table classes, you can store items and search for keys extremely quickly. If you size your hash table properly, you can get much better performance than you can out of VBA's collections, or even indexed searches in a database.

The files referenced in this article are available for download here.

Rod Stephens has written several books published by John Wiley & Sons, including Custom Controls Library [1998] and Visual Basic Graphics Programming [1997]. His book Ready-to-Run Visual Basic Algorithms, Second Edition [1998] has more to say about hash tables and other algorithms. He also writes algorithm columns in Visual Basic Developer and Delphi Informant. Visit his Web site at http://www.vb-helper.com or contact him at RodStephens@vb-helper.com.