Data Compression and Visual Basic

Wally Foulkes

The genesis of this article should be familiar: Wally read somewhere that it just wasn't feasible to do Huffman compression in VB. Well, he rose to the challenge -- as the accompanying Download file proves.

There are certain things in the life of a programmer that are as sure as death and taxes -- your hard disk will crash and months of work will vanish, the piece of code you spent the most time on will either turn out to be the reason your program won't work or else be cut from the spec at the last minute, and the data stored on your hard disk will rise to meet the disk's capacity. I've experienced all of the above. Of the three, I think diminishing disk space is the easiest problem to solve, thanks to data compression.

The first attempts at data compression were done with hardware. I remember when Run Length Encoding (RLE) was all the rage. RLE simply looked at runs of the same character and saved each run as the character followed by a count of how long the run was. For example, a string of 20 e's would be saved as the character e followed by a count byte holding 20 -- the original 20 bytes reduced to two.
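
Here's a minimal VB sketch of the idea (illustrative only -- not from the accompanying Download file). With a one-byte count, runs cap at 255:

 Public Function RleEncode(ByVal s As String) As String 
    Dim i As Long, runLen As Long, ch As String 
    i = 1 
    Do While i <= Len(s) 
       ch = Mid$(s, i, 1) 
       runLen = 1 
       ' Extend the run while the same character repeats. 
       Do While i + runLen <= Len(s) And runLen < 255 
          If Mid$(s, i + runLen, 1) <> ch Then Exit Do 
          runLen = runLen + 1 
       Loop 
       RleEncode = RleEncode & ch & Chr$(runLen) 
       i = i + runLen 
    Loop 
 End Function 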

Modern compression algorithms are a little more complicated than this, but they still take advantage of duplication in the data. Letters like e, t, r, and s occur more frequently than x or z in most text. Database fields are padded out to fill the declared length. As computers got faster and more powerful, it became practical to write programs to compress and expand data.

Data compression is divided into two distinct worlds: lossy and lossless compression. Lossless compression consists of those techniques guaranteed to produce an exact duplicate of the input data stream after a compress/expand cycle. Lossy compression allows some of the original data to be lost during compression and decompression. Speech, music, and images can lose a lot and still be understood or recognized. On the other hand, databases, spreadsheets, and documents would be corrupted by the loss of a single bit of information. [Our brains are far more tolerant of "fuzzy matches" than the accountants are. -- Ed.]

The more we know about the data content of the files and the environment in which the compression will be executed, the better. After all, a compression routine can be written to maximize compression or throughput. More advanced compression algorithms lean heavily on the field of information theory and the notion of entropy: The greater the information content, the higher the entropy. In the context of a message, each symbol's entropy -- expressed in bits -- is determined by the probability of its occurrence:

 Number of bits = -log2(probability) 
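
Classic VB has no base-2 logarithm function, but the formula translates directly. A minimal sketch (the probability in the example is just an illustration):

 ' Code length in bits for a symbol with probability p. 
 ' In VB, log2(x) = Log(x) / Log(2). 
 Public Function SymbolBits(ByVal p As Double) As Double 
    SymbolBits = -Log(p) / Log(2) 
 End Function 

 ' Example: SymbolBits(0.417) returns roughly 1.26. 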


If the symbol "e" occurs 20 times in a message and its entropy is 1.26 bits, then it contributes 20 x 1.26 = 25.2 bits of encoded data to the complete message. To get the entropy of the message as a whole, you add the entropies of all the symbols in the message. In his seminal 1952 paper, "A method for the construction of minimum-redundancy codes," D.A. Huffman introduced a new method for data compression that remained state-of-the-art until about 1980. Huffman's method can still be found today in commercial products and is frequently used as one stage of two-stage compression with Lempel-Ziv sliding window compression (LZH). This article explains lossless compression using the statistical model and the Huffman algorithm.

Huffman encoding
First, the program reads the file and counts the occurrences of input stream symbols. In this step, the program gathers the statistical information needed to build the compression model. Notice that I use the term symbols, not letters -- the algorithm doesn't care whether the input stream is ASCII, EBCDIC, or binary. It just looks at the bits in the bytes so it can encode any type of file.

Next the program constructs a binary tree from the bottom up based on the symbol counts (probabilities). Trees are described by a number of terms borrowed from other disciplines. The root node is the top of the tree. The nodes might or might not have children. A node without children is called a leaf. A node that has a child or children is a parent node. Branches connect nodes. Nodes have weight. For this algorithm, the weight of a parent node is the sum of the weights of its children. To build the tree, the program finds two nodes without parents that have the lowest weights. It creates a parent node for the two nodes and assigns it a weight equal to the sum of the two nodes. The parent node is added to the list of nodes without parents, and the two child nodes are removed from the list. One child node is designated as the path from the parent node when decoding a 0, the other child node the path when decoding 1.

The program repeats this process of finding minimum weight pairs, creating new parents for them and updating the node list until there's only one node left without a parent. This last node is designated as the root of the tree. The tree becomes the model -- or engine -- of the compressor. To illustrate this, assume you have the following list of symbols and weights:

 List = {A = 20, B = 7, C = 6, D = 5, E = 3} 


The two symbols with the smallest weights are E and D. They become the children of the newly created parent node I'll call ED. This parent node's weight is the sum of its children, which is 8. The list update rule says to remove E and D from the list and add the parent ED. The list now looks like this (also see Figure 1):

 List = {A = 20, B = 7, C = 6, ED = 8} 


Now the nodes with the smallest weights are C and B, and they become the children of the newly created parent node I'll call CB. This parent node's weight is the sum of its children, which is 13. C and B are dropped from the list, and the parent CB is added. The list now looks like this (also see Figure 2):

 List = {A = 20, ED = 8, CB = 13} 


The nodes with the smallest weights are now ED and CB, which become the children of the new parent node I'll call EDCB. This parent node's weight is the sum of its children, 21. ED and CB are dropped from the list, and the parent EDCB is added. The list now looks like this (also see Figure 3):

 List = {A = 20, EDCB = 21} 


Finally, the only remaining nodes in the list are A and EDCB. They become the children of the newly created parent node I'll call AEDCB. This parent node's weight is the sum of its children, 41. A and EDCB are dropped from the list, and the parent AEDCB is added. The list now looks like this (also see Figure 4):

 List = {AEDCB = 41} 


When there's only one node left in the list, that node is the root node. In the code, I defined the following structure:

 Type Node 
    Weight As Integer      'Node weight 
    Child1 As Integer      'Branch 1 path 
    Child0 As Integer      'Branch 0 path 
 End Type 


I assigned the smaller weight of the pair to Child0 and the other to Child1. I could have swapped the assignments throughout the tree (see Figure 5).

The symbol codes would change, but the number of bits in the codes would be the same. Notice that leaf nodes have different numbers of code bits. The symbol A has a 1-bit code, while the other symbols have 3-bit codes. The symbol A occurs most frequently in the data, so its entropy is the lowest. The tree gives it the fewest bits, so it will take up the least space in the compressed data. How do we handle two or more symbols with the same weights? With rules -- applied identically when encoding and decoding, so both sides build the same tree.
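
Working the example tree by hand -- always giving Child0 the smaller weight of each pair -- the codes should come out as:

 A = 0    (1 bit) 
 E = 100  (3 bits) 
 D = 101  (3 bits) 
 C = 110  (3 bits) 
 B = 111  (3 bits) 

Encoding the whole message this way takes 20(1) + 7(3) + 6(3) + 5(3) + 3(3) = 83 bits, versus 41 x 8 = 328 bits uncompressed.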

After building the Huffman weight tree, the program writes the tree counts to the output file. This allows the decoder to rebuild the tree when it decompresses the data. The tree data is the overhead in an encoded file. Counts are saved in front of the encoded data.

Next, the program reads the file again and replaces the symbols it reads with the Huffman codes. Note that the encoder outputs a bit stream, not a series of bytes. Because the symbols that occur most frequently are represented by the shortest strings of bits, the resulting output will be shorter than the input string (see Figure 6).

Huffman decoding
To decode the compressed data, the program first reads the tree counts from the encoded file. It then recreates the binary tree used to compress the file from the bottom up based on the counts. This process is similar to the one used by the encoder to build the tree. Next, the program reads the Huffman codes from the compressed data and converts them into the corresponding symbols. This process involves reading the data bit by bit and using the bits to walk through the tree to a leaf. When the program reaches a leaf, it outputs the symbol represented by that leaf (see Figure 7).

These Huffman algorithms for encoding and decoding are based on what's called an order 0 statistical model. The order indicates how many previous symbols the model considers. Order 0 doesn't look at previous symbols, while order 1 looks at one previous symbol, order 2 looks at two previous symbols, and so on.

An order 0 model requires a relatively small overhead of around 256 bytes minimum to save the symbol counts. A higher-order model would give better compression, but the cost in overhead makes it prohibitive. An order 1 model would require the program to save the counts for 256 different trees, or 65,536 bytes minimum.

The VB encoder
Compress.exe, available in the accompanying Download file, is easy to use. You can type the input path directly or use the File menu; the output path text box offers a default based on the path and filename of the input. To accept the default, click Compress. The input file is opened for binary access read, and the encoded output file is opened for binary access write. The LOF function is used to get the size of the input file in bytes. Counting characters and outputting Huffman codes can take an appreciable amount of time for big files, so a percent complete progress bar is set up by dividing the LOF by 100 and adding 1 to initialize a ProgressStep variable.

The CharacterCounts subroutine loads the Counters array of 256 Longs, calling ProgressDisplay at appropriate times to update the percent complete. The lstCounts list box is loaded with the raw counts. There's a print routine here if the user decides to print the raw counts. I close the input file at this point, but I'll reopen it later.

ScaleCounts is the next subroutine called. To keep overhead low, I only want to send integer counts to the output file. Since these tree weights are additive, they can become quite large. The program scales them so the weights can fit within integers. There's room for some experimentation here. I really hate magic numbers, but this routine has one. I can't explain dividing the MaxCount by 255, other than to say that it works. I experimented here some, but in the end I left it alone.

This routine also loads the lstWeights list box with scaled weights (counts), making it easy to compare raw and scaled counts. Selecting a count in one list box selects the corresponding count in the other list box. The exception to this is the weight for node 256. Node 256 is the special EndOfStream (EOS) symbol that the decoder uses to find out where to stop decoding Huffman codes. EOS is assigned a weight of 1.
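
A sketch of the scaling step as I read the prose (the Weights array name is a stand-in; the exact ScaleCounts is in the Download file):

 Dim Scale As Long, i As Integer 
 Dim Weights(256) As Integer        ' scaled counts: 256 symbols + EOS 
 Scale = (MaxCount \ 255) + 1       ' the magic number 
 For i = 0 To 255 
    Weights(i) = CInt(Counters(i) \ Scale) 
    ' Don't let a rare symbol scale away to nothing. 
    If Counters(i) > 0 And Weights(i) = 0 Then Weights(i) = 1 
 Next i 
 Weights(256) = 1                   ' EndOfStream gets weight 1 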

The BuildTree function creates the binary tree and returns the root node in this user-defined array structure:

 Private Type TreeNode 
     Weight As Integer 
     SavedWeight As Integer 
     Child1 As Integer 
     Child0 As Integer 
 End Type 


The Weight member holds the scaled weight before the tree is built; it's 0 after the tree is built. The SavedWeight member preserves the scaled weight for the OutputCounts routine. The Child1 member contains the 1-branch node if it exists. The Child0 member holds the 0-branch node if it exists.

   Dim Nodes(514) As TreeNode 


There are 257 symbols (0-256), up to 256 parent nodes (257-512), one dummy node (513), and one unused element (514).

We know that leaf nodes occupy some or all of the first 257 elements (0-256); therefore, the first free node that can be created is node 257. The variable NextFree is initialized to this value. Dummy node 513 is initialized to &H7FFF to guarantee that no weight will exceed it as the tree is built.

The variables Min1 and Min2 are initialized to node 513. Following the rules of the algorithm, Min1 will contain the node of the first instance of a non-zero minimum weight. Min2 will contain the next instance of a minimum non-zero weight.

When creating a parent node, the weights of the nodes in Min1 and Min2 will be added together and placed in the Weight variable of the next free node. Min1's node and Min2's node will have their Weight variables saved in their SavedWeight variables, and their Weight variables will be reset to 0. They're then eliminated from the list in this way.

The parent node is added to the list because it now has a non-zero Weight. Child 0 of the parent node is the node in Min1, and the Child 1 of the parent node is the node in Min2.

This process continues until Min1 contains the last free node and Min2 contains the node 513. The For loop exits, NextFree is decremented by one, and NextFree is returned to the caller as the RootNode.

The For loop checks nodes in ascending order. Several nodes may have the same weight, but Min1 will contain the first one encountered and Min2 the next encountered.
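
Pulling the prose together, the pairing loop looks something like this -- a sketch using the article's names, not the Download file's exact code:

 Dim i As Integer, Min1 As Integer, Min2 As Integer 
 Dim NextFree As Integer 
 Nodes(513).Weight = &H7FFF          ' dummy: no real weight exceeds it 
 For NextFree = 257 To 513 
    Min1 = 513: Min2 = 513 
    For i = 0 To NextFree - 1        ' scan leaves and parents so far 
       If Nodes(i).Weight <> 0 Then 
          If Nodes(i).Weight < Nodes(Min1).Weight Then 
             Min2 = Min1: Min1 = i 
          ElseIf Nodes(i).Weight < Nodes(Min2).Weight Then 
             Min2 = i 
          End If 
       End If 
    Next i 
    If Min2 = 513 Then Exit For      ' only one parentless node left 
    With Nodes(NextFree)             ' create the parent 
       .Weight = Nodes(Min1).Weight + Nodes(Min2).Weight 
       .Child0 = Min1 
       .Child1 = Min2 
    End With 
    Nodes(Min1).SavedWeight = Nodes(Min1).Weight 
    Nodes(Min2).SavedWeight = Nodes(Min2).Weight 
    Nodes(Min1).Weight = 0           ' drop both children from the list 
    Nodes(Min2).Weight = 0 
 Next NextFree 
 BuildTree = NextFree - 1            ' NextFree - 1 is the root node 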

As part of the development process, I drew the tree produced from my test file on paper to verify how the tree was constructed, and I used it later to verify the Huffman codes produced in the next routine. If the file to be encoded contains every ASCII character, the RootNode will be node number 512. In the unlikely event that all nodes have the same Weight, Node 0 and Node 1 will have nine-bit codes, and all the rest will have eight-bit codes, resulting in an encoded file that's larger than the input file.

If the user requests a hard copy of the nodes, the PrintNodes Sub is called and passed the RootNode. This routine formats data from non-zero nodes and sends it to the printer. The next routine, which generates the Huffman codes, employs another user-defined array:

   Private Type CharCodes 
      Code As Integer 
      CodeBits As Integer 
   End Type 
   Dim Codes(257) As CharCodes 


The Code member holds the Huffman code for the symbol. The CodeBits member holds the number of bits in the Huffman code.

Binary trees are great for decoding, but not for encoding in my architecture, so this routine recursively walks the tree in pre-order and loads the Codes array with each symbol's Huffman code and the number of bits in the code. The initial arguments are CodeSoFar = 0, Bits = 0, and RootNode. These parameters are pushed on the stack each time the routine recurses. On each call, CodeSoFar is shifted left one position and Bits is incremented by one, following the 0 branch, until a leaf is reached. At that point, CodeSoFar is saved as the leaf's Code and Bits as the leaf's CodeBits. ORing a 1 into CodeSoFar produces the code for the sibling reached through the 1 branch. When a recursion can't go down any further, the routine pops the values off the top of the stack, backs up to the current node's parent, and takes the 1 branch instead of the 0 branch. This process continues until all codes have been generated.
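
In code form, the walk looks something like this (a sketch -- the routine name and exact shape are my assumptions, not the Download file's code):

 Private Sub ConvertTreeToCode(ByVal CodeSoFar As Long, _ 
    ByVal Bits As Integer, ByVal NodeNum As Integer) 
    If NodeNum <= 256 Then          ' a leaf: record its code 
       Codes(NodeNum).Code = CodeSoFar 
       Codes(NodeNum).CodeBits = Bits 
    Else                            ' internal: descend both branches 
       ConvertTreeToCode CodeSoFar * 2, Bits + 1, _ 
          Nodes(NodeNum).Child0 
       ConvertTreeToCode (CodeSoFar * 2) Or 1, Bits + 1, _ 
          Nodes(NodeNum).Child1 
    End If 
 End Sub 

 ' Initial call: ConvertTreeToCode 0, 0, RootNode 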

The PrintModel subroutine prints information about the compression model. I used the other print options for debugging, but this option gives the big picture. PrintModel displays a node number, a possible descriptor, SavedWeight, and the code stream. For internal nodes, the routine gives node number, SavedWeight, and the child nodes. For descriptors, PrintModel calls the PrintCharacter subroutine. The descriptor will be the printable ASCII character, NULL, LF, CR, Space, EOS, or nothing. PrintModel also calls Hex2Bin to display the code stream.

The OutputCounts subroutine reduces the overhead in the encoded file by compressing runs of zero counts. The scheme for saving weights is to output FirstNode, LastNode, and the intervening Node().SavedWeights. Outputting a single weight takes three integers.

It searches the first 256 nodes of the Nodes array in ascending order, looking for a node with a non-zero SavedWeight, and assigns that node to FirstNode. The search continues, now looking for a node with a zero SavedWeight; that node number minus 1 is assigned to LastNode. NextNode is loaded with LastNode + 1, and the search continues, looking for more nodes with zero weights. If there are more than three zero SavedWeights in a row, FirstNode, LastNode, and the intervening SavedWeights are sent to the output file; FirstNode is then loaded with NextNode + 1, and the process continues as before. If there are fewer than four zero SavedWeights in a row, LastNode is loaded with NextNode and the search for a zero SavedWeight resumes -- a short run of zeros costs less to store than a new FirstNode/LastNode pair. The program uses the EndWeightStream (EWS) constant of &H7FFF to mark the end of the saved weights.

At this point, the program reopens the file to be encoded and calls the CompressFile subroutine. This routine initializes the Bitio.Mask variable to &H80, reads each byte in the file again, and passes the byte's Huffman code and number of code bits to BitHandler. When the input file reaches EOF, the program passes the EOS code and its bit count to BitHandler. CompressFile is also responsible for updating the percent complete display and for seeing that the Bitio.Rack is flushed to the encoded file.

BitHandler is the subroutine that saves the Huffman codes in the encoded file. The Huffman algorithm reads symbols and outputs bits one at a time, but VB can't save data one bit at a time. BitHandler uses the following structure to get around this problem:

   Private Type BitFile 
      Mask As Byte  
      Rack As Byte 
      OutputCount As Long 
   End Type 
   Dim Bitio As BitFile 


The program loads the "Rack" with the code bits. When the Rack is full, the program writes it into the output file. The Bitio.Mask value points to the bit of interest in the Bitio.Rack. The value iBitCount is used to left-shift the SymbolMask, which points to the bit of interest in iSymbol. The SymbolMask is ANDed with iSymbol; if the result is non-zero, the Bitio.Mask is ORed into the Bitio.Rack to place a 1 there. This process continues until the SymbolMask has been shifted right to 0. If the Bitio.Mask gets shifted right to 0 before the SymbolMask does, Bitio.Rack is saved to the encoded file, Bitio.Mask is reset to &H80, and the Bitio.Rack is reset to 0. The routine then returns to CompressFile to get a new code to save.
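
Here's a sketch of that logic; the shape and the file number are assumptions (the real BitHandler is in the Download file):

 Private Sub BitHandler(ByVal iSymbol As Long, _ 
    ByVal iBitCount As Integer) 
    Dim SymbolMask As Long, i As Integer 
    SymbolMask = 1 
    For i = 2 To iBitCount          ' left-shift to the top code bit 
       SymbolMask = SymbolMask * 2 
    Next i 
    Do While SymbolMask > 0 
       If iSymbol And SymbolMask Then        ' this code bit is a 1 
          Bitio.Rack = Bitio.Rack Or Bitio.Mask 
       End If 
       SymbolMask = SymbolMask \ 2           ' next code bit 
       Bitio.Mask = Bitio.Mask \ 2           ' next rack bit 
       If Bitio.Mask = 0 Then                ' rack full: flush it 
          Put #2, , Bitio.Rack               ' assumes output is file #2 
          Bitio.Rack = 0 
          Bitio.Mask = &H80 
       End If 
    Loop 
 End Sub 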

The program calculates the compression percentage using this formula:

 100 - ((encoded length / input file length) * 100) 


A negative compression percentage indicates that the encoded file is bigger than the input file by that percentage.

The VB decoder
To use expand.exe, you identify the file to decompress and the output path, as you did with compress.exe. Fortunately, expansion is a good deal simpler than compression. The VB decoder begins by opening the encoded input file and the expanded output file for binary access.

The InputCounts function uses a Nodes array like the encoder does. It reads the FirstNode integer and the LastNode integer and uses a For loop to load the count integers into the Nodes array. This For loop is inside an endless Do loop, which exits when FirstNode contains EndWeightStream (&H7FFF). The function returns an integer containing the bytes read. The returned integer is used to determine the ProgressStep for the ProgressDisplay. The BuildTree function is exactly like the one used by the encoder. The trees in the encoder and decoder must be exactly alike or the expanded data will be unintelligible.

The ExpandData routine sets the Bitio.Mask value to &H80 and starts an endless Do loop that exits when the program reads EOS. A nested Do loop requests bits from the InputBit routine and uses them to walk through the tree to a leaf. Because all leaf nodes are numbered 256 (EOS) or below, the nested loop exits at a leaf; if the leaf isn't EOS, its symbol is output to the expansion file. This routine also calls ProgressDisplay to indicate expansion progress.
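
The walk itself is only a few lines. A sketch, using the InputBit function described next (names assumed; file #2 stands in for the expansion file):

 Dim node As Integer 
 Do 
    node = RootNode 
    Do While node > 256             ' internal nodes sit above EOS 
       If InputBit() Then 
          node = Nodes(node).Child1 
       Else 
          node = Nodes(node).Child0 
       End If 
    Loop 
    If node = 256 Then Exit Do      ' EOS: stop decoding 
    Put #2, , CByte(node)           ' write the leaf's symbol 
 Loop 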

The InputBit function reads a byte from the compressed file into the Bitio.Rack whenever the rack has been exhausted. The value Bitio.Mask is ANDed with Bitio.Rack, and the result is placed in the Value variable; Value is non-zero if the bit of interest in the rack is a 1. The program then shifts Bitio.Mask right one position; if Bitio.Mask reaches &H00, it's reset to &H80 so the next call reads a fresh byte. The function returns True if the bit of interest was a 1 and False if it was a 0. The ProgressDisplay routine works exactly like the one in the encoder.
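
InputBit, in sketch form (assumed shape; file #1 stands in for the compressed input file):

 Private Function InputBit() As Boolean 
    Dim Value As Byte 
    If Bitio.Mask = &H80 Then       ' rack exhausted: refill it 
       Get #1, , Bitio.Rack 
    End If 
    Value = Bitio.Rack And Bitio.Mask 
    Bitio.Mask = Bitio.Mask \ 2     ' shift to the next bit 
    If Bitio.Mask = 0 Then Bitio.Mask = &H80 
    InputBit = (Value <> 0)         ' True if the bit was a 1 
 End Function 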

Summary
The average compression produced by compress.exe is about 30 percent, which is in line with the figures I've seen for programs written in C. I've used it to compress text files, Word files, and Access files, but it should also work with binary, spreadsheet, and graphics files. The worst compression I got was with two Access databases greater than 1M in size. Huffman compression isn't a speed burner, because it reads the input file twice and does multiple file accesses for a single byte of data. Understanding, not speed, was my main concern when I wrote the programs. You can probably make them a lot faster by changing the way the programs buffer input and output. Error handling is the other obvious area where there's room for improvement. I've listed some resources for compression algorithms in the "Resources" sidebar. Have fun compressing and decompressing.

Download HUFF.exe

Wally Foulkes is a retired FAA software manager who worked on the program used to control air traffic and the diagnostic software for an IBM 3083 computer. He has also written COBOL report programs for a Tandem computer -- programs that were eventually rewritten in Basic! Wfoulkes@email.msn.com.


Sidebar: Dictionary Pair Compression

Rod Stephens

Algorithms such as the Huffman compression technique use patterns in the data to compress it. The most complicated parts of the Huffman algorithm are involved in identifying the most frequently used patterns in the data and in converting them into suitable codes.

Sometimes if you know something special about the data you're compressing, you can make this a little easier. For example, if you know the data represents English text, you automatically know a lot about it. You know that the five vowels -- a, e, i, o, and u -- occur fairly often; you know that r and s appear more often than q and z; you know that q is generally followed by u.

You also know that certain letter pairs are more common than others. You can take advantage of this fact to write a relatively simple compression algorithm.

Suppose the text to be compressed contains only ASCII characters in the range from <space> (ASCII 32) through ASCII 127. These 96 characters don't use up all of the 256 possible values you can store in one byte. There are 160 other values that are unused.

The idea behind this compression technique is to build a dictionary of 160 common letter pairs. As the program scans the message to compress, it tries to find pairs of letters in the dictionary. If it succeeds, the program stores the index of the pair in the compressed message. If the pair isn't in the dictionary, the program stores the first letter in the compressed message. The program continues processing the text until it has all been compressed.

If the program simply stored a character's ASCII code or a pair's index in the dictionary, there would be some overlap. To prevent that, the program maps these values into different values within the range 0 to 255, as shown in Figure 1a. ASCII codes are mapped to the beginning of the range (a character's code has 32 subtracted, so <space> becomes 0), and pair codes are mapped to the end (a pair's index is offset by 96).

To decompress the data, the program examines a byte value. If the value is 95 or smaller, it represents a character, so the program adds 32 to convert it back into an ASCII code. If the byte value is greater than 95, it's a pair code. The program uses it to look the pair up in the dictionary.

The CompressOrDecompress function shown in Listing 1a performs these conversions. It begins by initializing a dictionary string. This string is a concatenated list of 160 common letter pairs. The pairs begin with <space><space>, e<space>, <space>a, and so forth. The particular pairs in this list were chosen to work well with English text. If you change them, you'll probably reduce the effectiveness of the program on English text, though you might be able to improve compression for other kinds of text.

The program follows the method described previously to compress and decompress text. The only new twist is in the first character. The program sets the value of this character to 255 to indicate that text is compressed. Since uncompressed text must lie between <space> and ~, it can't include the ASCII value 255. The code uses this to determine whether the text it's passed needs compression or decompression.

Listing 1a. The CompressOrDecompress function does the bulk of the work. 
  
 ' Compress or decompress a text string. 
 Public Function _ 
    CompressOrDecompress(ByVal input_text As String) _ 
    As String 
 ' Working variables (the printed listing omits the Dims). 
 Dim dictionary As String 
 Dim output_text As String 
 Dim pos As Long 
 Dim ch_value As Integer 
 Dim pair As String 
 Dim dict_pos As Long 
 ' Initialize the dictionary. 
 dictionary = _ 
    "  e  as  tinthouerhet anreesr d " & _ 
    "onn or o i y wo tontyo. neisarte" & _ 
    "ed,  ctiy  bat snd fal pensestve" & _ 
    "ngitu talehaurllcousa  mf dfoof " & _ 
    "siril  hmeg om Icehironsasiossbe" & _ 
    "depe rli Tetel nicho lilprcactut" & _ 
    "Thpaeceachh wige ebuaisursulmawa" & _ 
    "otowtsmploI solyee Cunm rtieno S" & _ 
    "diwhs.rafincademe.irplk  ury Pwo" & _ 
    "acos gams,duayavucColamowe Aoopu" 
 ' If the first character has ASCII value 255, 
 ' this is a compressed string. 
 If Left$(input_text, 1) = Chr$(255) Then 
    ' Remove the flag character. 
    input_text = Mid$(input_text, 2) 
    output_text = ""    ' Start with a blank output. 
    For pos = 1 To Len(input_text)  ' Examine each character. 
       ch_value = Asc(Mid$(input_text, pos, 1)) 
       If ch_value > 95 Then 
          ' A pair code: the pair starts at dictionary 
          ' position (ch_value - 96) * 2 + 1. 
          output_text = output_text & _ 
             Mid$(dictionary, (ch_value - 96) * 2 + 1, 2) 
       Else 
          ' A single character's ASCII value minus 32. 
          output_text = output_text & Chr$(ch_value + 32) 
       End If 
    Next pos 
 Else 
    ' This is a normal string. Compress it. Start with 
    ' Chr$(255) to indicate a compressed string. 
    output_text = Chr$(255) 
    pos = 1 
    Do While pos <= Len(input_text) 
       ' Consider the two characters at pos and pos + 1. 
       pair = Mid$(input_text, pos, 2) 
       If Len(pair) < 2 Then 
          ' We've passed the end of the input string. Set 
          ' dict_pos = 0 to save this character unencoded. 
          dict_pos = 0 
       Else 
          ' Find this pair at an odd index in the dictionary. 
          dict_pos = InStr(dictionary, pair) 
          Do While dict_pos > 0 
             If dict_pos Mod 2 = 1 Then Exit Do 
             ' The pair is at an even position; keep looking. 
             dict_pos = InStr(dict_pos + 1, dictionary, pair) 
          Loop 
       End If 
       ' Add the pair's code or the first character to the output. 
       If dict_pos > 0 Then 
          ' The pair is in the dictionary. 
          output_text = output_text & _ 
             Chr$((dict_pos - 1) \ 2 + 96) 
          pos = pos + 2           ' Move past the pair. 
       Else 
          ' The pair isn't in the dictionary. Add the 
          ' first character to the output. 
          output_text = output_text & _ 
             Chr$(Asc(pair) - 32) 
          pos = pos + 1           ' Move past one character. 
       End If 
    Loop 
 End If 
 CompressOrDecompress = output_text 
 End Function
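
As a quick sanity check, something like this (with a hypothetical test string) round-trips the function:

 Dim original As String, packed As String 
 original = "this is a test of the compression routine" 
 packed = CompressOrDecompress(original) 
 Debug.Print Len(original), Len(packed)             ' compare the lengths 
 Debug.Print CompressOrDecompress(packed) = original  ' prints True 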
 


The Compress program, shown in Figure 2a and available in the accompanying Download file, demonstrates the CompressOrDecompress function. Enter some text and click the Compress button to make the program compress it. The program displays the length of the original and compressed text, and the percent reduction in size. It also uncompresses the compressed text to verify that the compression and decompression processes work properly. The program reduced the size of the text in Figure 2a by more than 34 percent.

Download COMPRESS.exe

Rod has written several Visual Basic books, including Ready-to-Run Visual Basic Algorithms, Second Edition, Bug Proofing Visual Basic, and Visual Basic Graphics Programming. Learn more about his books or download more than 300 example programs at http://www.vb-helper.com. RodStephens@vb-helper.com.


Sidebar: Resources