By Rod Stephens
When you tell someone your last name is Stephens, it's a toss-up whether they will spell it Stephens or Stevens. Usually it makes little difference. The Post Office delivers my junk mail whether it is addressed to Stephens, Stevens, or Occupant.
In a large database application, however, spelling is important. If you need to locate someone named Fineas in your customer support database, it makes a big difference whether you spell their name Fineas, Phineaus, or Finney S. Finding the right entry can be challenging.
Name recognition is an important problem in applications such as customer support hotlines, phone directories, and customer demographics files. It can also be a problem when the user must spell the names of things, rather than people. For example, an address location system must allow the user to find Main St., Mane St., and Mayne St.
Unless these programs can tell that these similar names really are similar, the user may need to enter several different spellings and check each one manually. That slows the user down and lengthens customer interactions, reducing productivity and possibly annoying the customer.
This article explains the Soundex system, a method for encoding names based on how they sound rather than how they are spelled. Soundex can make name lookup faster and less frustrating for both the user and the customer.
A Sound Beginning
There are several different variations on the Soundex theme. One of the simplest was developed for the United States Census. The census-style version encodes a name as a letter, followed by three digits. The steps for creating a census-style Soundex encoding are:
For example, to encode the name STEPHENS, you first remove the vowels and H to get STPNS. Next, you replace the characters with numeric codes, giving 23152. There are no adjacent duplicates so the value is unchanged in step 3. The final result is S315.
Now consider the other common spelling of this name: STEVENS. Removing the vowels gives STVNS. Replacing the characters with numeric codes gives 23152. This also contains no adjacent duplicates, so the value is unchanged in step 3. The final encoding is S315, as before.
This demonstrates the important property of Soundex systems: names that sound similar tend to have similar Soundex encodings. If a user types the name Rod Stevens, the program can convert the last name into the Soundex encoding S315. When it looks for entries encoded as S315, the program finds names spelled Stevens as well as those spelled Stephens. If the only choices are Rod Stephens and Mike Stevens, the program can probably pick the right entry. If there are too many choices for the program to easily decide which is right, it can present the user with a list and let the user pick the right one. FIGURE 2 shows a Visual Basic function that calculates census-style Soundex encodings.
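As a language-neutral illustration of the steps walked through above, here is a minimal Python sketch (it is not the Visual Basic function from FIGURE 2). The function name `census_soundex` is hypothetical. The sketch assumes the usual Soundex letter groups (B, F, P, V = 1; C, G, J, K, Q, S, X, Z = 2; D, T = 3; L = 4; M, N = 5; R = 6) and treats W and Y like the vowels and H. It does not pad short codes with zeros, so it reproduces the short codes quoted later in this article.

```python
# Letter-to-digit table; vowels and H, W, Y have no code (W and Y are an assumption).
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def census_soundex(name):
    """Encode a name as its first letter followed by up to three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    # Steps 1 and 2: drop the uncoded letters, then map the rest to digits.
    digits = [CODES[c] for c in letters if c in CODES]
    # Step 3: collapse runs of adjacent duplicate digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # The first letter is kept as a letter, so skip its digit if it has one.
    if letters[0] in CODES:
        collapsed = collapsed[1:]
    return letters[0] + "".join(collapsed[:3])


print(census_soundex("Stephens"))  # S315
print(census_soundex("Stevens"))   # S315
```

Other published Soundex variants differ in details such as zero padding and how they treat letters separated by a vowel, so codes from this sketch may not match theirs.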
Using Soundex
There are several ways you can use Soundex encodings in your applications. One very efficient method is to store Soundex encodings in a database. To locate a customer record, the program searches the database for the desired Soundex encoding. It can then examine the records for the closest match.
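The stored-encoding approach might look something like the following Python sketch, which uses the standard `sqlite3` module with an in-memory database. The table name, column names, index name, and the customer "Ann Beckers" are all hypothetical. The codes are entered as literals, taken from the examples in this article.

```python
import sqlite3

# In-memory database with the Soundex code stored next to each name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (first_name TEXT, last_name TEXT, last_code TEXT)"
)
# An index on the code column lets the search avoid scanning every record.
conn.execute("CREATE INDEX idx_last_code ON customers (last_code)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Rod", "Stephens", "S315"),
        ("Mike", "Stevens", "S315"),
        ("Ann", "Beckers", "B262"),  # hypothetical customer
    ],
)


def find_by_code(code):
    """Return (first_name, last_name) pairs whose last-name code matches."""
    cursor = conn.execute(
        "SELECT first_name, last_name FROM customers "
        "WHERE last_code = ? ORDER BY last_name, first_name",
        (code,),
    )
    return cursor.fetchall()


print(find_by_code("S315"))  # [('Rod', 'Stephens'), ('Mike', 'Stevens')]
```

The program would then examine the returned records for the closest match.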
For example, a program might assign scores to the records using the table shown in FIGURE 3. The higher the score, the better the match. The program can show the user a list of the names ordered by their scores.
FIGURE 4 shows Visual Basic code that computes matching scores. The function takes, as parameters, the target's first and last names and Soundex codes, and tests first and last names. It compares the test names to the target values and returns a score using the values shown in FIGURE 3.
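A scoring function of this shape could be sketched in Python as follows. The point values are placeholders chosen for illustration and are not the values from FIGURE 3. The `encode` parameter and the small `DEMO_CODES` table are also hypothetical, standing in for whichever Soundex function the program uses.

```python
# Placeholder weights (assumption): an exact match beats a Soundex-only match.
EXACT_POINTS = 2
SOUNDEX_POINTS = 1


def match_score(target_first, target_last, target_first_code, target_last_code,
                test_first, test_last, encode):
    """Score a test name against the target; a higher score is a better match."""

    def name_points(target, target_code, test):
        if test.upper() == target.upper():
            return EXACT_POINTS
        if encode(test) == target_code:
            return SOUNDEX_POINTS
        return 0

    return (name_points(target_first, target_first_code, test_first)
            + name_points(target_last, target_last_code, test_last))


# Stand-in encoder for the demo: a lookup table instead of a real Soundex function.
DEMO_CODES = {"ROD": "R3", "MIKE": "M2", "STEPHENS": "S315", "STEVENS": "S315"}


def demo_encode(name):
    return DEMO_CODES.get(name.upper(), "")


# The user typed "Rod Stevens"; score the two candidate records.
print(match_score("Rod", "Stevens", "R3", "S315", "Rod", "Stephens", demo_encode))  # 3
print(match_score("Rod", "Stevens", "R3", "S315", "Mike", "Stevens", demo_encode))  # 2
```

With these placeholder weights, Rod Stephens outranks Mike Stevens when the user types Rod Stevens.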
A less efficient approach is to examine all of the records in the database and calculate their Soundex encodings when the program needs them. It can then compare the encodings to those of the target names to see which matches the best.
The file, Soundex.xls (available for download from the Informant Web site; see end of article for details), is an Excel worksheet that demonstrates this technique. Open the file and invoke the MakeData macro to load the worksheet with some random data. Then, invoke the SelectName macro. At run time, the program presents the UserForm shown in FIGURE 5.
Enter a first and last name and click the Search button. The program examines all the names on the worksheet and compares their Soundex encodings to the encodings of the targets you entered. It presents a list of reasonable matches and highlights the name with the highest score. The code that performs this search is shown in FIGURE 6.
Integer Encoding
There are several variations on the basic Soundex algorithm described so far. A numeric encoding simply translates the four-character, census-style encoding into an integer. The integer encoding takes less storage space (2 bytes instead of 4), and operations on integers are faster than those on four-character strings.
You can convert a census-style Soundex code into an integer using the following formula:
integer code = (letter - A) * 1000 +
               (first digit) * 100 +
               (second digit) * 10 +
               (third digit)
FIGURE 7 shows a Visual Basic function that uses this formula to convert census-style strings into integer codes.
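The same formula can be sketched in Python. The function name `soundex_to_int` is hypothetical, and padding codes shorter than four characters with zeros is an assumption the formula itself does not state.

```python
import re


def soundex_to_int(code):
    """Convert a code such as 'S315' into (letter - A) * 1000 + three digits."""
    if not re.fullmatch(r"[A-Za-z][0-9]{0,3}", code):
        raise ValueError(f"not a census-style Soundex code: {code!r}")
    letter_index = ord(code[0].upper()) - ord("A")
    # Assumption: a short code such as 'P2' is padded with zeros to 'P200'.
    digits = code[1:].ljust(3, "0")
    return letter_index * 1000 + int(digits)


print(soundex_to_int("S315"))  # 18315
```

Because each digit is at most 6 and the letter index is at most 25, the largest value is 25666, which fits in a signed 2-byte integer.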
Extended Soundex
While the census-style system is reasonably simple, it has some disadvantages. The fact that it uses short codes means it cannot distinguish between long names that initially sound alike. For example, Beckers and Beckerson both have code B262. If a user enters Beckers into a program using this scheme, the program will decide that Beckerson is a likely match, even though it probably is not the name the user has in mind.
Even with short names the algorithm sometimes produces the same code for names that sound very different. For example, Paka and Pease both have the encoding P2, though the user is unlikely to misspell Paka as Pease.
Similarly, the algorithm sometimes gives very different codes for names that sound alike. Phaltzman has code P432 while Faultsman has code F432. If the user spells this name incorrectly, the program will probably not be able to guess that the other spelling is possible.
Extended Soundex systems use more than four characters to encode names. They also replace common combinations of letters with reasonable equivalents. For example, PH becomes F. Most of these methods also do not convert characters into numeric codes. This gives the program more information to use in telling different words apart. It also makes the resulting codes a little more meaningful to humans.
One extended Soundex algorithm uses the following rules:
Using these rules, the code for Beckers is BCKRS and the code for Beckerson is BCKRSN. The codes are still similar because the names are similar. The codes are different enough, however, for the program to tell the names apart.
The code for Paka is PK and the code for Pease is PS. These codes are different so the program will not claim that Pease might be a misspelling of Paka.
Finally, the codes for Phaltzman and Faultsman are both FLTSMN. These names sound alike and they have the same code. If the user types one, the program will correctly decide that the other might be an alternative spelling. FIGURE 8 shows Visual Basic code that computes this form of encoding.
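The Python sketch below is one set of rules that reproduces the codes quoted in this section. The rules are inferred from the worked examples only, so they are assumptions rather than the definitive rule list: replace PH with F, replace Z with S, keep the first letter, drop later vowels and H, W, Y, and collapse doubled letters. The function name `extended_soundex` is hypothetical.

```python
def extended_soundex(name):
    """Encode a name as letters, using rules inferred from the article's examples."""
    s = "".join(c for c in name.upper() if "A" <= c <= "Z")
    # Replace letter combinations with equivalents (inferred: PH -> F, Z -> S).
    s = s.replace("PH", "F").replace("Z", "S")
    if not s:
        return ""
    code = [s[0]]  # assumption: the first letter is always kept
    for c in s[1:]:
        if c in "AEIOUHWY":
            continue  # assumption: drop vowels and H, W, Y after the first letter
        if code[-1] != c:
            code.append(c)  # assumption: collapse adjacent repeated letters
    return "".join(code)


for name in ("Beckers", "Beckerson", "Paka", "Pease", "Phaltzman", "Faultsman"):
    print(name, extended_soundex(name))
```

Run on the six names above, the sketch yields BCKRS, BCKRSN, PK, PS, FLTSMN, and FLTSMN, matching the codes given in the text.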
Other Variations
The extended Soundex algorithm solves many of the shortcomings of the census-style algorithm, but it creates new problems of its own. Because extended Soundex changes PH to F, Stephens has code STFNS but Stevens has code STVNS, so the two spellings no longer match. Every Soundex system comes with its own collection of tradeoffs: each set of rules handles some names correctly and others incorrectly.
There are many other Soundex variations that use different sets of rules. Some change PH to F when it comes at the beginning of a word, and V otherwise. Others replace X with KS.
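These two substitutions can be sketched in Python as preprocessing steps applied before encoding. The function names are hypothetical, and Baxter is only an illustrative name.

```python
def ph_to_f_or_v(name):
    """Change PH to F at the start of a word and to V anywhere else."""
    s = name.upper()
    if s.startswith("PH"):
        s = "F" + s[2:]
    return s.replace("PH", "V")


def x_to_ks(name):
    """Replace X with KS."""
    return name.upper().replace("X", "KS")


print(ph_to_f_or_v("Stephens"))   # STEVENS
print(ph_to_f_or_v("Phaltzman"))  # FALTZMAN
print(x_to_ks("Baxter"))          # BAKSTER
```

Under the first rule, Stephens becomes STEVENS before encoding, so it receives the same code as Stevens.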
The metaphone algorithm uses an elaborate set of rules in an attempt to model the English phonetic rules. For example, it converts the letter C into:
Because metaphone is designed around English pronunciation, it does not always work well. It may have particular problems with non-English names. For more information on metaphone, visit http://www.intellex.net/~wcs/delphi/program.html.
Try It
The macros included in Soundex.xls make it easy to add Soundex to Microsoft Office applications. By adding Soundex to your applications, you can take much of the guesswork out of name lookups. You can make name searching faster and less exasperating for both users and customers.
Download source code for this article here.
Rod Stephens is the author of Custom Controls Library [John Wiley & Sons, 1998], Visual Basic Graphics Programming [John Wiley & Sons, 1997], and Visual Basic Algorithms [John Wiley & Sons, 1996]. Send him e-mail at RodStephens@vb-helper.com, or download Visual Basic examples from his Web site at http://www.vb-helper.com.