Regular Expressions

Mike McMillan

Thanks to the "regular expression" pattern-matching engine introduced in VBScript version 5, VB programmers finally have the ability to do powerful text processing that programmers in languages such as Perl and the various UNIX shells have enjoyed for years. In this article, Mike McMillan shows you how to use the VBScript regular expression engine to find patterns in text–and you’ll probably learn things that will shed some light on search engines, including the special text search engines such as MSSEARCH that work with today’s SQL databases.

A regular expression is a series of characters that defines a pattern. The pattern is then compared to a target string to see whether any part of the target string matches it. Many times, the characters in a pattern will simply match themselves in the target string, as when you look for all occurrences of the pattern "the" in "the quick brown fox chased the lazy dog." You can also use special characters, called metacharacters, to indicate character positioning, grouping, and repetition. Most of you are probably familiar with the use of the character "*" (asterisk) as a wildcard when doing a directory search in DOS. The "*" is also a regular expression metacharacter, although its meaning there is different: instead of standing for any run of characters, it repeats whatever precedes it.

You can also search for sets of characters using regular expressions. For example, the regular expression "[a-c]" will match any "a", "b", or "c" in the target string.

The regular expression engine in VBScript includes several special metacharacters and sequences to allow you to do more complex pattern matching, including the following:

• The character "^" stands for the beginning of a string, so "^i" will match "is" but not "mi". "$" indicates a match at the end of a string, so "i$" will match "mi" but not "is".

• The character "*" matches the preceding character zero or more times, so the regular expression "fo*" matches both "f" and "foo". "+" matches the preceding character one or more times, so "fo+" matches "foo" but not "f". The question mark ("?") matches the preceding character zero or just one time, meaning "a?ve?" matches the "ve" in "never".

• The period (".") matches any single character except the newline character, so "a.b" matches "aab" and "a3b", but not "ab". The bar "|" is used for alternative matching, as in "a|b", which will match either "a" or "b".

• The expression "{n}" matches the preceding character exactly n times. For example, "e{2}" will match "feed" but not "fed". The expression "{n,m}" matches the preceding character at least n times but not more than m times. "o{1,3}" will match all the o’s in "food" or "sod", but a single match covers only the first three o’s in "soooooie".

Brackets are used to express character and digit sets and ranges. For example, the expression "[abcd]" will match any of the enclosed characters in the target string. The whole lower-case alphabet can be expressed using "[a-z]". To include both upper- and lower-case letters in the regular expression, write the expression like this: "[a-zA-Z]". To search for digits in a string, use "[0-9]".

Negative character and digit sets and ranges can also be expressed. "[^abc]" will match any character other than "a", "b", or "c". You can also write an expression for a negative range, such as "[^m-q]", which matches any character outside the range "m" through "q".
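
One way to check several of the patterns described above is sketched below. It relies on the RegExp object and its Test method, both covered later in this article. Describe is a hypothetical helper written for this illustration, not part of VBScript, and the sample strings are chosen only to exercise each pattern.

Dim report
report = ""

' Describe is a hypothetical helper, not part of VBScript.
' It records whether ptrn finds a match anywhere in strng.
Sub Describe(ptrn, strng)
   Dim regex
   Set regex = New RegExp
   regex.Pattern = ptrn
   report = report & "'" & ptrn & "' in '" & strng & "': "
   report = report & regex.Test(strng) & vbCrLf
End Sub

Describe "^i", "is"        ' anchored at the beginning of the string
Describe "i$", "mi"        ' anchored at the end of the string
Describe "fo+", "f"        ' needs at least one "o" after the "f"
Describe "e{2}", "fed"     ' needs two consecutive "e" characters
Describe "[^abc]", "abc"   ' needs a character other than a, b, or c
MsgBox report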

Using the regular expression engine

Starting with VBScript version 5, there’s a RegExp object that encapsulates the methods and properties needed to do regular expression pattern matching. You set up the prerequisites for pattern matching by first creating a RegExp object and supplying a pattern and a target string. To actually perform pattern matching, you call the Execute method, which returns a Matches collection, the contents of which are the matches made by the regular expression engine. If no matches are found, an empty Matches collection is returned.

Now that the preliminaries are over with, we can start working with regular expressions and exploring the other methods and properties of the RegExp object. The following code fragment gets things started:

Dim regex, match, matches, ptrn, strng, retstr
ptrn = "as."
strng = "as1 As2 aS3 AS4"
Set regex = New RegExp
regex.Pattern = ptrn

Before a regular expression can be executed, the Pattern property of the RegExp object has to be assigned a value. In this example, we’re looking for occurrences of "as" followed by any single character in the target string. This pattern is assigned to the Pattern property. Other properties we might want to consider setting before executing the pattern match include the IgnoreCase property and the Global property. Setting IgnoreCase to True will cause the regular expression engine to perform matches regardless of case, while setting the Global property to True will cause the whole string to be searched, rather than having the engine stop after finding the first occurrence of the pattern. The default value for both of these properties is False, so you’ll need to explicitly set them to True. (Note: This is in direct contradiction to the VBScript documentation, which states that the default values for IgnoreCase and Global are True.) Here’s the code fragment that sets these properties:

regex.IgnoreCase = True
regex.Global = True

Once you’ve set all of the properties of the RegExp object, you’re ready to execute the regular expression engine using the Execute method. Since this method returns a collection, you have to assign its return value to a variable. Here’s the code fragment:

Set matches = regex.Execute(strng)

When this statement is executed, the pattern is compared to the string and any matches are returned. You can then pull each match out of the collection to examine them. Here’s a code fragment for doing that with our example:

For Each match In matches
   retstr = retstr & "Match found at position "
   retstr = retstr & match.FirstIndex
   retstr = retstr & ". Match value is '"
   retstr = retstr & match.Value & "'." & vbCrLf
Next
MsgBox retstr

Putting these code fragments together and running the code results in this string:

Match found at position 0. Match value is 'as1'.
Match found at position 4. Match value is 'As2'.
Match found at position 8. Match value is 'aS3'.
Match found at position 12. Match value is 'AS4'.

When the regular expression engine is executed, the matches are placed into the matches collection. The position of where a match is found is stored in the FirstIndex property. This property stores the zero-based offset that indicates where each match is found. That’s why the preceding output indicates that the first match was found at position 0. The matched string that’s found by the regular expression engine is stored in the Value property. You can use this property to examine what strings are returned by the regular expression engine.
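
Because FirstIndex is zero-based while VBScript’s string functions count from 1, it’s easy to be off by one when going back to the original string. The following sketch shows one way to reconcile the two. It assumes the Length property of the Match object and the Count property of the Matches collection, and it reuses the sample string from the example above.

Dim regex, match, matches, strng, report
strng = "as1 As2 aS3 AS4"
Set regex = New RegExp
regex.Pattern = "as."
regex.IgnoreCase = True
regex.Global = True
Set matches = regex.Execute(strng)
report = matches.Count & " matches found" & vbCrLf
For Each match In matches
   ' FirstIndex is zero-based, but Mid counts from 1, so add 1.
   report = report & Mid(strng, match.FirstIndex + 1, match.Length)
   report = report & " (length " & match.Length & ")" & vbCrLf
Next
MsgBox report

Each piece pulled out with Mid should be the same text that the Value property returns for that match.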

Performing text substitutions

One of the most common tasks in text processing is text substitution, where you want to replace substring1 with substring2. A regular expression engine does this by storing substring1 as the pattern and putting substring2 in its place whenever it matches substring1. The RegExp object has a method for performing text substitutions–the Replace method.

The Replace method takes two arguments: the string on which to perform the substitutions and the replacement substring. The substring that you’re searching for is stored in the Pattern property of the RegExp object before the Replace method is called. The following code performs a text substitution for a single occurrence of a word:

Dim regex, ptrn, repstr, str1
ptrn = "men"
repstr = "people"
str1 = "Now is the time for all good men to come"
str1 = str1 & " to the aid of their party"
set regex = New RegExp
regex.Pattern = ptrn
str1 = regex.Replace(str1, repstr)
MsgBox str1

In this example, the word "men" is replaced with the word "people." A variation of this operation is to perform multiple substitutions of a substring with another substring. To do this, you have to set the Global property to True so that each occurrence of the found string is replaced. Here’s an example:

Dim regex, ptrn, repstr, str1
ptrn = "we"
repstr = "they"
str1 = "what we know we know we know"
set regex = New RegExp
regex.Pattern = ptrn
regex.Global = True
str1 = regex.Replace(str1, repstr)
MsgBox str1

str1 becomes "what they know they know they know."

Testing for a match

There will be situations when you want to determine that a match will be found with a regular expression before any further processing takes place. You can do this by calling the Test method. This method is similar to using the Execute method, except you can only use it to determine whether a match will be found; any processing you want to do with the results will have to be performed using the Execute method. Here’s how the Test method can be used:

Dim regex, ptrn, repstr, str1
ptrn = "we"
repstr = "they"
str1 = "what we know we know we know"
set regex = New RegExp
regex.Pattern = ptrn
If regex.Test(str1) Then
   regex.Global = True
   str1 = regex.Replace(str1, repstr)
   MsgBox str1
End If
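
Test is also handy for simple validation when it’s combined with the "^" and "$" anchors described earlier. The following is a minimal sketch that checks whether a string consists entirely of digits; the variable name inputStr and its sample value are chosen only for this illustration.

Dim regex, inputStr
inputStr = "12345"   ' sample value chosen for this illustration
Set regex = New RegExp
' "^" and "$" force the digits to span the whole string.
regex.Pattern = "^[0-9]+$"
If regex.Test(inputStr) Then
   MsgBox "'" & inputStr & "' contains only digits"
Else
   MsgBox "'" & inputStr & "' contains something other than digits"
End If

Without the anchors, the pattern would succeed on any string that contains at least one digit somewhere.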

Miscellaneous regular expressions

The RegExp object includes several pattern operators that I haven’t covered so far. One is the "\s" operator, or the white space operator. This operator matches any single white space character, such as a space, tab, or newline. One interesting use of this operator is to count the words in a string. If you define a string as being a sequence of words separated by one or more characters of white space, you can easily write a regular expression to count the words in a string. Here’s one example:

Dim regex, ptrn, strng, wc, matches
Set regex = New RegExp
' "\s+" treats a run of white space as a single separator.
ptrn = "\s+"
strng = "This string is six words long"
regex.Pattern = ptrn
regex.Global = True
Set matches = regex.Execute(strng)
wc = matches.Count
MsgBox "Found " & wc + 1 & " word matches"

The regular expression engine matches the white space found in the string. The total number of matches is stored in the Count property of the matches object. This number plus one equals the number of words in the string, assuming that there’s no white space at the beginning or end of the string. You could test for white space at the end of the string by testing for this pattern: "\s$".
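
Counting separators depends on the string having no leading or trailing white space. A different approach, sketched below, counts the words themselves by matching runs of non-white-space characters. It assumes the "\S" operator (the complement of "\s"), and the sample string is padded with extra spaces purely to illustrate the point.

Dim regex, matches, strng
strng = "  This string   is six words long  "
Set regex = New RegExp
regex.Pattern = "\S+"   ' one or more non-white-space characters
regex.Global = True
Set matches = regex.Execute(strng)
' Each match is one word, so no "plus one" adjustment is needed.
MsgBox "Found " & matches.Count & " words"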

Another interesting operator is the grouping operator ("()"). When part of a regular expression is enclosed within parentheses, the regular expression engine treats it as a unit and "remembers" the text it matched. Each complete match is still retrieved from the Matches collection through its Item property, which is indexed from zero. For example, the following code will return the third match found by the regular expression engine:

Dim regex, ptrn, strng, matches, retstr
Set regex = New RegExp
ptrn = "(sh)"
strng = "she shucks sea shells by the sea shore"
regex.Pattern = ptrn
regex.Global = True
Set matches = regex.Execute(strng)
' Item is zero-based, so index 2 is the third match.
retstr = matches.Item(2).Value
MsgBox retstr
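
Parentheses are most useful when they group alternatives or a repeated unit. As a rough sketch, the following code groups two alternatives after an "s" and then walks the Matches collection by index, using the Count and Item properties. The pattern is chosen only for this illustration.

Dim regex, matches, i, report
report = ""
Set regex = New RegExp
' The parentheses limit the alternation to "h" or "e" after the "s".
regex.Pattern = "s(h|e)"
regex.Global = True
Set matches = regex.Execute("she shucks sea shells by the sea shore")
' Valid indexes run from 0 through Count - 1.
For i = 0 To matches.Count - 1
   report = report & i & ": " & matches.Item(i).Value
   report = report & " at position " & matches.Item(i).FirstIndex & vbCrLf
Next
MsgBox report

Looping to Count - 1 avoids asking Item for an index that doesn’t exist when the string contains fewer matches than expected.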

Summary

This article has barely scratched the surface of what you can do with regular expressions. While many people view regular expressions as not much more than a collection of pattern matching operators, they’re really a language in their own right and can be used to express some very complex patterns that would be hard, if not downright impossible, to express in a language like VBScript. Microsoft’s decision to incorporate a regular expression engine into VBScript represents, in my opinion, a decision to make VBScript into a serious scripting language that can begin to hold its own against a more mature scripting language like Perl. The combination of VBScript and Windows Scripting Host gives programmers a powerful tool for scripting in the 32-bit Windows environment.


Mike McMillan is an instructor of computer information systems at Pulaski Technical College in North Little Rock, AR. He is currently writing a book on VBScript and Windows Scripting Host for Prentice-Hall. Mike now has almost 20 years of programming experience in Basic and its various dialects, having started out with TRS-DOS Basic on the Radio Shack Model II while he was an undergraduate at the University of Arkansas. mmcm@swbell.net.