John Clarkson
Microsoft Corporation
November 1998
Summary: Describes how to build the HTML Tag Extractor tool for viewing and comparing HTML tags across large numbers of separate files using Microsoft Office 97. (14 printed pages)
If your job involves producing or maintaining HTML documents, from time to time you may want to view or compare the content of HTML tags across large numbers of separate files. This article, the first of two, describes how to build such a tool using Microsoft® Office 97 components entirely. The second article, "Creating an HTML Tag Extractor with Visual Basic 6.0," describes how to build a similar tool using Microsoft Visual Basic® version 6.0.
The Office version of the HTML Tag Extractor is packaged and delivered as a Microsoft Excel add-in. It is launched by selecting a custom menu item from the Excel Tools menu and allows a user to select a folder and choose between retrieving description metatags or title tags. The Microsoft Word FileSearch object then walks folders and provides file names. Word's Find object retrieves tag contents from each file. Finally, for each HTML file in the folder, the tool copies the file name and the content of the selected tag to an Excel 97 spreadsheet.
Option Explicit
Private Sub Workbook_AddinInstall()
'This procedure adds an item to the Tools menu on the Worksheet Menu Bar.
'This will be set to the Tools menu.
Dim objCmdBrPp As CommandBarPopup
'This will be set to the Project menu item.
Dim objCmdBtn As CommandBarButton
'Create an object variable referring to the Tools menu.
Set objCmdBrPp = Application.CommandBars("Worksheet Menu Bar").Controls("Tools")
'Add a command in the 4th position on the Tools menu.
On Error Resume Next
'This generates a run-time error if the menu item is absent.
Set objCmdBtn = objCmdBrPp.Controls("Report")
'If a run-time error is generated, add the menu item.
If Err.Number <> 0 Then
Set objCmdBtn = objCmdBrPp.Controls.Add _
(Type:=msoControlButton, Before:=4)
End If
'Disables the error handler.
On Error GoTo 0
'Set properties on the new menu item.
With objCmdBtn
.Caption = "TagXtractor"
.OnAction = "TagXtractor.ShowUserForm"
End With
End Sub
Private Sub Workbook_AddinUninstall()
'This procedure deletes an item on the Tools menu on the
'Worksheet Menu Bar.
'If the Project command exists, delete it.
On Error Resume Next
Application.CommandBars("Worksheet Menu Bar"). _
Controls("Tools").Controls("TagXtractor").Delete
End Sub
The preceding code is very similar to code discussed in the article "Creating an HTML Tag Extractor with Visual Basic 6.0."
Object | Property | Setting |
Standard module | Name | TagXtractor |
User form | Caption | MetaTag Xtractor |
User form | Name | usfTag |
Text box | Name | txtPath |
Label | Name | lblPath |
Label | Caption | Enter a Path |
Frame | Name | frmOptBtns |
Frame | Caption | Select a Tag |
First option button | Name | optDesc |
First option button | Caption | Description metatag |
Second option button | Name | optTitle |
Second option button | Caption | Title tag |
First command button | Name | cmdOK |
First command button | Caption | OK |
Second command button | Name | cmdCancel |
Second command button | Caption | Cancel |
Option Explicit
Private Sub cmdCancel_Click()
'Close the app.
End
End Sub
Private Sub cmdOK_Click()
'Call the main procedure.
Call Xtract
'Close the app.
End
End Sub
Option Explicit
Dim xlApp As Excel.Application
Dim objWorkbook As Workbook
Dim appWrd As Word.Application
Dim objDoc As Document
Dim objSln As Selection
Sub ShowUserForm()
'Called by the OnAction property in the ThisWorkbook module.
'This procedure displays the UserForm.
usfTag.Show
End Sub
Sub XlReport(strTag As String, intCounter As Integer)
'Called from FindText and Xtract.
'Data passed in provides the string to write to the worksheet;
'and which row to write to.
'This procedure copies strings into the xls.
With objWorkbook.Worksheets(1).Rows(intCounter + 3)
.Cells(, 1).Value = Application. _
FileSearch.FoundFiles(intCounter)
.Cells(, 2).Value = strTag
End With
End Sub
Function Tag(strSln As String) As String
'Called from FindText.
'Data passed in provides the string to work with.
'This procedure strips unneeded text from the beginning
'and end of the found HTML text
Dim lngTagSelection As Long
'Function call. What is the option button selection?
lngTagSelection = TagSelection
If lngTagSelection = -1 Then
'Clean up the title tag.
Tag = Mid(strSln, 8, (Len(strSln) - 17))
ElseIf lngTagSelection = 0 Then
'Clean up the description meta tag.
Tag = Mid(strSln, 34, (Len(strSln) - 36))
End If
End Function
Sub Xtract()
'Called from the Click event on the UserForm.
'This procedure copies the content of HTML tags to an .xls file.
'Initialize the Excel worksheet.
Call SetupWorksheet
'Open Word.
Set appWrd = New Word.Application
appWrd.Visible = True
Dim lngTagSelection As Long
'Function call. Which tag are we looking for?
lngTagSelection = TagSelection
Dim intCounter As Integer
Dim strFlNm As String
Dim strNotAFile As String
'Loop though files.
With Application.FileSearch
'User entry in the text box
'determines what folder to search in.
.LookIn = usfTag.txtPath.Text
.FileType = msoFileTypeAllFiles
.Execute
'Loop through all files in the folder.
For intCounter = 1 To .FoundFiles.Count
'Get a filename.
strFlNm = .FoundFiles(intCounter)
'If it's an HTML file, open the file and
'look for the tag.
If VerifyHTM(strFlNm) = True Then
Set objDoc = appWrd.Documents. _
Open(FileName:=strFlNm, Format:=wdOpenFormatText)
Call FindText(intCounter, lngTagSelection)
'Close the file and loop back.
objDoc.Close
'If it's not an HTML file, add a note to the report.
Else
strNotAFile = "not an HTML file"
Call XlReport(strNotAFile, intCounter)
End If
Next intCounter
End With
'We're done. Clean up object references.
Set xlApp = Nothing
Set objWorkbook = Nothing
Set appWrd = Nothing
Set objDoc = Nothing
Set objSln = Nothing
End Sub
Sub FindText(intCounter As Integer, lngTagSelection As Long)
'Called from Xtract.
'Data passed in indicates current row in the .xls,
'and which tag we're searching for.
'This procedure finds the text in the appropriate tag.
Set objSln = appWrd.Selection
objSln.Find.ClearFormatting
'Find either the title tag or description metatag.
With objSln.Find
If lngTagSelection = -1 Then
.Text = "<title>"
ElseIf lngTagSelection = 0 Then
.Text = "<META name=" & Chr$(34) & "description" & Chr$(34)
End If
.Forward = True
End With
Dim bolFound As Boolean
Dim strTag As String
'If the Find is successful, copy data to the .xls.
bolFound = objSln.Find.Execute
If bolFound = True Then
objSln.MoveDown unit:=wdParagraph, Extend:=wdExtend
Call XlReport(Tag(objSln.Text), intCounter)
'If the Find is not successful, add a note to the .xls.
Else
strTag = "no tag found"
Call XlReport(strTag, intCounter)
End If
End Sub
Function TagSelection() As Long
'Called from Tag and FindText.
'This procedure determines which option button is selected,
'indicating which tag to find.
If usfTag.optTitle.Value = -1 Then
'Find title tags
TagSelection = -1
ElseIf usfTag.optDesc.Value = -1 Then
'Find description metatags.
TagSelection = 0
End If
End Function
Sub SetupWorksheet()
'Called from Xtract.
'This procedure initializes the worksheet containing search results.
Set xlApp = Excel.Application
xlApp.Visible = True
Set objWorkbook = xlApp.Workbooks.Add
'Add header text.
objWorkbook.Worksheets(1).Cells(1, 1).Value = "Filename"
objWorkbook.Worksheets(1).Cells(1, 2).Value = "Tag"
Dim rngHeaders As Excel.Range
'Widen column 'a.'
Set rngHeaders = objWorkbook.Worksheets(1).Range("a1")
rngHeaders.ColumnWidth = 30
'Format the header text.
Set rngHeaders = objWorkbook.Worksheets(1).Range("a1:b1")
rngHeaders.Font.Bold = True
rngHeaders.Font.Color = vbRed
rngHeaders.Font.Size = 14
End Sub
Function VerifyHTM(strFlNm As String) As Boolean
'Called from Xtract.
'This procedure determines whether we're looking at an HTML file.
Dim strExtension3 As String
Dim strExtension4 As String
'File extension could be 3 or 4 characters.
strExtension3 = Right(strFlNm, 3)
strExtension4 = Right(strFlNm, 4)
'If it looks like an HTML file, return a TRUE.
If strExtension3 = "htm" Or strExtension3 = "HTM" _
Or strExtension4 = "html" Or strExtension4 = "HTML" Then
VerifyHTM = True
End If
End Function
The following sections provide some explanation on the preceding code.
With Office object models, use the Word FileSearch object to programmatically search folders and find files. With the FileSearch object, use the Lookin method to set the folder to search, the SearchSubFolders property to search subfolders, the FileType property to search for different types of files, and the FileName property to search for specific files.
The following code fragment shows the FileSearch object looping through the folder specified by the user in her entry in the text box on the UserForm, and then passing HTML file names to the Open method of the Word Document object:
With Application.FileSearch
.LookIn = usfTag.txtPath.Text
.FileType = msoFileTypeAllFiles
.Execute
For intCounter = 1 To .FoundFiles.Count
strFlNm = .FoundFiles(intCounter)
If VerifyHTM(strFlNm) = True Then
Set objDoc = appWrd.Documents. _
Open(FileName:=strFlNm, Format:=wdOpenFormatText)
Call FindText(intCounter, lngTagSelection)
'Close the file and loop back.
objDoc.Close
.
.
.
End If
Next intCounter
End With
The easiest way to get the FileSearch object to search a specific folder is to have the user input a folder path, in this case by entering a path in a text box. The Text property of the UserForm text box is passed to the Lookin method to determine the folder to search. The Execute method begins the search process. Note that if the Execute method is omitted, no search is performed.
Because the FileType property doesn't allow us to specifically search for HTML files, we set msoFileTypeAllFiles as a parameter in order to return everything in the folder, and then use the following Function procedure to check file extensions in order to determine whether the file is an HTML file before opening it. The Right function is used to set the values of two different variables, checking for both three– and four-letter file extensions, and then the text is checked for both uppercase and lowercase occurrences:
Function VerifyHTM(strFlNm As String) As Boolean
Dim strExtension3 As String
Dim strExtension4 As String
strExtension3 = Right(strFlNm, 3)
strExtension4 = Right(strFlNm, 4)
If strExtension3 = "htm" Or strExtension3 = "HTM" _
Or strExtension4 = "html" Or strExtension4 = "HTML" Then
VerifyHTM = True
End If
End Function
The worksheet is initialized by the SetupWorksheet procedure in the standard module. These statements create a new workbook and set a reference to the objWorkbook object variable:
Set xlApp = Excel.Application
xlApp.Visible = True
Set objWorkbook = xlApp.Workbooks.Add
These statements use the Value property of the Range object to add header text, and various properties of the Font object to format the text:
objWorkbook.Worksheets(1).Cells(1, 1).Value = "Filename"
objWorkbook.Worksheets(1).Cells(1, 2).Value = "Tag"
Set rngHeaders = objWorkbook.Worksheets(1).Range("a1:b1")
rngHeaders.Font.Bold = True
rngHeaders.Font.Color = vbRed
rngHeaders.Font.Size = 14
These statements redefine the Range object and use the ColumnWidth property to widen column A to an appropriate width to display file names:
Set rngHeaders = objWorkbook.Worksheets(1).Range("a1")
rngHeaders.ColumnWidth = 30
Tag text copied from the HTML files is copied to the worksheet in the XlReport procedure. This code uses the FoundFiles property of the FileSearch object to return the file name, and the Value property of the Range object to add the file name to the worksheet. Column values are passed directly to the Cells property, while row values are passed to the Rows property:
With objWorkbook.Worksheets(1).Rows(intCounter + 3)
.Cells(, 1).Value = Application.FileSearch.FoundFiles(intCounter)
.Cells(, 2).Value = strTag
End With
There are three issues in getting tag content out of the HTML files:
Our user chooses whether to search for title or description tags by selecting an option button on the UserForm. The following Function procedure returns either minus one if she chooses title tags, or zero if she chooses to retrieve description metatags:
Function TagSelection() As Long
If usfTag.optTitle.Value = -1 Then
TagSelection = -1
ElseIf usfTag.optDesc.Value = -1 Then
TagSelection = 0
End If
End Function
The return value from the TagSelection Function procedure is used in the following With...End With block to set the search text for the find operation.
With objSln.Find
If lngTagSelection = -1 Then
.Text = "<title>"
ElseIf lngTagSelection = 0 Then
.Text = "<META name=" & Chr$(34) & "description" & Chr$(34)
End If
.Forward = True
End With
The actual search is triggered by the Execute method of the Find object in the first statment to follow, and then evaluated in the following If...End If block. If the search is successful, the return value of the Tag Function procedure is passed to the XlReport procedure, where it's added to the Excel worksheet. The XlReport procedure is discussed earlier in this article in the section "Managing the Worksheet." The Tag Function procedure is discussed to follow in the section "Parsing Search Results." If the search is unsuccessful, appropriate text is passed to the XlReport procedure:
bolFound = objSln.Find.Execute
If bolFound = True Then
objSln.MoveDown unit:=wdParagraph, Extend:=wdExtend
Call XlReport(Tag(objSln.Text), intCounter)
Else
strTag = "no tag found"
Call XlReport(strTag, intCounter)
End If
The Tag Function procedure is used to strip away unnecessary text at the beginning and end of the string returned by the search. If we're searching for title tags, the Mid function removes the beginning 8 characters and trailing 9 characters from the string provided in the string variable strSln:
Tag = Mid(strSln, 8, (Len(strSln) - 17))
If we're searching for description metatags, the Mid function removes the beginning 34 characters and trailing 2 characters from the string provided in the string variable strSln:
Tag = Mid(strSln, 34, (Len(strSln) - 36))
This article provided step-by-step procedures, complete sample code, and commentary explaining the code, for a sample solution using the Word 97 and Excel 97 object models to copy the content of HTML tags to an Excel worksheet.
The same solution could easily be modified to automate retrieving other text elements from HTML files, for example hyperlinks and keywords. This solution could also be extended to perform project management tasks such as comparing the file inventory listed in a database or worksheet against what's actually stored in the folders on a Web server.