Creating an HTML Tag Extractor

John Clarkson
Microsoft Corporation

November 1998

Summary: Describes how to build the HTML Tag Extractor tool for viewing and comparing HTML tags across large numbers of separate files using Microsoft Office 97. (14 printed pages)

If your job involves producing or maintaining HTML documents, from time to time you may want to view or compare the content of HTML tags across large numbers of separate files. This article, the first of two, describes how to build such a tool using Microsoft® Office 97 components entirely. The second article, "Creating an HTML Tag Extractor with Visual Basic 6.0," describes how to build a similar tool using Microsoft Visual Basic® version 6.0.

The Office version of the HTML Tag Extractor is packaged and delivered as a Microsoft Excel add-in. It is launched by selecting a custom menu item from the Excel Tools menu and allows a user to select a folder and choose between retrieving description metatags or title tags. The Microsoft Word FileSearch object then walks folders and provides file names. Word's Find object retrieves tag contents from each file. Finally, for each HTML file in the folder, the tool copies the file name and the content of the selected tag to an Excel 97 spreadsheet.

Build the Tool

Add code to the ThisWorkBook module

  1. In Excel, open the Visual Basic Editor.

  2. In the Visual Basic Editor on the Tools menu, click References.

  3. In the References dialog box, select Word 8.0 Object Library, and then click OK.

  4. Copy the following statements and procedures to the ThisWorkBook module:
    Option Explicit
    
    Private Sub Workbook_AddinInstall()
    'This procedure adds an item to the Tools menu on the Worksheet Menu Bar.
    
        'This will be set to the Tools menu.
        Dim objCmdBrPp As CommandBarPopup
        'This will be set to the Project menu item.
        Dim objCmdBtn As CommandBarButton
        
    'Create an object variable referring to the Tools menu.
        Set objCmdBrPp = Application.CommandBars("Worksheet Menu Bar").Controls("Tools")
            
    'Add a command in the 4th position on the Tools menu.
        On Error Resume Next
        'This generates a run-time error if the menu item is absent.
        Set objCmdBtn = objCmdBrPp.Controls("Report")
        'If a run-time error is generated, add the menu item.
        If Err.Number <> 0 Then
            Set objCmdBtn = objCmdBrPp.Controls.Add _
            (Type:=msoControlButton, Before:=4)
        End If
        'Disables the error handler.
        On Error GoTo 0
        
    'Set properties on the new menu item.
        With objCmdBtn
            .Caption = "TagXtractor"
            .OnAction = "TagXtractor.ShowUserForm"
        End With
    
    End Sub
    
    
    Private Sub Workbook_AddinUninstall()
    'This procedure deletes an item on the Tools menu on the
    'Worksheet Menu Bar.
    
        'If the Project command exists, delete it.
        On Error Resume Next
        Application.CommandBars("Worksheet Menu Bar"). _
        Controls("Tools").Controls("TagXtractor").Delete
    End Sub
    

The preceding code is very similar to code discussed in the article "Creating an HTML Tag Extractor with Visual Basic 6.0."

Add modules and controls

  1. On the Visual Basic Editor Insert menu, click UserForm.

  2. On the Insert menu, click Module.

  3. If the toolbox is not already visible, click the Toolbox icon on the toolbar.

  4. Draw the following controls on the UserForm:
  5. Draw two OptionButton controls on the Frame control.

  6. Arrange the controls as shown in the illustration to follow.

  7. Set Name and Caption properties for the two modules and the controls on the UserForm, using the values listed in the following table.
Object Property Setting
Standard module Name TagXtractor
User form Caption MetaTag Xtractor
User form Name usfTag
Text box Name txtPath
Label Name lblPath
Label Caption Enter a Path
Frame Name frmOptBtns
Frame Caption Select a Tag
First option button Name optDesc
First option button Caption Description metatag
Second option button Name optTitle
Second option button Caption Title tag
First command button Name cmdOK
First command button Caption OK
Second command button Name cmdCancel
Second command button Caption Cancel

Add code to the UserForm and standard module

  1. Copy the following statements and procedures to the UserForm:
    Option Explicit
    
    Private Sub cmdCancel_Click()
        
        'Close the app.
        End
    
    End Sub
    
    
    Private Sub cmdOK_Click()
        
        'Call the main procedure.
        Call Xtract
        
        'Close the app.
        End
        
    End Sub
    
  2. Copy the following statements and procedures to the standard module:
    Option Explicit
    
    Dim xlApp As Excel.Application
    Dim objWorkbook As Workbook
    Dim appWrd As Word.Application
    Dim objDoc As Document
    Dim objSln As Selection
    
    Sub ShowUserForm()
    'Called by the OnAction property in the ThisWorkbook module.
    'This procedure displays the UserForm.
       
        usfTag.Show
        
    End Sub
    
    
    Sub XlReport(strTag As String, intCounter As Integer)
    'Called from FindText and Xtract.
    'Data passed in provides the string to write to the worksheet;
    'and which row to write to.
    'This procedure copies strings into the xls.
        
        With objWorkbook.Worksheets(1).Rows(intCounter + 3)
            .Cells(, 1).Value = Application. _
            FileSearch.FoundFiles(intCounter)
            .Cells(, 2).Value = strTag
        End With
        
    End Sub
    
    
    Function Tag(strSln As String) As String
    'Called from FindText.
    'Data passed in provides the string to work with.
    'This procedure strips unneeded text from the beginning
    'and end of the found HTML text
    
        Dim lngTagSelection As Long
    
        'Function call. What is the option button selection?
        lngTagSelection = TagSelection
        
        If lngTagSelection = -1 Then
                'Clean up the title tag.
                Tag = Mid(strSln, 8, (Len(strSln) - 17))
                
        ElseIf lngTagSelection = 0 Then
                'Clean up the description meta tag.
                Tag = Mid(strSln, 34, (Len(strSln) - 36))
                
        End If
        
    End Function
    
    
    Sub Xtract()
    'Called from the Click event on the UserForm.
    'This procedure copies the content of HTML tags to an .xls file.
        
        'Initialize the Excel worksheet.
        Call SetupWorksheet
        
        'Open Word.
        Set appWrd = New Word.Application
        appWrd.Visible = True
    
        Dim lngTagSelection As Long
        
        'Function call. Which tag are we looking for?
        lngTagSelection = TagSelection
        
        Dim intCounter As Integer
        Dim strFlNm As String
        Dim strNotAFile As String
            
        'Loop though files.
        With Application.FileSearch
            'User entry in the text box
            'determines what folder to search in.
            .LookIn = usfTag.txtPath.Text
            .FileType = msoFileTypeAllFiles
            .Execute
            'Loop through all files in the folder.
            For intCounter = 1 To .FoundFiles.Count
                'Get a filename.
                strFlNm = .FoundFiles(intCounter)
                
                'If it's an HTML file, open the file and
                'look for the tag.
                If VerifyHTM(strFlNm) = True Then
                    Set objDoc = appWrd.Documents. _
                    Open(FileName:=strFlNm, Format:=wdOpenFormatText)
                    Call FindText(intCounter, lngTagSelection)
                    'Close the file and loop back.
                    objDoc.Close
                
                'If it's not an HTML file, add a note to the report.
                Else
                    strNotAFile = "not an HTML file"
                    Call XlReport(strNotAFile, intCounter)
                End If
    
            Next intCounter
        End With
        
        'We're done. Clean up object references.
        Set xlApp = Nothing
        Set objWorkbook = Nothing
        Set appWrd = Nothing
        Set objDoc = Nothing
        Set objSln = Nothing
        
    End Sub
    
    
    Sub FindText(intCounter As Integer, lngTagSelection As Long)
    'Called from Xtract.
    'Data passed in indicates current row in the .xls,
    'and which tag we're searching for.
    'This procedure finds the text in the appropriate tag.
    
    
        Set objSln = appWrd.Selection
        objSln.Find.ClearFormatting
        
        'Find either the title tag or description metatag.
        With objSln.Find
            If lngTagSelection = -1 Then
                .Text = "<title>"
            ElseIf lngTagSelection = 0 Then
                .Text = "<META name=" & Chr$(34) & "description" & Chr$(34)
            End If
            .Forward = True
        End With
        
        Dim bolFound As Boolean
        Dim strTag As String
        
        'If the Find is successful, copy data to the .xls.
        bolFound = objSln.Find.Execute
        If bolFound = True Then
            objSln.MoveDown unit:=wdParagraph, Extend:=wdExtend
            Call XlReport(Tag(objSln.Text), intCounter)
        'If the Find is not successful, add a note to the .xls.
        Else
            strTag = "no tag found"
            Call XlReport(strTag, intCounter)
        End If
        
    End Sub
    
    
    Function TagSelection() As Long
    'Called from Tag and FindText.
    'This procedure determines which option button is selected,
    'indicating which tag to find.
    
        If usfTag.optTitle.Value = -1 Then
            'Find title tags
            TagSelection = -1
        ElseIf usfTag.optDesc.Value = -1 Then
            'Find description metatags.
            TagSelection = 0
        End If
        
    End Function
    
    
    Sub SetupWorksheet()
    'Called from Xtract.
    'This procedure initializes the worksheet containing search results.
        
        Set xlApp = Excel.Application
        xlApp.Visible = True
        Set objWorkbook = xlApp.Workbooks.Add
        
        'Add header text.
        objWorkbook.Worksheets(1).Cells(1, 1).Value = "Filename"
        objWorkbook.Worksheets(1).Cells(1, 2).Value = "Tag"
        
        Dim rngHeaders As Excel.Range
        
        'Widen column 'a.'
        Set rngHeaders = objWorkbook.Worksheets(1).Range("a1")
        rngHeaders.ColumnWidth = 30
        
        'Format the header text.
        Set rngHeaders = objWorkbook.Worksheets(1).Range("a1:b1")
        rngHeaders.Font.Bold = True
        rngHeaders.Font.Color = vbRed
        rngHeaders.Font.Size = 14
            
    End Sub
    
    
    Function VerifyHTM(strFlNm As String) As Boolean
    'Called from Xtract.
    'This procedure determines whether we're looking at an HTML file.
    
        Dim strExtension3 As String
        Dim strExtension4 As String
    
        'File extension could be 3 or 4 characters.
        strExtension3 = Right(strFlNm, 3)
        strExtension4 = Right(strFlNm, 4)
        
        'If it looks like an HTML file, return a TRUE.
        If strExtension3 = "htm" Or strExtension3 = "HTM" _
        Or strExtension4 = "html" Or strExtension4 = "HTML" Then
            VerifyHTM = True
        End If
    
    End Function
    

Load and run the add-in

  1. In Excel on the File menu, select Properties. Feel free to add your own text here, but a good choice for title is "TagXtractor" and "HTML Tag Extractor" for comments. The title identifies an add-in in Add-In Manager. The comment is displayed when an add-in is selected in Add-In Manager. Click OK.

  2. Save your project to a working folder as TagXtract.xls.

  3. On the File menu, select Save As. Set the file type to Microsoft Excel Add-In. Select the Program Files\Microsoft Office\Office\Library folder. Enter TagXtract as the file name. Click Save.

  4. Close and then restart Excel. This may not be necessary, but I find it very useful in keeping things straight.

  5. On the Tools menu, select Add-Ins. In Add-In Manager, check the box indicating the TagXtractor add-in and click OK.

  6. On the Tools menu, select TagXtractor.

  7. The UserForm displays. Select one of the option buttons, and enter a path in the text box. One useful shortcut is to do a copy and paste from the Address list box in Windows Explorer to the text box in TagXtractor. Click OK.

Examining the Code

The following sections provide some explanation on the preceding code.

Finding Files

With Office object models, use the Word FileSearch object to programmatically search folders and find files. With the FileSearch object, use the Lookin method to set the folder to search, the SearchSubFolders property to search subfolders, the FileType property to search for different types of files, and the FileName property to search for specific files.

The following code fragment shows the FileSearch object looping through the folder specified by the user in her entry in the text box on the UserForm, and then passing HTML file names to the Open method of the Word Document object:

    With Application.FileSearch
        .LookIn = usfTag.txtPath.Text
        .FileType = msoFileTypeAllFiles
        .Execute
        For intCounter = 1 To .FoundFiles.Count
            strFlNm = .FoundFiles(intCounter)
            If VerifyHTM(strFlNm) = True Then
                Set objDoc = appWrd.Documents. _
                Open(FileName:=strFlNm, Format:=wdOpenFormatText)
                Call FindText(intCounter, lngTagSelection)
                'Close the file and loop back.
                objDoc.Close
           .
           .
           .
            End If
        Next intCounter
    End With

The easiest way to get the FileSearch object to search a specific folder is to have the user input a folder path, in this case by entering a path in a text box. The Text property of the UserForm text box is passed to the Lookin method to determine the folder to search. The Execute method begins the search process. Note that if the Execute method is omitted, no search is performed.

Checking File Extensions

Because the FileType property doesn't allow us to specifically search for HTML files, we set msoFileTypeAllFiles as a parameter in order to return everything in the folder, and then use the following Function procedure to check file extensions in order to determine whether the file is an HTML file before opening it. The Right function is used to set the values of two different variables, checking for both three– and four-letter file extensions, and then the text is checked for both uppercase and lowercase occurrences:

Function VerifyHTM(strFlNm As String) As Boolean

    Dim strExtension3 As String
    Dim strExtension4 As String

    strExtension3 = Right(strFlNm, 3)
    strExtension4 = Right(strFlNm, 4)
    
    If strExtension3 = "htm" Or strExtension3 = "HTM" _
    Or strExtension4 = "html" Or strExtension4 = "HTML" Then
        VerifyHTM = True
    End If

End Function

Managing the Worksheet

The worksheet is initialized by the SetupWorksheet procedure in the standard module. These statements create a new workbook and set a reference to the objWorkbook object variable:

    Set xlApp = Excel.Application
    xlApp.Visible = True
    Set objWorkbook = xlApp.Workbooks.Add

These statements use the Value property of the Range object to add header text, and various properties of the Font object to format the text:

    objWorkbook.Worksheets(1).Cells(1, 1).Value = "Filename"
    objWorkbook.Worksheets(1).Cells(1, 2).Value = "Tag"
    
    Set rngHeaders = objWorkbook.Worksheets(1).Range("a1:b1")
    rngHeaders.Font.Bold = True
    rngHeaders.Font.Color = vbRed
    rngHeaders.Font.Size = 14

These statements redefine the Range object and use the ColumnWidth property to widen column A to an appropriate width to display file names:

    Set rngHeaders = objWorkbook.Worksheets(1).Range("a1")
    rngHeaders.ColumnWidth = 30

Tag text copied from the HTML files is copied to the worksheet in the XlReport procedure. This code uses the FoundFiles property of the FileSearch object to return the file name, and the Value property of the Range object to add the file name to the worksheet. Column values are passed directly to the Cells property, while row values are passed to the Rows property:

    With objWorkbook.Worksheets(1).Rows(intCounter + 3)
        .Cells(, 1).Value = Application.FileSearch.FoundFiles(intCounter)
        .Cells(, 2).Value = strTag
    End With

Finding Text

There are three issues in getting tag content out of the HTML files:

Title tag or description metatag?

Our user chooses whether to search for title or description tags by selecting an option button on the UserForm. The following Function procedure returns either minus one if she chooses title tags, or zero if she chooses to retrieve description metatags:

Function TagSelection() As Long

    If usfTag.optTitle.Value = -1 Then
        TagSelection = -1
    ElseIf usfTag.optDesc.Value = -1 Then
        TagSelection = 0
    End If
    
End Function

The return value from the TagSelection Function procedure is used in the following With...End With block to set the search text for the find operation.
    With objSln.Find
        If lngTagSelection = -1 Then
            .Text = "<title>"
        ElseIf lngTagSelection = 0 Then
            .Text = "<META name=" & Chr$(34) & "description" & Chr$(34)
        End If
        .Forward = True
    End With

Running the search

The actual search is triggered by the Execute method of the Find object in the first statment to follow, and then evaluated in the following If...End If block. If the search is successful, the return value of the Tag Function procedure is passed to the XlReport procedure, where it's added to the Excel worksheet. The XlReport procedure is discussed earlier in this article in the section "Managing the Worksheet." The Tag Function procedure is discussed to follow in the section "Parsing Search Results." If the search is unsuccessful, appropriate text is passed to the XlReport procedure:

    bolFound = objSln.Find.Execute
    If bolFound = True Then
        objSln.MoveDown unit:=wdParagraph, Extend:=wdExtend
        Call XlReport(Tag(objSln.Text), intCounter)
    Else
        strTag = "no tag found"
        Call XlReport(strTag, intCounter)
    End If

Parsing search results

The Tag Function procedure is used to strip away unnecessary text at the beginning and end of the string returned by the search. If we're searching for title tags, the Mid function removes the beginning 8 characters and trailing 9 characters from the string provided in the string variable strSln:

Tag = Mid(strSln, 8, (Len(strSln) - 17))

If we're searching for description metatags, the Mid function removes the beginning 34 characters and trailing 2 characters from the string provided in the string variable strSln:

Tag = Mid(strSln, 34, (Len(strSln) - 36))

Conclusion

This article provided step-by-step procedures, complete sample code, and commentary explaining the code, for a sample solution using the Word 97 and Excel 97 object models to copy the content of HTML tags to an Excel worksheet.

The same solution could easily be modified to automate retrieving other text elements from HTML files, for example hyperlinks and keywords. This solution could also be extended to perform project management tasks such as comparing the file inventory listed in a database or worksheet against what's actually stored in the folders on a Web server.