David Shank
Microsoft Corporation
December 9, 1999
Using the terms "Office" and "HTML" in the same sentence used to be a rarity. Occasionally you'd hear someone complain, "I wish I could save my Office document as an HTML file," but that was about it.
The combination of the widespread use of Microsoft Office and the widespread use of the Internet brought requests that Office documents satisfy the best of both worlds. It really wasn't until the release of Microsoft Office 2000 that we got used to the idea that an Office document and an HTML document could be one and the same. Office documents now support HTML as a native file format and use XML to preserve information about a document necessary to re-create it as an Office document once it has been saved as HTML. There really isn't anything special about the HTML in an Office document, so working with it is just like working with any HTML. By using HTML and XML, Office documents and data can be stored, distributed, and viewed using most Web browsers, while retaining the functionality of Office documents.
Certain HTML and XML tags help "round-trip" the document for editing purposes. For example, if you create the document in Word 2000 and save it as HTML, the code embedded in the document allows you to re-open the document in Word 2000. (Read the reference for more information on the Office HTML and XML file formats.)
But alas, every silver lining contains a dark cloud (or something like that). The bad news, to some folks at least, was that these HTML files could get quite large. If you don't need to "round-trip" the document, there is no need to preserve the Office-specific HTML and XML. You can learn more about Office HTML and XML and download a tool (http://www.officeupdate.com/2000/downloadDetails/Msohtmf2.htm?s=/downloadCatalog/dldWord.asp) for removing Office-specific markup tags embedded in Office 2000 documents.
This month, I'm going to talk about working with the HTML in Office documents and show you how you can remove unwanted HTML or XML from an Office document before programmatically saving a document as an HTML page. We'll use the Word object model and the FrontPage object model to work with HTML in each application.
The scenario I will use is designed to illustrate how to take the portions of the HTML from a Word document and create a Web page based on that HTML. To be more specific, imagine you are working on a Word document and you want to run a Visual Basic for Applications procedure in Word to save your document as a page in a FrontPage Web site.
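To accomplish this goal, I'll write Visual Basic for Applications code to extract the <BODY> and <STYLE> HTML from the Word document, open or create the FrontPage Web site, locate or add the Web page, strip any existing <STYLE> tags from that page, insert the extracted HTML, and save the page.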
To follow along with me in this example, first open a new Word document, add some text, and assign styles and formatting to the text. The following graphic shows a Word document formatted in this manner.
Next, press ALT-F11 to open the Visual Basic Editor, and then choose Module from the Insert menu. Set the module Name property in the lower-left pane to modOfficeHTML.
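Next, choose References from the Tools menu and set a reference to the two FrontPage type libraries: the FrontPage Page object library (FrontPageEditor in code) and the FrontPage Web object library (FrontPage in code).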
Setting these references to the FrontPage Page and Web object models allows you to work with the objects in those type libraries as they are used in the sample code below.
Short of actually saving an Office document as an HTML page, the only way to see its HTML representation is to view the document using the Microsoft Script Editor. (You can press ALT-SHIFT-F11 to open the script editor, but you don't have to do so to make this example work.)
But we don't want to look at the HTML; we want to work with it programmatically. To do this, use the HTMLProjectItem object. In this case, I will use the Text property of the HTMLProjectItem object to get the HTML for the active Word document. What I want to do is extract only the <BODY> and <STYLE> portions of the HTML so I can insert them into an existing Web page. I use the GetHTMLPart function to do this work.
Function GetHTMLPart(wdDoc As Word.Document, strStartTag As String, _
        strEndTag As String) As String
    ' This procedure returns the portion of the HTML in a Word document
    ' beginning with the HTML tag in the strStartTag variable and
    ' ending with the HTML tag in the strEndTag variable.
    Dim strText As String
    Dim lngStartPos As Long
    Dim lngEndPos As Long

    ' First get all of the HTML in the document.
    strText = wdDoc.HTMLProject.HTMLProjectItems(wdDoc.Name).Text
    ' Locate the position of the starting tag (case-sensitive search).
    lngStartPos = InStr(strText, strStartTag)
    ' Return an empty string if the starting tag is not found.
    If lngStartPos = 0 Then Exit Function
    ' Locate the ending tag, searching from the starting tag onward.
    lngEndPos = InStr(lngStartPos, strText, strEndTag)
    If lngEndPos = 0 Then Exit Function
    lngEndPos = lngEndPos + Len(strEndTag)
    ' Return the HTML from the starting tag through the ending tag.
    GetHTMLPart = Mid$(strText, lngStartPos, lngEndPos - lngStartPos)
End Function
An example will help illustrate how this function works. The following code returns all the HTML within the document's <BODY></BODY> tag pair to the strBodyHTML variable:
strBodyHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")
The procedure works because there is only one <BODY></BODY> tag pair in any document. Note that the closing bracket is left off the start tag argument: the opening tag contains other text, and we cannot be sure what that text is or where the tag ends. All we really care about is where the tag begins. Note also that the tag name in the code is lowercase. The tags in an Office document's HTML are all lowercase, and InStr performs a case-sensitive comparison by default, so if you passed "<BODY" as the first argument, InStr would not find the tag and the GetHTMLPart function would not return any HTML.
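If you would rather not depend on the case of the tags, the same search can be made case-insensitive. The following is a minimal sketch, not part of the original sample: ExtractPart is a hypothetical helper that works on any string rather than on a Word document, so you can try it in the Immediate window without a document open.

```vba
' Hypothetical helper (not part of the original sample): returns the
' text from strStartTag through strEndTag, ignoring letter case.
Function ExtractPart(strText As String, strStartTag As String, _
        strEndTag As String) As String
    Dim lngStartPos As Long
    Dim lngEndPos As Long

    ' vbTextCompare makes InStr ignore letter case.
    lngStartPos = InStr(1, strText, strStartTag, vbTextCompare)
    If lngStartPos = 0 Then Exit Function
    ' Search for the ending tag from the starting tag onward.
    lngEndPos = InStr(lngStartPos, strText, strEndTag, vbTextCompare)
    If lngEndPos = 0 Then Exit Function
    ' Return everything from the starting tag through the ending tag.
    ExtractPart = Mid$(strText, lngStartPos, _
        lngEndPos + Len(strEndTag) - lngStartPos)
End Function
```

With this variant, passing "<body" or "<BODY" gives the same result, and a missing tag yields an empty string instead of a run-time error.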
The GetHTMLPart procedure locates the starting and ending points for this tag pair and then uses the Mid$ function to return only the HTML between the starting and ending points. This same function can be used to extract the HTML style sheet information represented by the <STYLE></STYLE> tag pair:
strStyleHTML = GetHTMLPart(ActiveDocument, "<style", "</style>")
That is really all there is to it. By extracting the <BODY> and <STYLE> HTML, we leave behind any Office-specific HTML or XML. Now all we need to do is insert the extracted HTML into a Web page.
In our scenario, we want to add a new Web page to a FrontPage Web site. We will need a function that checks a URL to see whether it represents an existing FrontPage Web site. If it does, we will add a new page to that site. If it does not, we will create a new FrontPage Web site, and then we'll add a new page to it.
The following function accepts a single argument representing the URL to a FrontPage Web site.
Function CreateNewWeb(strPath As String) As FrontPage.Web
    Dim wdNewWeb As FrontPage.Web

    On Error Resume Next
    ' Check to see if strPath represents an existing FrontPage web.
    If FrontPage.Webs.Count > 0 Then
        For Each wdNewWeb In FrontPage.Webs
            If UCase(wdNewWeb.Url) = UCase(strPath) Then
                ' Return pointer to the existing FrontPage web
                ' and exit this function.
                Set CreateNewWeb = wdNewWeb
                Exit Function
            End If
        Next wdNewWeb
    End If
    ' The URL in strPath does not reference an existing web, so
    ' add this web to the Webs collection.
    Set CreateNewWeb = FrontPage.Webs.Add(strPath)
End Function
The CreateNewWeb function steps through each Web object in the Webs collection, comparing the URL in the strPath argument with the URL for each Web object. If a match is found, the function returns the matching Web object to the calling procedure. If no match is found, the function uses the Webs collection Add method to create a new Web site using the folder path contained in strPath. The following code illustrates how to call the CreateNewWeb function using a URL to a folder on the C: drive (note that all the variables needed for code in the remainder of this article are dimensioned here as well):
Dim strWebPath As String
Dim strNewPage As String
Dim wbWeb As FrontPage.Web
Dim wbFile As FrontPage.WebFile
Dim wbFiles As FrontPage.WebFiles
Dim pwWindow As FrontPage.PageWindow
Dim fpDoc As FrontPageEditor.FPHTMLDocument
Dim fpBody As FrontPageEditor.FPHTMLBody
strWebPath = "file:///C:/Example Webs/WebSampleOne"
Set wbWeb = CreateNewWeb(strWebPath)
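The URL comparison inside CreateNewWeb can also be isolated in a small function, which makes it easier to test on its own. This is only a sketch: SameUrl and DropTrailingSlash are hypothetical names, and the idea that the two URLs might differ by a trailing slash is an assumption on my part, not something the FrontPage object model documents.

```vba
' Hypothetical helpers, not part of the original sample.
Function SameUrl(strUrlA As String, strUrlB As String) As Boolean
    ' StrComp returns 0 when the two strings match; vbTextCompare
    ' makes the comparison ignore letter case.
    SameUrl = (StrComp(DropTrailingSlash(strUrlA), _
        DropTrailingSlash(strUrlB), vbTextCompare) = 0)
End Function

Function DropTrailingSlash(strUrl As String) As String
    ' Assumption: a single trailing "/" does not change which
    ' Web site the URL refers to.
    If Right$(strUrl, 1) = "/" Then
        DropTrailingSlash = Left$(strUrl, Len(strUrl) - 1)
    Else
        DropTrailingSlash = strUrl
    End If
End Function
```

In CreateNewWeb, the test `UCase(wdNewWeb.Url) = UCase(strPath)` could then be written as `SameUrl(wdNewWeb.Url, strPath)`.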
Now that we have a variable representing a FrontPage Web site on disk, we can add our new Web page to that site. However, because the CreateNewWeb function can return either a new Web site or a pointer to an existing site, we should ensure that our new page does not already exist before we try to create it.
You can use the Web object's LocateFile method to find an existing Web page. If the Web page we want to create already exists, we want to display a message box asking whether the file should be overwritten. If the Web page does not exist, we want to create it.
On Error Resume Next
strNewPage = "WordDoc.htm"
Set wbFile = wbWeb.LocateFile(strNewPage)
If Not wbFile Is Nothing Then
    If MsgBox("The file named '" & strNewPage & "' " _
        & "already exists. Do you want to replace it " _
        & "with the active document?", vbOKCancel, _
        "Replace existing file?") = vbCancel Then
        Exit Function
    End If
Else
    Set wbFiles = wbWeb.RootFolder.Files
    Set wbFile = wbFiles.Add(strNewPage, False)
End If
At this point, the wbFile object variable represents the Web page on the FrontPage Web site into which we will insert the HTML from our active Word document.
Since we may be working with a preexisting page, and since we are going to add <STYLE> information from our active Word document, we have to strip out any existing <STYLE> tags in the page.
To do this work, I've written the StripTags function, which removes every instance of a given HTML tag from a Web page. You pass in the FrontPage Document object you want to work with and the name of the tag you want to remove.
You can get a Document object from the Document property of a FrontPage PageWindow object, and you can get the PageWindow object by calling the Edit method of the FrontPage WebFile object.
' Get the PageWindow object.
Set pwWindow = wbFile.Edit(fpPageViewNormal)
' Get the Document object.
Set fpDoc = pwWindow.Document
' Remove preexisting <style> tags.
If fpDoc.all.tags("style").Length > 0 Then
    StripTags fpDoc, "style"
End If
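Function StripTags(fpDoc As FrontPageEditor.FPHTMLDocument, _
        strTagName As String) As Boolean
    ' Removes every element with the tag name in strTagName from the page.
    Dim intTag As Integer

    ' Work backward so that removing an element does not shift the
    ' index of the elements still to be removed.
    For intTag = fpDoc.all.tags(strTagName).Length - 1 To 0 Step -1
        fpDoc.all.tags(strTagName).Item(intTag).outerHTML = ""
    Next intTag
    StripTags = True
End Function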
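StripTags needs a page open in FrontPage. If you only want to see the effect of removing a tag pair and its contents, the same idea can be sketched on a plain string. This swaps in a different technique (string searching rather than the FrontPage document object model), and RemoveTagBlocks is a hypothetical name. It assumes every opening tag has a matching closing tag, and it would also match longer tag names that begin with the same letters.

```vba
' Hypothetical helper (not part of the original sample): removes every
' <tag ...>...</tag> block for the given tag name from an HTML string.
Function RemoveTagBlocks(strHTML As String, strTagName As String) As String
    Dim strResult As String
    Dim strOpen As String
    Dim strClose As String
    Dim lngStart As Long
    Dim lngEnd As Long

    strResult = strHTML
    strOpen = "<" & strTagName
    strClose = "</" & strTagName & ">"
    lngStart = InStr(1, strResult, strOpen, vbTextCompare)
    Do While lngStart > 0
        lngEnd = InStr(lngStart, strResult, strClose, vbTextCompare)
        ' An unclosed tag is left alone rather than guessed at.
        If lngEnd = 0 Then Exit Do
        ' Keep the text before the block and the text after it.
        strResult = Left$(strResult, lngStart - 1) & _
            Mid$(strResult, lngEnd + Len(strClose))
        lngStart = InStr(lngStart, strResult, strOpen, vbTextCompare)
    Loop
    RemoveTagBlocks = strResult
End Function
```

For example, calling RemoveTagBlocks on the text of a page with "style" as the tag name returns the page text with all of its style sheets removed, whatever the case of the tags.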
Now that we have cleaned up the Web page, all we have left to do is to add the <STYLE> and <BODY> portions of the active Word document to the Web page.
Most of the work required to get the Word HTML out of the active document and into the FrontPage Web page is done by the GetHTMLPart function discussed at the start of this article. Remember that the GetHTMLPart function returns a portion of the HTML from the Word document. In this case, we need it to return the <STYLE> and the <BODY> portions of the Word document. Once we have that HTML from the Word document, we need only insert the <STYLE> information within the <HEAD></HEAD> tag pair and replace the existing <BODY> HTML with the <BODY> HTML from the Word document.
The following code illustrates how to replace the <BODY> portion of the page with the <BODY> HTML from the Word document:
Set fpBody = fpDoc.body
fpBody.outerHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")
The last step inserts the <STYLE> HTML from the Word document just before the closing </HEAD> tag of the page (any earlier <STYLE> tags were already stripped out):
fpDoc.all.tags("head").Item(0).insertAdjacentHTML "BeforeEnd", _
    GetHTMLPart(ActiveDocument, "<style", "</style>")
Finally, to save the changes to the Web page, call the Close method of the PageWindow object and pass True as its argument:
pwWindow.Close True
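The fragments shown so far can be assembled into a single procedure. The following is a sketch under stated assumptions rather than a tested implementation: SaveActiveDocToWeb is a hypothetical name, the Web site path and page name are the sample values used earlier, and the procedure relies on the CreateNewWeb, GetHTMLPart, and StripTags functions from this article and on the two FrontPage type library references.

```vba
' Sketch only: assembles the fragments from this article into one
' procedure. The name SaveActiveDocToWeb is hypothetical.
Sub SaveActiveDocToWeb()
    Const strNewPage As String = "WordDoc.htm"
    Dim wbWeb As FrontPage.Web
    Dim wbFile As FrontPage.WebFile
    Dim pwWindow As FrontPage.PageWindow
    Dim fpDoc As FrontPageEditor.FPHTMLDocument

    ' Open the Web site, or create it if it does not exist.
    Set wbWeb = CreateNewWeb("file:///C:/Example Webs/WebSampleOne")
    If wbWeb Is Nothing Then Exit Sub

    ' Locate the page; ignore the error if it cannot be found.
    On Error Resume Next
    Set wbFile = wbWeb.LocateFile(strNewPage)
    On Error GoTo 0

    If wbFile Is Nothing Then
        ' The page does not exist yet, so add it.
        Set wbFile = wbWeb.RootFolder.Files.Add(strNewPage, False)
    ElseIf MsgBox("The file named '" & strNewPage & "' " _
            & "already exists. Do you want to replace it " _
            & "with the active document?", vbOKCancel, _
            "Replace existing file?") = vbCancel Then
        Exit Sub
    End If

    ' Open the page and remove any style sheets it already has.
    Set pwWindow = wbFile.Edit(fpPageViewNormal)
    Set fpDoc = pwWindow.Document
    StripTags fpDoc, "style"

    ' Insert the <BODY> and <STYLE> HTML from the Word document.
    fpDoc.body.outerHTML = GetHTMLPart(ActiveDocument, "<body", "</body>")
    fpDoc.all.tags("head").Item(0).insertAdjacentHTML "BeforeEnd", _
        GetHTMLPart(ActiveDocument, "<style", "</style>")

    ' Save the changes and close the page.
    pwWindow.Close True
End Sub
```

Because this version is a Sub, it uses Exit Sub where the earlier fragment used Exit Function.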
If we ran this code against the Word document shown in the first section of this article, we would create a Web page that looks remarkably similar to the original Word document, but without the unnecessary HTML or XML that Word inserts in its documents.