Walking Through the Object Model

When enumerating the images that actually make up an HTML page, we can rely on a built-in collection exposed by the document object. As explained earlier, any Dynamic HTML collection returns each file name complete with its original path and protocol. Our purpose here is to arrange a packaging procedure for Scriptlets, so we're interested in all the files that contribute to the correct working of the Scriptlet itself, and in particular the local ones. If the Scriptlet refers to a remote image or control, you don't need to distribute it; you can rely on the browser to download it and register it properly.

For a more readable user interface, we remove the protocol from the actual file name. As mentioned, all the files we're working with are local.
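The prefix removal described above can be isolated in a small helper. The following is only a sketch: the function name StripFileProtocol is hypothetical and does not appear in the utility's source, and the comparison is made case-insensitive as a defensive assumption.

```vb
' Hypothetical helper (not part of the original utility):
' returns the URL without a leading "file:///" prefix, if one is present.
Private Function StripFileProtocol(ByVal url As String) As String
  Const PREFIX As String = "file:///"

  ' Compare the first characters of the URL with the prefix, ignoring case
  If StrComp(Left$(url, Len(PREFIX)), PREFIX, vbTextCompare) = 0 Then
     ' Keep everything that follows the prefix
     StripFileProtocol = Mid$(url, Len(PREFIX) + 1)
  Else
     ' No prefix: return the string unchanged
     StripFileProtocol = url
  End If
End Function
```

With such a helper, each enumeration routine could reduce its prefix test to a single call, for example s = StripFileProtocol(s).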

Enumerating the Images

The enumeration touches three main categories of elements: images, objects, and links. Images are the simplest to handle. We only have to consider the collection of the <IMG> tags and the background image.

Although the Dynamic HTML object model exposes a collection called document.images, it does not include every possible image: the collection is limited to the files referenced through an <IMG> tag. An image used as the page background is still an image, and must be considered. (For Scriptlets, however, having a background image is a rarity.)

Private Sub EnumerateImages(ByVal doc As HTMLDocument, ByVal list As ListBox)
On Error Resume Next
  Dim i As Integer
  ' s is declared explicitly as a Variant, because getAttribute returns a Variant
  Dim s As Variant, temp As String
  
  ' <IMG> tag elements
  For i = 0 To doc.images.length - 1
      ' remove file:///
      s = doc.images.Item(i).href
      temp = Left$(s, 8)
      If temp = "file:///" Then
         s = Right$(s, Len(s) - 8)
      End If
      
      Add list, s
  Next
  
  ' BACKGROUND image, if any
  s = doc.body.getAttribute("BACKGROUND")
  If Len(s) > 0 Then
      ' remove file:///
      temp = Left$(s, 8)
      If temp = "file:///" Then
         s = Right$(s, Len(s) - 8)
      End If

      Add list, s
  End If
End Sub

The source code above first scans the <IMG> collection and then checks for the background image. A background image is defined as an attribute of the body object. In particular, you need to check the BACKGROUND attribute:

s = doc.body.getAttribute("BACKGROUND")

In this case, the name, complete with path and protocol, will be returned only if path and protocol are actually coded in the HTML source. The following is typical code for a background image.

<body background="yellow_grad.gif">
…
</body>

Enumerating the Objects

We don't have a built-in collection for the elements rendered through an <OBJECT> tag. So we must arrange a dynamic collection starting from the generic document.all collection.

Scriptlets are coded through the <OBJECT> tag, as are ActiveX controls. To distinguish between them, we must use the classid and the data attributes. They are mutually exclusive, and whichever one is present identifies the component uniquely. For an ActiveX control, classid holds the object's CLSID, that is, a 128-bit identifier that points to a registry location for the actual file name. For a Scriptlet, data holds the name of the HTML file implementing it.

Private Sub EnumerateObjects(ByVal doc As HTMLDocument, ByVal list As ListBox)
On Error Resume Next
  Dim i As Integer
  Dim s As String
  Dim obj As IHTMLElementCollection
  
  Set obj = doc.all.tags("OBJECT")
  
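  ' <OBJECT> tag elements
  For i = 0 To obj.length - 1
      ' reset s so that a failed read cannot reuse the previous element's value
      s = ""
      s = obj.Item(i).getAttribute("classid")
      If Len(s) = 0 Then
         ' no classid: treat the element as a Scriptlet and read its data attribute
         s = obj.Item(i).getAttribute("data")
         If Len(s) > 0 Then s = App.Path & "\" & s
      Else
         ' remove "clsid:" and wrap the identifier in {}
         s = "{" & Right$(s, Len(s) - 6) & "}"
      End If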
      
      If Len(s) > 0 Then
         Add list, s
      End If
  Next
End Sub

The next screenshot shows an example of how this DHTML X-Ray utility works. Once you've examined the content of the chosen HTML file, you get a list of file names. Notice a couple of things. The first is the highlighted item, which is an example of how the above listing works. The second is that the items include both slashes and backslashes as path name separators. While backslashes are perfectly normal, the slashes are inherited from the Dynamic HTML object model collections.

Getting an ActiveX Control's File Name

While the name of the Scriptlet is clear and readable, any ActiveX control hosted in an HTML page is indirectly referenced through its unambiguous CLSID. We saw before how to extract the CLSID from an HTML <OBJECT> element. Now we must face the problem of recovering the actual OCX or DLL file name from the CLSID.

ActiveX controls are registered in special locations in the Windows registry. At least, this is what occurs under Win32-compliant platforms. Elsewhere, we have system objects that resemble and simulate the Win32 registry.

In Win32, therefore, ActiveX information is stored under the

HKEY_CLASSES_ROOT\
  CLSID\
    {…}\
      InProcServer32

key, where {…} denotes a string like the one highlighted in the previous screenshot. The same screenshot contains a copy of the calendar control, whose system information is stored under the path illustrated by the Registry Editor.

Once you have the CLSID as a string, it's relatively easy to read the actual library name from the registry by means of some Win32 API functions. (See the section Further Reading, later in this chapter.)
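As a rough illustration of that step, the sketch below reads the default value of the InProcServer32 key through the ANSI versions of three Win32 registry functions exported by advapi32.dll. The function name GetServerFromCLSID and the 260-character buffer size are assumptions made for this example, and values stored with environment variables to be expanded are returned as they are.

```vb
Private Declare Function RegOpenKeyEx Lib "advapi32.dll" Alias "RegOpenKeyExA" _
  (ByVal hKey As Long, ByVal lpSubKey As String, ByVal ulOptions As Long, _
   ByVal samDesired As Long, phkResult As Long) As Long
Private Declare Function RegQueryValueEx Lib "advapi32.dll" Alias "RegQueryValueExA" _
  (ByVal hKey As Long, ByVal lpValueName As String, ByVal lpReserved As Long, _
   lpType As Long, ByVal lpData As String, lpcbData As Long) As Long
Private Declare Function RegCloseKey Lib "advapi32.dll" (ByVal hKey As Long) As Long

Private Const HKEY_CLASSES_ROOT As Long = &H80000000
Private Const KEY_QUERY_VALUE As Long = &H1
Private Const ERROR_SUCCESS As Long = 0

' Hypothetical helper: clsid is expected in the "{...}" form, braces included.
' Returns the server file name, or an empty string if the key cannot be read.
Private Function GetServerFromCLSID(ByVal clsid As String) As String
  Dim hKey As Long, valueType As Long, cb As Long, pos As Long
  Dim buffer As String

  If RegOpenKeyEx(HKEY_CLASSES_ROOT, "CLSID\" & clsid & "\InProcServer32", _
                  0, KEY_QUERY_VALUE, hKey) <> ERROR_SUCCESS Then Exit Function

  ' Assumed buffer size; the buffer is pre-filled with null characters
  cb = 260
  buffer = String$(cb, vbNullChar)

  ' An empty value name asks for the key's default value
  If RegQueryValueEx(hKey, "", 0, valueType, buffer, cb) = ERROR_SUCCESS Then
     ' Keep only the text that precedes the first null character
     pos = InStr(buffer, vbNullChar)
     If pos > 0 Then
        GetServerFromCLSID = Left$(buffer, pos - 1)
     Else
        GetServerFromCLSID = buffer
     End If
  End If

  RegCloseKey hKey
End Function
```

The string produced by EnumerateObjects for an ActiveX control already has the braces in place, so it could be passed to a function like this one unchanged.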

Enumerating the Links

An HTML file can contain links to a variety of sources. A link might point to an e-mail address, to a local file, to a paragraph later in the same page, or to a remote site through several protocols. So enumerating all the links is easy in one way, because we just need to resort to the built-in collection document.links, but it is complex too, because we must distinguish between the different types of links.

In short, we're developing a tool that needs to know how many—and what type of—files are necessary to a given Scriptlet. For this purpose, we're only interested in local references. In particular, we'll be discarding all the links, except those that begin with file://.

Private Sub EnumerateLinks(ByVal doc As HTMLDocument, ByVal list As ListBox)
On Error Resume Next
  Dim i As Integer
  Dim s As String, temp As String
  
  ' <A> tag elements
  For i = 0 To doc.links.length - 1
      s = doc.links.Item(i).href

      ' remove file:///
      temp = Left$(s, 8)
      If temp = "file:///" Then
         s = Right$(s, Len(s) - 8)
      End If
      
      Add list, s
  Next
End Sub
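Note that the listing strips the prefix where it finds one, but passes every link on to Add. If Add does not itself discard the non-local references, the test could be made explicit with a helper along these lines. This is only a sketch, and the function name IsLocalFileUrl is hypothetical.

```vb
' Hypothetical helper: True only for links that begin with "file://"
Private Function IsLocalFileUrl(ByVal url As String) As Boolean
  ' Case-insensitive comparison of the first seven characters
  IsLocalFileUrl = (StrComp(Left$(url, 7), "file://", vbTextCompare) = 0)
End Function
```

Inside the loop, the prefix removal and the call to Add would then be wrapped in an If IsLocalFileUrl(s) Then ... End If block, so that mailto:, http:, and similar links never reach the list.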

The links we're primarily interested in for Scriptlets raise the problem of recursive search. In fact, links point to other HTML files that are needed to enable the component to function correctly. In addition, these linked HTML files have their own content, with other files that might be needed to enable them to work correctly, and so on.

The recursion takes place in the Add subroutine, which we discussed earlier. Each file the original HTML is linked to is opened in a hidden WebBrowser control. When loading completes, a DocumentComplete event is raised for this second browsing object.

Private Sub WebBrowser2_DocumentComplete(ByVal pDisp As Object, URL As Variant)
  DoWalkElements WebBrowser2.Document, List1
End Sub

As shown above, we call DoWalkElements again, but pass a different document object.
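Because linked pages can link back to one another, a recursive walk of this kind could end up visiting the same file more than once. The Add subroutine discussed earlier may already take care of this; if it does not, one possible guard is to remember the files already examined in a keyed Collection. The names m_visited and MarkVisited below are hypothetical.

```vb
' Hypothetical module-level store of the files already examined
Private m_visited As New Collection

' Returns True the first time a file name is seen, False on any repeat
Private Function MarkVisited(ByVal fileName As String) As Boolean
  On Error Resume Next
  ' Adding an item whose key already exists raises a run-time error,
  ' which is used here to detect a repeated file name
  m_visited.Add fileName, LCase$(fileName)
  MarkVisited = (Err.Number = 0)
  Err.Clear
End Function
```

A linked file would then be loaded into the hidden WebBrowser control only when MarkVisited returns True for its name.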