Tips for XML


Use nodeFromID, rather than selectSingleNode

Although you can search an XML document for a specific entry with the selectSingleNode function, you sacrifice performance by doing so. In addition to performing a (relatively slow) linear search through the data, selectSingleNode must also parse the specified search criteria. When you are looking for a single unique entry, it is much more efficient to use nodeFromID, which avoids both the query parsing and the linear search. nodeFromID is a Microsoft extension to the W3C XML DOM that enables fast lookups using an ID attribute value and a hash table.

In order for the XML parser to recognize ID attributes, you must use a DTD or schema for your document. There are a few examples of this in Scenario 3. The following lines are taken from the lingoSchema.xml file:

<Schema name="LingoSchema" xmlns="urn:schemas-microsoft-com:xml-data" 
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
   <AttributeType name="LID" dt:type="id" required="yes" />

In this example, the LID attribute is defined with a data type of "id", so the nodeFromID function can use LID values to look up individual elements in the XML source document quickly. Note that ID values must be unique within the document, that no element can specify more than one ID attribute, and that each value must follow the convention for ID values (namely, the ID string must begin with a letter or underscore).
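
The difference between the two lookups can be sketched as follows. This is an illustration rather than code from Scenario 3. To stay self-contained, it declares LID as an ID attribute in an inline DTD instead of loading lingoSchema.xml. The lingo and entry element names, the LID values, and the CompareLookups routine are hypothetical, and the MSXML2 library name assumes a project reference to Microsoft XML, v3.0.

```vb
'requires a project reference to Microsoft XML (v3.0 assumed)
Private Sub CompareLookups()
   Dim doc As MSXML2.DOMDocument
   Dim byId As IXMLDOMNode
   Dim byQuery As IXMLDOMNode
   Dim src As String

   'inline DTD that declares LID as an attribute of type ID
   src = "<!DOCTYPE lingo [" & _
         "<!ELEMENT lingo (entry*)>" & _
         "<!ELEMENT entry (#PCDATA)>" & _
         "<!ATTLIST entry LID ID #REQUIRED>]>" & _
         "<lingo>" & _
         "<entry LID='L100'>Hello</entry>" & _
         "<entry LID='L200'>Goodbye</entry>" & _
         "</lingo>"

   Set doc = New MSXML2.DOMDocument
   doc.async = False
   If Not doc.loadXML(src) Then
      Debug.Print doc.parseError.reason
      Exit Sub
   End If

   'lookup by ID value; no query string has to be parsed
   Set byId = doc.nodeFromID("L200")

   'same entry located by parsing a query and searching the tree
   Set byQuery = doc.selectSingleNode("//entry[@LID='L200']")

   If Not byId Is Nothing Then Debug.Print byId.Text
   If Not byQuery Is Nothing Then Debug.Print byQuery.Text
End Sub
```

Both calls are expected to find the same entry element. Only the nodeFromID call depends on the ID declaration, so if the declaration is missing, that call should return Nothing while the query still finds the element.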

Starting in Windows 2000, an element can be of type "id" as shown in the following code:

<sample xmlns:dt="urn:schemas-microsoft-com:datatypes">
   <elem dt:dt="id">sampleid</elem>
</sample>

In the preceding example, the elem element has a type of "id", so its text value identifies its parent, the sample element. For example, if you pass the string "sampleid" to the nodeFromID method, the method returns the sample element:

Dim node As IXMLDOMNode
Set node = xmldoc.nodeFromID("sampleid")


Avoid Recursion by Using a Node List

As part of the build process, the Litware.Lingo component needs to search the HTML source files for LID attributes. Previously, we passed the documentElement of the globalized XML template file to the parse routine, which traversed the entire tree looking for LID attributes by calling itself recursively on each child node. The following code fragment shows how we used to do this translation, calling GetLingoValues to merge the XML template file with lingo elements:

   Call GetLingoValues(gXML.documentElement)

Private Sub GetLingoValues(ByRef elem As IXMLDOMNode)
   Dim item As IXMLDOMNode
   'test the node that was passed in; item is not set until the loop below
   If (elem.nodeTypeString = "element") Then
      If elem.Attributes.length > 0 Then
         'get attributes from lingo file and append
         Call GetNodeValue(elem)
      End If
      For Each item In elem.childNodes
         'calling recursively.
         Call GetLingoValues(item)
      Next
   End If
End Sub

Although this method works, it has to search the entire XML hierarchy for elements to merge. This can be slow, especially since the search is being performed through Automation interfaces. We learned that instead of passing the root documentElement, we could let the XML parser provide the collection of nodes that contain a LID attribute. Using this approach, we start with a collection of nodes that match the search criteria. Moreover, the search can be performed much more efficiently by the XML parser using optimized internal routines.

The following code fragment has been modified to use a node list:

   'pass collection of nodes with expando property LID
   Call GetLingoValues(gXML.selectNodes("//*[@LID]"))

Private Sub GetLingoValues(ByRef colls As IXMLDOMNodeList)
   Dim item As IXMLDOMNode
   For Each item In colls
      If (item.nodeTypeString = "element") Then
         'get attributes from lingo file and append
         Call GetNodeValue(item)
      End If
   Next
End Sub
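
To see what the parser hands back for this query, a small stand-alone sketch such as the following may help. It is not part of Litware.Lingo: the ListLidElements routine and the sample markup are hypothetical, and the MSXML2 library name assumes a project reference to Microsoft XML, v3.0.

```vb
'requires a project reference to Microsoft XML (v3.0 assumed)
Private Sub ListLidElements()
   Dim doc As MSXML2.DOMDocument
   Dim hits As IXMLDOMNodeList
   Dim item As IXMLDOMNode

   Set doc = New MSXML2.DOMDocument
   doc.async = False
   If Not doc.loadXML("<html><body>" & _
                      "<p LID='L1'>a</p>" & _
                      "<div><span LID='L2'>b</span><b>c</b></div>" & _
                      "</body></html>") Then
      Debug.Print doc.parseError.reason
      Exit Sub
   End If

   'the parser walks the tree once and returns only the matching elements
   Set hits = doc.selectNodes("//*[@LID]")
   Debug.Print hits.length

   For Each item In hits
      'every node in the list is an element that carries a LID attribute
      Debug.Print item.nodeName & " -> " & _
                  item.Attributes.getNamedItem("LID").Text
   Next
End Sub
```

In this sample only the p and span elements carry a LID attribute, so the list should contain those two nodes and skip the html, body, div, and b elements.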


Changing XML Character Encoding

During the build process, the Litware.Lingo component combines text stored in two separate XML source files. Because the resulting file can contain Unicode characters, we had to be careful to save the results in the proper character encoding. At first, when the character data was copied from the lingo file, it was garbled as it was written to the resulting HTML file. We tried several approaches before hitting upon the "right" solution.

First we thought we needed to remove the Unicode characters from the lingo file and represent them with escape sequences. We later found out that the XML parser can read and write Unicode characters directly, so this was not the problem. Then we decided that the problem lay in how the characters were represented in the HTML file. We briefly experimented with converting extended characters into their corresponding entity representation. For example, the Japanese character represented by hexadecimal 3042 would be converted to "&#x3042;" in the final HTML. However, we quickly discovered that we could not write these strings to the XMLDocument object, because it escaped each ampersand (&) as &amp; and all the entities appeared as plain text in the HTML.
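
The escaping behavior can be reproduced in a few lines. The following is a minimal sketch rather than code from Litware.Lingo: the ShowAmpersandEscaping routine and the p element are hypothetical, and the MSXML2 library name assumes a project reference to Microsoft XML, v3.0.

```vb
'requires a project reference to Microsoft XML (v3.0 assumed)
Private Sub ShowAmpersandEscaping()
   Dim doc As MSXML2.DOMDocument

   Set doc = New MSXML2.DOMDocument
   doc.async = False
   If Not doc.loadXML("<p/>") Then Exit Sub

   'the entity string is stored as literal characters, so the
   'ampersand is escaped when the document is serialized
   doc.documentElement.Text = "&#x3042;"
   Debug.Print doc.xml

   'assigning the character itself stores real character data
   doc.documentElement.Text = ChrW(&H3042)
   Debug.Print doc.xml
End Sub
```

The first Debug.Print is expected to show the ampersand written out as &amp;. The Immediate window may not be able to display the character printed by the second call, but the character is stored in the document as character data rather than as entity text.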

So, we contacted the program manager of the XML team at Microsoft and asked him if he could help us solve our problem. He reminded us that we could control the encoding used by XMLDocument.save by adding an XML declaration, in the form of a processing instruction, to the top of the file. For example, to programmatically change the character encoding of the file to UTF-16, we could use the following:

Dim pi As IXMLDOMProcessingInstruction
Set pi = xmldoc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'")
xmldoc.insertBefore pi, xmldoc.firstChild

We tried it and it worked! You can see how we used createProcessingInstruction in the private GetCharsetValue subroutine of the Litware.Lingo component. (See Lingo.cls.)
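
A slightly more defensive variant of the preceding fragment might first check whether the document already begins with an XML declaration, so that the document never ends up with two. The following sketch is not the GetCharsetValue subroutine from Lingo.cls. The SaveAsUtf16 routine and its parameters are hypothetical, and the MSXML2 library name assumes a project reference to Microsoft XML, v3.0.

```vb
'requires a project reference to Microsoft XML (v3.0 assumed)
Private Sub SaveAsUtf16(ByVal xmldoc As MSXML2.DOMDocument, _
                        ByVal path As String)
   Dim pi As IXMLDOMProcessingInstruction
   Dim first As IXMLDOMNode

   Set pi = xmldoc.createProcessingInstruction("xml", _
            "version='1.0' encoding='UTF-16'")
   Set first = xmldoc.firstChild

   If first Is Nothing Then
      'empty document: the declaration becomes the only node
      xmldoc.appendChild pi
   ElseIf first.nodeType = NODE_PROCESSING_INSTRUCTION _
          And first.nodeName = "xml" Then
      'a declaration already exists: replace it
      xmldoc.replaceChild pi, first
   Else
      'no declaration yet: insert one ahead of everything else
      xmldoc.insertBefore pi, first
   End If

   'save uses the encoding named in the declaration
   xmldoc.save path
End Sub
```

A call such as SaveAsUtf16 gXML, "output.htm" would then write the merged document in UTF-16. The file name here is only a placeholder.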