Frequently Asked Questions About XML

Updated: September 23, 1999

(See the XML Technical FAQ for additional information in Q&A format.)

XML, the language

What is XML?
Does XML replace HTML?
What are the benefits of adding XML to HTML?
How does XML fit into the Microsoft® Windows® Distributed interNet Applications (Windows DNA) strategy for building three-tier, Web-enabled applications?
Where will XML be used on the Web?
Does Microsoft Internet Explorer 4 support XML?
What is the level of XML support in Internet Explorer 5?
What is the difference between SGML and XML?
How are HTML, Dynamic HTML, and XML related?
Will it be necessary to compress XML for transmission over the Web?
How will XML be generated from existing databases?
What is a DTD? What is it used for?
Do Web developers have to include a DTD when they use XML to describe data?
What are XML schemas? How are they different from DTDs?
What are namespaces? Why are they important?

Extensible Stylesheet Language (XSL)

What is XSL? What can you do with XSL today?

XML-Data

What is XML-Data?

Standards

What is the relationship between XML and the World Wide Web Consortium?
What is the status of XML with the W3C?
What is the status of DOM with the W3C?
Where does XSL stand in the W3C?

XML vocabularies and data formats

What are XML vocabularies?
What is CDF?
What is OSD?
What is OFX?
What is RDF?

Tool support

What tools support XML today?
Where will the tools come from in the future?

Issues and solutions

Why is my document object still empty after I call the Load() method?
How do I load a document with foreign and special characters?
How do I use MSXML COM components in Visual C++ 6.0?
How do I use HTML Entities in my XML?
How is white space handled in element content?
How is white space handled for attributes?
How is white space handled in the XML object model?
What does the XML declaration do?
How do I print my XML document in a readable format?
How do I use namespaces in DTDs?
How do I use XMLDSO in Visual Basic?
How do I use the XML DOM with Java?

XML, the language

What is XML?

Extensible Markup Language (XML) is the universal language for data on the Web. It gives developers the power to deliver structured data from a wide variety of applications to the desktop for local computation and presentation. XML allows the creation of unique data formats for specific applications. It is also an ideal format for server-to-server transfer of structured data.

Does XML replace HTML?

Microsoft expects many authors and developers to use XML and HTML in tandem, for example by using XSL to generate HTML.

What are the benefits of adding XML to HTML?

There are many benefits to using XML on the Web:

It delivers data for local computation. Data delivered to the desktop is available for local computation. The data can be read by the XML parser, then delivered to a local application such as a browser for further viewing or processing. Or the data can be manipulated through script or other programming languages using the XML Object Model.
It gives users an appropriate view of structured data. Data delivered to the desktop can be presented in multiple ways. A local data set can be presented in the view that is right for the user, dynamically, based on factors such as user preference and configuration.
It enables the integration of structured data from multiple sources into common logical views. Typically, agents will be used to integrate data from server databases and other applications on a middle-tier server, making this data available for delivery to the desktop or to other servers for further aggregation, processing, and distribution.
It describes data from a wide variety of applications. Because XML is extensible, it can be used to describe data contained in a wide variety of applications, from describing collections of Web pages to data records. Because the data is self-describing, data can be received and processed without the need for a built-in description of the data.
It improves performance through granular updates. XML enables granular updating. Developers do not have to send the entire structured data set each time there is a change. With granular updating, only the changed element must be sent from the server to the client. The changed data can be presented without the need to refresh the entire page or table.

How does XML fit into the Microsoft Windows® Distributed interNet Applications (Windows DNA) strategy for building three-tier, Web-enabled applications?

XML is quickly becoming the vehicle for delivering structured data from the middle tier to the desktop. XML-based data can be integrated from multiple server (database) sources, using agents on the middle tier. Schemas (see the XML-Data section) can improve this process, as developers can describe and exchange data more precisely.

Where will XML be used on the Web?

Because XML describes data in a consistent, self-describing, open format, XML could potentially be used anywhere there is a need for data interchange and delivery. Microsoft expects that initially XML will be used to describe information about HTML pages, as is the case today with the channel definition format (CDF) for building Active Channel™ content, as well as future applications such as searching and distributed printing.

More important, because XML can describe data itself, it will be useful for delivering any kind of data, such as financial transactions, news updates, weather information, patient records, and legal libraries, to the desktop. Once on the desktop, applications can compute with the data and dynamically present the data.

Does Microsoft Internet Explorer 4 support XML?

Yes, Internet Explorer 4 supports XML. It supports the following features:

A generalized XML parser, which reads XML files and hands them off for processing to applications such as viewers. Microsoft has two parsers: the Microsoft XML parser in C++, a high-performance, non-validating parser that ships with Internet Explorer 4.0, and the Microsoft XML parser in Java, available for download from this site, for use by application developers.
The XML Object Model (XML OM) uses the W3C standard Document Object Model (DOM) to allow programmatic access to the structured data, through the XML parsers, giving developers the power to interact and compute on the data. For more information on the DOM, see http://www.w3.org/DOM/ .
The XML Data Source Object (XML DSO) allows developers to connect to structured XML data and supply it to the HTML page using Dynamic HTML's data binding facility.

What is the level of XML support in Microsoft® Internet Explorer 5?

Internet Explorer 5 has the following XML support:

Direct viewing of XML. The Microsoft XML implementation lets users view XML using XSL or Cascading Style Sheets (CSS) with their Web browser, just as they view HTML documents.
High-performance, validating XML engine. The XML engine familiar to Internet Explorer 4 developers has been substantially enhanced and fully supports W3C XML 1.0 and XML namespaces, which let developers qualify element names uniquely on the Web and thus avoid conflicts between elements with the same name. Native XML support for Windows users means that developers can count on the full XML processing capabilities being present to read and manipulate the data they move between their applications and components.
Extensible Stylesheet Language (XSL) support. With the Microsoft XSL processor, which is based on the latest W3C Working Draft, developers can apply style sheets to XML data and display the data in a dynamic and flexible way that can be easily customized. The querying capabilities of the Microsoft XSL processor also allow developers to programmatically find and extract information within an XML data set on the client or the server.
XML Schemas. Schemas define the rules of an XML document, including element names and rich data types, which elements can appear in combination, and which attributes are available for each element. In order to enable multi-tier applications, Microsoft will be releasing a technology preview for XML Schema based on the Schema submissions to the W3C XML working group.
Server-side XML. Server-side XML processing allows XML to be used as a standard means of passing data between multiple distributed application servers -- even across operating system boundaries.
XML Document Object Model (DOM). The DOM is a standard object application programming interface that gives developers programmatic control of XML document content, structure, formats, and more. The Microsoft XML implementation includes full support for the W3C XML DOM recommendation and is accessible from script, the Visual Basic development system, C++, and other languages.
C++ XML Data Source Object. This XML DSO allows you to bind HTML elements directly to an XML data island. In addition, it has increased performance, has a greater ability to bind to various XML nodes, and takes advantage of all the new data binding features within Microsoft® Internet Explorer 5.

What is the difference between SGML and XML?

The Standard Generalized Markup Language, or SGML (ISO 8879), is the international standard for defining descriptions of structure and content in electronic documents. XML is a simplified version of SGML; XML was designed to maintain the most useful parts of SGML. While SGML requires that structured documents reference a document type definition (DTD) to be valid, XML allows for "well-formed" data and can be delivered without a DTD. XML was designed so that SGML can be delivered, as XML, over the Web.

How are HTML, Dynamic HTML, and XML related?

HTML is used in conjunction with CSS to format and present hyperlinked pages. Dynamic HTML, through the Document Object Model, makes all elements in HTML accessible through language-independent scripting and other programming languages, thus dramatically increasing client-side interactivity without additional requests to the server. The page's object model allows any aspect of its content (including additions, deletions, and movement) to be changed dynamically.

By adding XML for structured data, developers have the technologies they need to build the next generation of rich, flexible Web applications. With XML, they can deliver structured data to the desktop and compute on the data via the XML Object Model. Today developers can display XML-based data in a browser, such as Microsoft Internet Explorer 4.0 and Microsoft Internet Explorer 5, or in other applications through scripting. In addition, they can also apply formatting rules to the data without complex scripting using XSL style sheets, which essentially transform the XML-based data into display. These two methods of displaying XML-based data make it possible to generate multiple views of complex data.

Will it be necessary to compress XML for transmission over the Web?

In general, the need to compress XML data will be application-dependent and largely a function of the amount of data being moved between the server and the client. XML compresses extremely well because of the repetitive nature of the tags used to describe the structure of the data. Benchmarks will be provided in the future to assist in determining whether compression is necessary. It is worth noting that compression is standard to HTTP 1.1 servers and clients, and XML will automatically benefit from this.

How will XML be generated from existing databases?

In general, this will be handled using a three-tier architecture. Agents will be built to run on the middle tier to access multiple existing database management systems (DBMSs) and output XML. XML enables the generation of common logical views on these databases. These agents will also support the ability to generate XML "updategrams" bidirectionally, that is, to inform the client of changes made to the data on the middle tier or database server, and vice versa. Consequently, the agents will be able to receive updategrams from the client and send updates to the DBMS.

What is a DTD? What is it used for?

The document type definition (DTD) defines the valid syntax of a class of XML documents. That is, it lists a number of element names, which elements can appear in combination with which other ones, what attributes are available for each element type, and so on. A DTD uses a different syntax from that used by XML documents.

Do Web developers have to include a DTD when they use XML to describe data?

No. XML can be used to describe data with or without a DTD. "Well-formed" XML obeys the syntax rules of the language (for example, all tags must be closed and tags may not overlap) and does not need a DTD. "Valid" XML is well-formed XML that also references a DTD and conforms to it. The addition of well-formed XML is one of the fundamental differences between XML and SGML.
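As a rough illustration of well-formedness without a DTD, the following sketch uses Python's standard xml.etree.ElementTree module (a non-validating parser, used here only as a stand-in for MSXML). The helper name is_well_formed and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents.
good = "<order><item>pen</item></order>"   # every tag closed, properly nested
overlapping = "<b><i>text</b></i>"          # tags overlap
unclosed = "<order><item>pen</order>"       # <item> is never closed

for sample in (good, overlapping, unclosed):
    print(sample, "->", is_well_formed(sample))
```

None of these documents references a DTD; only the first one is accepted, because only it follows the syntax rules of the language.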

What are XML schemas? How are they different from DTDs?

[From the W3C XML Activity Page at http://www.w3.org/XML/Activity.html ]

While XML 1.0 supplies a mechanism, the Document Type Definition (DTD), for declaring constraints on the use of markup, automated processing of XML documents requires more rigorous and comprehensive facilities in this area. Requirements are for constraints on how the component parts of an application fit together, the document structure, attributes, datatyping, and so on. The W3C XML Schema Working Group is addressing means for defining the structure, content and semantics of XML documents.

In Internet Explorer 5, Microsoft is providing a release of XML Schema as a technology preview that may be useful for developers interested in building prototypes and gaining experience with schema. This technology preview is based on the XML-Data note submitted to the W3C. XML Schema, as implemented in this technology preview, can be thought of as the subset of the XML-Data submission that corresponds to the feature set proposed for Document Content Description (DCD). Microsoft is actively involved in defining the emerging W3C XML schema standard and will track this effort. Developers should note that the version of XML Schema released with Internet Explorer 5 is subject to change.

What are namespaces? Why are they important?

The namespace facility is another advanced feature of XML, outlined in a W3C Working Draft. Namespaces allow developers to qualify uniquely the element names and relationships and to make these names recognizable. By doing so, they can avoid name collisions on elements that have the same name but are defined in different vocabularies. They allow tags from multiple namespaces to be mixed, which is essential if data is coming from multiple sources.

For example, a bookstore may define the <TITLE> tag as the title of a book, contained only within the <BOOK> element. A directory of people, however, might define <TITLE> as a person's position. Consider, for instance, <TITLE>President</TITLE>. Namespaces help define this distinction clearly.
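The distinction can be sketched in Python with the standard xml.etree.ElementTree module. The namespace URIs, the prefixes b and p, the wrapper element names, and the book title are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs for the two vocabularies.
BOOKS = "urn:example:bookstore"
PEOPLE = "urn:example:directory"
NS = {"b": BOOKS, "p": PEOPLE}

DOC = f"""<record xmlns:b="{BOOKS}" xmlns:p="{PEOPLE}">
  <b:BOOK><b:TITLE>A Tale of Two Cities</b:TITLE></b:BOOK>
  <p:PERSON><p:TITLE>President</p:TITLE></p:PERSON>
</record>"""

root = ET.fromstring(DOC)

# The same local name, TITLE, resolves to two different qualified names.
book_title = root.find("b:BOOK/b:TITLE", NS)
job_title = root.find("p:PERSON/p:TITLE", NS)

print(book_title.tag, "=", book_title.text)
print(job_title.tag, "=", job_title.text)
```

Because each TITLE is qualified by its namespace, both vocabularies can be mixed in one document without a name collision.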

Extensible Stylesheet Language (XSL)

What is XSL? What can you do with XSL today?

The W3C Working Draft for XSL divides the language into two main parts: transformation and formatting semantics. This release supports the transformation part of the W3C XSL specification. Microsoft is tracking the W3C Working Draft and will be updating this implementation to match the final W3C recommendation.

XSL is defined as an XML grammar that consists of a set of XSL elements. This grammar can be used to transform XML documents into HTML or XML documents.

You can use XSL for direct browsing of XML files and from the XML DOM. The XML DOM transformNode method supports the use of XSL Elements to perform transformations. The DOM selectNodes and selectSingleNode methods support the XSL pattern-matching syntax that enables sophisticated queries for nodes within a particular context of the overall tree structure.
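To give a feel for this kind of node query, here is a rough analogue in Python. It uses the limited path syntax of the standard xml.etree.ElementTree module, which is not the MSXML XSL pattern syntax; findall and find only play roles similar to selectNodes and selectSingleNode. The catalog data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set.
CATALOG = """<catalog>
  <book genre="fiction"><title>First Novel</title></book>
  <book genre="reference"><title>XML Handbook</title></book>
  <book genre="fiction"><title>Second Novel</title></book>
</catalog>"""

root = ET.fromstring(CATALOG)

# Roughly like selectNodes: every matching node, in document order.
fiction_titles = [t.text for t in root.findall("book[@genre='fiction']/title")]

# Roughly like selectSingleNode: the first matching node only.
first_title = root.find("book/title").text

print(fiction_titles)
print(first_title)
```

Both queries are evaluated relative to the node they are called on, which corresponds to the "particular context" mentioned above.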

XML-Data

What is XML-Data?

XML-Data, a specification that has been submitted to the W3C for review, makes XML even more powerful and extensible. It outlines a richer method of describing and validating data, making XML even more powerful for integrating data from multiple disparate sources and building three-tier Web applications.

In January 1998, the W3C acknowledged the Extensible Markup Language XML-Data submission from Microsoft, ArborText Inc., DataChannel Inc., and Inso Corp. The specification is available for public review at http://www.w3.org/TR/1998/NOTE-XML-data-0105/ or http://msdn.microsoft.com/standards/.

Standards

What is the relationship between XML and the World Wide Web Consortium?

The W3C has an active XML Working Group. Microsoft was one of the co-founders of this group in June 1996, and since then numerous industry players have joined, including Netscape Communications Corp, IBM and Oracle. For more information on the XML standards process, see http://www.w3.org/ .

What is the status of XML with the W3C?

XML version 1.0 recently moved from the proposed recommendation phase to the recommendation phase, which is the last step in the approval process at the W3C, and is a very stable standard. For more information on the current XML specification, and on the submission and review process within the W3C, see http://www.w3.org/ .

What is the status of DOM with the W3C?

The XML DOM recently moved from the proposed recommendation phase to the recommendation phase, which is the last step in the approval process at the W3C, and is a very stable standard. For more information on the current DOM specification, and on the submission and review process within the W3C, see http://www.w3.org/ .

Where does XSL stand in the W3C?

XSL is currently in the Working Draft stage in the W3C. It was submitted by ArborText, Inso, and Microsoft in September 1997. Microsoft plans to update its XSL code to track changes as it moves forward in the standard-development process.

XML vocabularies and data formats

What are XML vocabularies?

An XML vocabulary is the set of elements used in a particular application or data format; in effect, it is the definition of that format. For example, in Channel Definition Format (CDF), element names such as <Schedule>, <Channel>, and <Item> make up the vocabulary for describing collections of pages, when these pages should be downloaded, and so on. Vocabularies, along with the structural relationships between the elements, are defined in XML DTDs and XML-Data schemas.

What is CDF?

The channel definition format (CDF) is an XML-based data format used in Microsoft Internet Explorer 4.0 for describing Active Channel content and Active Desktop™ components. It is used by thousands of content developers and millions of end users to describe collections of pages and data about pages, such as channel bar display, download behavior, Web page usage, and page-hit logging. For more information on CDF, see the Content & Component Delivery section of the Web Workshop.

What is OSD?

The open software description (OSD) is an XML-based data format, fully supported in Microsoft Internet Explorer 4.01, for advertising and installing software components over the Internet. When new versions of software become available, OSD provides a mechanism to notify the user (a process referred to as publishing). In addition, OSD provides the functionality to describe in great detail how to install ActiveX® Controls, as well as Java packages and class files, adding functionality to the use of .INFs for setup. Microsoft and Marimba Inc. submitted this specification to the W3C in August 1997. For more information, see http://msdn.microsoft.com/standards/.

What is OFX?

The open financial exchange (OFX) is a data format that Microsoft Money and Intuit Quicken personal finance applications use to communicate with financial institutions over the Web. Although it is currently described using SGML, OFX will soon be based on XML.

What is RDF?

The resource description framework (RDF) is an XML-based application being developed under the direction of W3C. It brings together ideas from the meta content format, or MCF (technology acquired by Netscape from Apple Computer Inc.) and XML-Data (defined in a proposal recently submitted to the W3C by Microsoft, ArborText, DataChannel, and Inso).

RDF allows for generalized searching of information without application-specific rules, such as those defined in DTDs. RDF allows a complementary view of data through graphs and nodes, rather than through a structured tree, which the current XML technology enables. RDF, together with XML schemas, will provide a standard way for developers to write these relationships for broad classes of XML elements.

The crucial technologies that will deliver value this year and next are XML for structured data, XML namespaces to make names unique and recognizable, and new XML tags that add meaning to data, so smarter search engines can perform better searches.

Tool support

What tools support XML today?

Many vendors offer support for XML in their products today. See the XML Tools page for a listing of the top third-party vendors.

Where will the tools come from in the future?

Microsoft expects a wide variety of applications to be developed in the coming months that convert information currently stored in documents and databases into XML for delivery to the desktop. In addition, Microsoft expects XML-centric databases, rich authoring and application developer tools, and data format-specific tools such as wizards to be developed as new vocabularies are defined.

Issues and solutions

Why is my document object still empty after I call the Load() method?

By default, documents are loaded asynchronously. This means that if you provide an HTTP URL, the load() method returns immediately and your document object is still empty because the data has not come back from the server yet. To fix this, add the following line to your code before calling load():

xmldoc.async = false;

Also, if you are loading http XML documents from a standalone C++ application, you will have to query the message queue in order to continue downloading.

How do I load a document with foreign and special characters?

A document may contain foreign characters such as the following:

<test>foreign characters (úóíá) </test>

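Characters such as úóíá do not need an escape sequence, but the parser must know how they are encoded. Either save the document as UTF-8, which the parser assumes when no encoding is declared and no byte-order mark is present, or declare the encoding actually used in the XML declaration, as follows: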

<?xml version="1.0" encoding="iso-8859-1"?>
<test>foreign characters (úóíá) </test>

Now your XML will load correctly.
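The effect of the encoding declaration can be sketched in Python with the standard xml.etree.ElementTree module (used here as a stand-in for MSXML; its error messages differ). The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample text from above; the escapes spell the characters u-acute,
# o-acute, i-acute, and a-acute.
text = "foreign characters (\u00fa\u00f3\u00ed\u00e1) "
latin1_bytes = text.encode("iso-8859-1")

# The same ISO-8859-1 bytes, with and without an encoding declaration.
declared = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n<test>'
    + latin1_bytes
    + b"</test>"
)
undeclared = b"<test>" + latin1_bytes + b"</test>"

# With the declaration, the parser decodes the bytes correctly.
loaded_text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the bytes.
try:
    ET.fromstring(undeclared)
    undeclared_loaded = True
except ET.ParseError:
    undeclared_loaded = False

# Saving the document as UTF-8 also works without any declaration.
utf8_text = ET.fromstring(b"<test>" + text.encode("utf-8") + b"</test>").text

print(repr(loaded_text), undeclared_loaded, repr(utf8_text))
```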

Other characters are reserved in XML and also need to be handled differently. The following XML:

<foo>This & that</foo>

generates this error:

Whitespace is not allowed at this location.
Line 0000001: <foo>This & that</foo>
Pos  0000012: ----------^

The ampersand is part of the syntactic structure of XML and will not be interpreted as an ampersand if simply placed within an XML data source. You need to substitute a special character sequence called an "entity".

<foo>This &amp; that</foo>

The following characters require the corresponding entities:

Character	Entity
<	&lt;
&	&amp;
>	&gt;
"	&quot;
'	&apos;

Quote characters are used as delimiters for attribute values inside a tag, and therefore cannot always be used inside the value of an attribute. For example, the following will return an error:

<foo description='John's Stuff'>

The single quote is used both as an attribute delimiter and in the attribute value itself. To fix this, you can either switch to use a double quote for the attribute delimiter as follows:

<foo description="John's Stuff">

Or you can escape the single quote with the entity &apos; as follows:

<foo description='John&apos;s Stuff'>

Both of the above will return the attribute value John's Stuff via the getAttribute method in the XML object model. Similarly, for the double quote, you can use the entity &quot;.
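When XML is generated by a program, the escaping described above is usually delegated to a library. A minimal sketch in Python, assuming the standard xml.sax.saxutils helpers escape and quoteattr and using xml.etree.ElementTree to read the result back (neither is part of MSXML):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, <, and > in element content with entities.
content = escape("This & that")

# quoteattr() picks a delimiter that does not clash with the value and
# returns the value already wrapped in quotes.
attribute = quoteattr("John's Stuff")

document = f"<foo description={attribute}>{content}</foo>"
print(document)

# Reading the document back yields the original, unescaped strings.
root = ET.fromstring(document)
description = root.get("description")
body = root.text

# The hand-written &apos; form gives the same attribute value.
apos_value = ET.fromstring("<foo description='John&apos;s Stuff'/>").get("description")
```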

You can also handle special characters in element content by putting your text inside a CDATA section. The following is valid:

<xml>
  <![CDATA[ This & that <stuff> is just "text" content. ]]>
</xml>

In this example, the XML Object Model will show a CDATA node as a child of the xml node which will return the string

This & that <stuff> is just "text" content.

as the nodeValue.
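The same behavior can be observed with Python's standard xml.dom.minidom module, which exposes a W3C-style DOM. This is only a sketch against a different parser than MSXML, with the CDATA section written on one line so that it is the first child of the root element.

```python
from xml.dom import minidom

SOURCE = '<xml><![CDATA[ This & that <stuff> is just "text" content. ]]></xml>'

document = minidom.parseString(SOURCE)
node = document.documentElement.firstChild

# The parser keeps the section as a CDATA node rather than as markup.
is_cdata = node.nodeType == node.CDATA_SECTION_NODE
value = node.nodeValue

print(is_cdata, repr(value))
```

The ampersand and the <stuff> text come back as plain characters; no stuff element is created.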

How do I use MSXML COM components in Visual C++ 6.0?

The easiest way to use MSXML COM components in Visual C++ 6.0 is to use the #import directive:

#import "msxml.dll" named_guids no_namespace

This defines all the IXML* interfaces and interface IDs so you can use them in your application. You can also get the MSXML type libraries and header files, and the uuid.lib that contains the class IIDs from the INETSDK.

How do I use HTML Entities in my XML?

The following XML contains an HTML entity:

<copyright>Copyright &copy; 1999, Microsoft Inc, All rights reserved.</copyright>

It generates the following error:

Reference to undefined entity 'copy'. 
Line: 1, Position: 23, ErrorCode: 0xC00CE002
<copyright>Copyright &copy; 1999, ...
----------------------^

This is because XML has only five built-in entities. See How do I load a document with foreign and special characters? for more information about built-in entities.

To use HTML entities, you need to define them with a DTD. To find out more about DTDs, see the W3C XML Recommendation. To use the HTML entities DTD (htmlentities.dtd), reference it directly in a DOCTYPE declaration as follows:

<!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd">
<copyright>Copyright &copy; 1999, Microsoft Inc, All rights reserved.</copyright>

For this to load, you need to turn off the validateOnParse property of the IXMLDOMDocument interface. Try pasting this into the Validator Test Page, turning off DTD validation, and clicking Validate. Notice that the document loads and the copyright character is available in the DOM tree shown at the end of the validator page.

If you are already doing DTD validation, then you must include the HTML entities as a parameter entity in your existing DTD as follows:

<!ENTITY % HTMLENT SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd">
%HTMLENT;

This will define all the HTML entities so you can use them in your XML document.
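The underlying mechanism can be sketched in Python with the standard xml.etree.ElementTree module. Because that parser does not fetch external DTDs, this sketch declares a single entity in an internal DTD subset instead of referencing htmlentities.dtd; the one-entity DTD is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

undefined = "<copyright>Copyright &copy; 1999</copyright>"

# Same document, preceded by an internal DTD subset that declares &copy;
# as character 169 (the copyright sign).
defined = '<!DOCTYPE copyright [ <!ENTITY copy "&#169;"> ]>\n' + undefined

# Without a declaration, the reference is an error.
try:
    ET.fromstring(undefined)
    error = ""
except ET.ParseError as exc:
    error = str(exc)

# With the declaration, the reference expands to the character itself.
expanded = ET.fromstring(defined).text

print(error)
print(expanded)
```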

How is white space handled in element content?

The XML DOM has three properties for accessing the text content of elements:

Property	Behavior
nodeValue	Returns the original text content (including white space) on TEXT, CDATA, COMMENT, and PI nodes as specified in the original XML source. Returns null on ELEMENT nodes and on the DOCUMENT itself.
data	Same as nodeValue
text	Recursively concatenates multiple TEXT and CDATA nodes in a specified subtree and returns the combined result.

Note: White space consists of newline, tab, and space characters.

The nodeValue property always returns what is in the original document independent of how the document is loaded and current xml:space scope.

The text property concatenates all text in the specified subtree and expands entities. This is dependent upon how the document is loaded, the current state of the preserveWhiteSpace switch, and the current xml:space scope, as follows:

preserveWhiteSpace = true when the document is loaded

preserveWhiteSpace=true	preserveWhiteSpace=true	preserveWhiteSpace=false	preserveWhiteSpace=false
xml:space=preserve	xml:space=default	xml:space=preserve	xml:space=default
preserved	preserved	preserved	preserved and trimmed

preserveWhiteSpace = false when the document is loaded

preserveWhiteSpace=true	preserveWhiteSpace=true	preserveWhiteSpace=false	preserveWhiteSpace=false
xml:space=preserve	xml:space=default	xml:space=preserve	xml:space=default
half preserved	half preserved and trimmed	half preserved	half preserved and trimmed

Where preserved means the exact original text content as found in the original XML document, trimmed means the leading and trailing spaces have been removed, and half preserved means that "significant white space" is preserved and "insignificant white space" is normalized. Significant white space is white space inside of text content. Insignificant white space is white space between tags as follows:

<name>\n
\t<first>    Jane</first>\n
\t<last>Smith     </last>\n
</name>

In this example, the newline (\n) and tab (\t) characters between tags are insignificant white space and can be ignored, while the spaces before "Jane" and after "Smith" are significant white space: they are part of the text content and therefore cannot be ignored. So in this example, the text property returns the following results:

state	returned value
preserved	"\n\t    Jane\n\tSmith     \n"
preserved and trimmed	"Jane\n\tSmith"
half preserved	" Jane Smith "
half preserved and trimmed	"Jane Smith"

Notice that "half preserved" normalizes insignificant white space, for example, the newlines and tab characters are collapsed down into a single space character. You can change the xml:space attributes and the preserveWhiteSpace switch and the text property will return a different value accordingly.
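The "preserved" and "preserved and trimmed" results can be reproduced with Python's standard xml.etree.ElementTree module, which always keeps the original white space. This is only a sketch with a different parser: ElementTree has no preserveWhiteSpace switch and no "half preserved" mode, so those rows are not shown.

```python
import xml.etree.ElementTree as ET

# The example document, with real newline and tab characters.
SOURCE = "<name>\n\t<first>    Jane</first>\n\t<last>Smith     </last>\n</name>"

name = ET.fromstring(SOURCE)

# itertext() yields every piece of text in the subtree in document order,
# which is comparable to what the text property concatenates.
preserved = "".join(name.itertext())

# Trimming removes only the leading and trailing white space.
preserved_and_trimmed = preserved.strip()

print(repr(preserved))
print(repr(preserved_and_trimmed))
```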

CDATA and xml:space="preserve" subtree boundaries

In the following example, the contents of the CDATA node or the "preserved" node are concatenated as they are and do not participate in the insignificant white space normalization. For example:

<name>\n
\t<first> Jane </first>\n
\t<last><![CDATA[     Smith     ]]></last>\n
</name>

In this case, the white space inside the CDATA node is never "merged" with "insignificant" white space and is never trimmed. Therefore, the "half preserved and trimmed" case will return the following:

"Jane      Smith     "

Here, the insignificant white space between the </first> and <last> tags is included regardless of the contents of the CDATA node. The same result is returned if the CDATA is replaced with the following:

<last xml:space="preserve">     Smith     </last>

Entities are special

Entities are loaded and parsed as part of the DTD and appear under the DOCTYPE node. They do not necessarily have any xml:space scope. For example:

<!DOCTYPE foo [
<!ENTITY Jane "<employee>\n
\t<name> Jane </name>\n
\t<title>Software Design Engineer</title>\n
</employee>">
]>
<foo xml:space="preserve">&Jane;</foo>

Assuming that preserveWhiteSpace=false (in the scope of the DOCTYPE tag), the insignificant white space is lost when the entity is parsed. The entity will not have white space nodes. The tree will look like this:

DOCTYPE foo
    ENTITY: Jane
        ELEMENT: employee
            ELEMENT: name
                TEXT: Jane 
            ELEMENT: title
                TEXT: Software Design Engineer
ELEMENT: foo
    ATTRIBUTE: xml:space="preserve"
    ENTITYREF: Jane

Notice that the DOM tree exposed under the ENTITY node inside the DOCTYPE does not contain any WHITESPACE nodes. This means that the children of the ENTITYREF node will also have no WHITESPACE nodes even though the entity reference is in the scope of xml:space="preserve".

Every instance of an ENTITY referenced in a given document always has the identical tree.

If an entity absolutely must preserve white space, then it must specify its own xml:space attribute inside itself or the document preserveWhiteSpace switch must be set to true.

How is white space handled for attributes?

There are several ways of accessing an attribute value. The IXMLDOMAttribute interface has a value property, which is equivalent to nodeValue, and a text property, which is a Microsoft extension. These properties return the following:

Property	Text returned
attrNode.nodeValue, attrNode.value, getAttribute("name")	Returns exact content (with entities expanded) as found in the original document.
attrNode.nodeTypedValue	Null
attrNode.text	Same as nodeValue except the leading and trailing white space is trimmed.

The XML Language specification defines the following behavior for XML Applications:

Attribute type	Text returned
CDATA	half normalized
ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, enumeration	fully normalized

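Where half normalized means that newlines and tab characters are converted to spaces, but multiple spaces are not collapsed into one space. Fully normalized means that, in addition, leading and trailing spaces are removed and runs of spaces are collapsed into a single space.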
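The two kinds of normalization can be observed with Python's standard xml.etree.ElementTree module, whose underlying parser applies the attribute-value normalization rules of the XML specification. This is a sketch, not MSXML; the element name item and the attribute names note and refs are hypothetical.

```python
import xml.etree.ElementTree as ET

# "note" is undeclared, so it is treated as CDATA; "refs" is declared as
# IDREFS, one of the tokenized types. Both carry the same raw value, which
# contains a real newline, a real tab, and a run of two spaces.
SOURCE = (
    "<!DOCTYPE item [\n"
    "  <!ATTLIST item refs IDREFS #IMPLIED>\n"
    "]>\n"
    '<item note=" one\n\ttwo  three " refs=" one\n\ttwo  three "/>'
)

item = ET.fromstring(SOURCE)

half_normalized = item.get("note")    # newline and tab become spaces
fully_normalized = item.get("refs")   # also trimmed and collapsed

print(repr(half_normalized))
print(repr(fully_normalized))
```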

How is white space handled in the XML object model?

Sometimes the XML Object Model shows TEXT nodes that contain only white space characters. This can be confusing because white space is stripped most of the time. For example, the following XML:

<?xml version="1.0" ?>
<!DOCTYPE person [
  <!ELEMENT person (#PCDATA|lastname|firstname)*>
  <!ELEMENT lastname (#PCDATA)>
  <!ELEMENT firstname (#PCDATA)>
]>
<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>

Generates the following tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
    TEXT:
    ELEMENT: lastname
    TEXT:
    ELEMENT: firstname
    TEXT:

The first name and last name are surrounded by TEXT nodes containing only white space because the content model for the "person" element is MIXED; it contains the #PCDATA keyword. A MIXED content model indicates that the elements can have text interspersed between them. Therefore, the following is also valid:

<person>
My last name is <lastname>Smith</lastname> and my first name is
<firstname>John</firstname>
</person>

This results in the following similar-looking tree:

ELEMENT: person
    TEXT: My last name is
    ELEMENT: lastname
    TEXT: and my first name is
    ELEMENT: firstname
    TEXT:

Without the white space after the word "is" and before <lastname>, and the white space after the </lastname> and before the word "and", the sentence would be unintelligible. So, for MIXED content models, the combination of text, white space, and elements is relevant. For non-MIXED content models this is not the case.
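
To see where white-space-only TEXT nodes occur, the children of an element can be listed directly. The sketch below uses Python's xml.dom.minidom instead of MSXML. minidom does not consult the DTD and always keeps these nodes, so its output corresponds to the MIXED case described above. The dump_children helper is a hypothetical name introduced for this illustration.

```python
from xml.dom import minidom, Node

XML = """<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>"""

def dump_children(elem):
    """Return one (kind, detail) pair per child node of elem."""
    rows = []
    for child in elem.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            rows.append(("ELEMENT", child.tagName))
        elif child.nodeType == Node.TEXT_NODE:
            rows.append(("TEXT", child.data))
    return rows

person = minidom.parseString(XML).documentElement
rows = dump_children(person)
for kind, detail in rows:
    # repr() makes the white space inside TEXT nodes visible.
    print(kind + ": " + repr(detail))
```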

To make the white-space-only TEXT nodes go away, remove the #PCDATA keyword from the "person" element declaration:

<!ELEMENT person (lastname,firstname)>

which results in the following clean tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
    ELEMENT: lastname
    ELEMENT: firstname

What does the XML declaration do?

The XML declaration must be listed at the top of the XML document:

<?xml version="1.0" encoding="utf-8"?>

It specifies the following items:

The document is an XML document. This can be used by MIME sniffers to detect that a file is of type text/xml when the MIME type has been lost or has not been specified.
The document follows the XML 1.0 specification. This will be important in the future when XML has other versions.
The document character encoding. The encoding attribute is optional and defaults to UTF-8.

Note: The XML declaration must be the first line in an XML document, so the following XML file:

<!--HEADLINE="Dow closes as techs get hammered"-->
<?xml version="1.0"?>

generates the following parse error:

Invalid xml declaration.
Line 0000002:     <?xml version="1.0"?>
Pos  0000007: ------^

Note: The XML declaration is optional. If you need to put a comment or processing instruction at the top of the document, omit the XML declaration altogether. In that case the encoding falls back to the default, UTF-8.
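
The placement rule is enforced by other XML parsers as well. The following sketch uses Python's xml.dom.minidom, not MSXML, so the error text differs from the message shown above. The news root element and the try_parse helper are assumptions added to make the documents complete.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

HEADLINE = '<!--HEADLINE="Dow closes as techs get hammered"-->'

# Comment before the declaration: not well formed.
BAD = HEADLINE + '\n<?xml version="1.0"?>\n<news/>'
# Declaration first, comment second: well formed.
GOOD = '<?xml version="1.0"?>\n' + HEADLINE + '\n<news/>'

def try_parse(text):
    """Return None if text parses, otherwise the parser's error message."""
    try:
        minidom.parseString(text)
        return None
    except ExpatError as err:
        return str(err)

bad_error = try_parse(BAD)
good_error = try_parse(GOOD)
print("bad: ", bad_error)
print("good:", good_error)
```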

How do I print my XML document in a readable format?

When you generate an XML file by building a document from scratch using the DOM, everything is on a single line with no white space in between. This is the default behavior.

The default XSL style sheet built into Internet Explorer 5 displays and prints XML documents in a readable format. For example, if you have IE5 installed, try viewing the nospace.xml file. You should see the following tree display in your browser:

- <ORDER>
  - <ITEM NAME="123">
      <NAME>XYZ</NAME>
      <PRICE>12.56</PRICE>
    </ITEM>
  </ORDER>

No white space is inserted into the XML.

Printing readable XML is tricky, especially when a DTD defines different kinds of content models. In a mixed content model (#PCDATA), for instance, inserting white space can change the meaning of the content. Consider the following XML:

<B>E</B><I>lephant</I>

This must not be output as:

<B>E</B>
<I>lephant</I>

because then the word boundaries are no longer correct.

All this makes automatic printing problematic. If you do need to print readable XML, you can use the DOM to insert white space as text nodes in the appropriate places.
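
One way to insert white space as text nodes is sketched below in Python with xml.dom.minidom rather than MSXML. Because the parser cannot tell element-only content from mixed content without a DTD, the caller passes in the names of the elements that are safe to indent. The indent helper, its element_only parameter, and the WORD wrapper element are all hypothetical names introduced for this example.

```python
from xml.dom import minidom, Node

def indent(elem, element_only, level=0, step="  "):
    """Add white-space TEXT nodes around the child elements of elem.

    Only elements named in element_only are touched, so mixed
    content is never altered.
    """
    if elem.tagName not in element_only:
        return
    doc = elem.ownerDocument
    children = [c for c in elem.childNodes if c.nodeType == Node.ELEMENT_NODE]
    for child in children:
        # Newline plus indentation before each child element.
        elem.insertBefore(doc.createTextNode("\n" + step * (level + 1)), child)
        indent(child, element_only, level + 1, step)
    if children:
        # Newline plus indentation before the closing tag.
        elem.appendChild(doc.createTextNode("\n" + step * level))

order = minidom.parseString(
    '<ORDER><ITEM NAME="123"><NAME>XYZ</NAME><PRICE>12.56</PRICE></ITEM></ORDER>'
)
indent(order.documentElement, element_only={"ORDER", "ITEM"})
pretty = order.documentElement.toxml()
print(pretty)

# Mixed content is left alone because WORD is not listed as element-only.
word = minidom.parseString("<WORD><B>E</B><I>lephant</I></WORD>")
indent(word.documentElement, element_only=set())
unchanged = word.documentElement.toxml()
print(unchanged)
```

Passing the element names explicitly keeps the decision with the caller, who knows the content models; a heuristic based on the children alone would wrongly split the B and I elements in the second document.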

How do I use namespaces in DTDs?

To use a namespace in a DTD, declare it in the ATTLIST declaration of the element that uses it, as follows:

<!ELEMENT x:customer ANY >
<!ATTLIST x:customer xmlns:x CDATA #FIXED "urn:...">

The namespace declaration must be declared as #FIXED. Namespaces on attributes work the same way:

<!ELEMENT customer ANY >
<!ATTLIST customer
          x:value CDATA #IMPLIED
          xmlns:x CDATA #FIXED "urn:...">
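
The effect of a #FIXED namespace declaration can be checked with a namespace-aware parser. The following is a sketch in Python with xml.dom.minidom (built on expat), not MSXML, and it assumes the parser applies attribute defaults from the internal DTD subset. The URN urn:example:customers is a placeholder invented for this example.

```python
from xml.dom import minidom

XML = """<!DOCTYPE x:customer [
<!ELEMENT x:customer ANY>
<!ATTLIST x:customer xmlns:x CDATA #FIXED "urn:example:customers">
]>
<x:customer/>"""

# The document element carries no xmlns:x attribute of its own;
# the binding for the x prefix comes from the ATTLIST default.
root = minidom.parseString(XML).documentElement
print(root.tagName, root.namespaceURI)
```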

Namespaces and XML Schemas

DTDs and XML Schemas cannot be mixed. For example, the following declaration

xmlns:x CDATA #FIXED "x-schema:myschema.xml"

will not result in the use of schema definitions defined in myschema.xml. The use of DTDs and XML Schemas is mutually exclusive.

How do I use XMLDSO in Visual Basic?

Using the following XML as an example:

<contacts>
 <person>
  <name>Mark Hanson</name> 
  <telephone>206 765 4583</telephone> 
 </person>
 <person>
  <name>Jane Smith</name> 
  <telephone>425 808 1111</telephone> 
 </person>
</contacts>

You can bind to an ADO Recordset as follows:

Create a new VB 6.0 project.
Add references to Microsoft ActiveX Data Objects 2.1 or later, the Microsoft Data Adapter Library, and Microsoft XML, version 2.0.

Load the XML data into an XML DSO control using the following code:

Dim dso As New XMLDSOControl
Dim doc As IXMLDOMDocument
Set doc = dso.XMLDocument
' Load synchronously so the data is available before binding.
doc.async = False
doc.Load ("d:\test.xml")

Map the DSO into a new Recordset object using a DataAdapter with the following code:

Dim da As New DataAdapter
Set da.Object = dso
Dim rs As New ADODB.Recordset
Set rs.DataSource = da

Access the data:
```
MsgBox rs.Fields("name").Value
```
This displays the string "Mark Hanson".
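
For readers without Visual Basic, the row-and-field view that the binding produces can be approximated in Python. This is only a conceptual sketch using xml.dom.minidom; it does not use XMLDSO or ADO, and the to_records and text_of helpers are hypothetical names.

```python
from xml.dom import minidom

XML = """<contacts>
 <person>
  <name>Mark Hanson</name>
  <telephone>206 765 4583</telephone>
 </person>
 <person>
  <name>Jane Smith</name>
  <telephone>425 808 1111</telephone>
 </person>
</contacts>"""

def text_of(elem):
    """Concatenate the TEXT children of an element."""
    return "".join(c.data for c in elem.childNodes if c.nodeType == c.TEXT_NODE)

def to_records(xml_text):
    """Turn each child element of the root into a dict of field name to text."""
    root = minidom.parseString(xml_text).documentElement
    records = []
    for row in root.childNodes:
        if row.nodeType != row.ELEMENT_NODE:
            continue  # skip white-space TEXT nodes between rows
        records.append({
            field.tagName: text_of(field)
            for field in row.childNodes
            if field.nodeType == field.ELEMENT_NODE
        })
    return records

records = to_records(XML)
print(records[0]["name"])
```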

How do I use the XML DOM with Java?

The IE5 version of MSXML.DLL must have already been installed. In Visual J++ 6.0, from the Project menu, select Add COM Wrapper, and choose "Microsoft XML 1.0" from the list of COM objects. This builds the required Java wrappers into a new package called "msxml". These pre-built Java wrappers are also available for download. The classes can be used as follows:

import com.ms.com.*;
import msxml.*;

public class Class1
{
  public static void main (String[] args)
  {
    DOMDocument doc = new DOMDocument();
    doc.load(new Variant("file://d:/samples/ot.xml"));
    System.out.println("Loaded " + doc.getDocumentElement().getNodeName());
  }
}

The code sample loads a 3.8 MB test file, "ot.xml", from the Sun religion example. The Variant class is used for wrapping the Win32 VARIANT primitive type.

You cannot use pointer comparisons on the nodes since each time you retrieve a node you actually get a new wrapper. So, rather than using the following code,

IXMLDOMNode root1 = doc.getDocumentElement();
IXMLDOMNode root2 = doc.getDocumentElement();
if (root1 == root2)...

use the following instead:

if (ComLib.isEqualUnknown(root1, root2))...

The total size of the .class wrappers is about 160 KB. However, to be fully compliant with the W3C specification, you should use only the IXMLDOM* wrappers. The following classes are old IE 4.0 XML interfaces and can be deleted from the msxml folder:

IXMLAttribute*
IXMLDocument*, XMLDocument*
IXMLElement*
IXMLError*
IXMLElementCollection*
tagXMLEMEM_TYPE*
_xml_error*

This brings the size down to 147 KB. You may also want to delete the following additional items:

DOMFreeThreadedDocument
Accesses the XML document from multiple threads in a Java application.
XMLHttpRequest
Communicates with servers using XML DAV HTTP extensions.
IXTLRuntime
Defines the XSL style sheet scripting object.
XMLDSOControl
Binds to XML data in an HTML page.
XMLDOMDocumentEvents
Returns callbacks during parsing.

This brings the size down to 116 KB. To get it even smaller, consider the fact that the DOM itself comes in two layers: a core layer consisting of:

DOMDocument, IXMLDOMDocument
IXMLDOMNode*
IXMLDOMNodeList*
IXMLDOMNamedNodeMap*
IXMLDOMDocumentFragment*
IXMLDOMImplementation
IXMLDOMParseError

and DTD information that you probably want to keep:

IXMLDOMDocumentType
IXMLDOMEntity
IXMLDOMNotation

All nodes in an XML document are of type IXMLDOMNode, which provides complete functionality, but higher level wrappers exist for each node type. Therefore, all the following interfaces can also be deleted if you modify the DOMDocument wrapper and change these specific types to use IXMLDOMNode instead:

IXMLDOMAttribute
IXMLDOMCDATASection
IXMLDOMCharacterData
IXMLDOMComment
IXMLDOMElement
IXMLDOMProcessingInstruction
IXMLDOMEntityReference
IXMLDOMText

Deleting these brings the size down to 61 KB. However, with IXMLDOMElement, the getAttribute and setAttribute methods are useful. Otherwise, you will need to use:

IXMLDOMNode.getAttributes().setNamedItem(...)
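
The difference between the two routes is the same in any W3C DOM implementation. The sketch below uses Python's xml.dom.minidom rather than the msxml Java wrappers; the order and item element names are made up for the illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<order><item/><item/></order>")
first, second = doc.getElementsByTagName("item")

# Element-level convenience method: one call.
first.setAttribute("id", "42")

# Node-level route: create the attribute node, then add it to the map.
attr = doc.createAttribute("id")
attr.value = "42"
second.attributes.setNamedItem(attr)

first_xml = first.toxml()
second_xml = second.toxml()
print(first_xml)
print(second_xml)
```

Both elements end up with the same attribute; the node-level route simply takes more steps.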
