White space handling

Unlike HTML, which in most cases ignores white space (spaces, tabs, newlines, and so on), XML is designed for data and can therefore retain all white space through the reserved xml:space attribute. White space handling in XML is not that simple, however, so this topic provides additional information and examples.

White space in element content

The XML DOM has three properties for accessing the text content of elements:

Property	Behavior
nodeValue	Returns the original text content (including white space) on TEXT, CDATA, COMMENT, and PI nodes as specified in the original XML source. Returns null on ELEMENT nodes and on the DOCUMENT itself.
data	Same as nodeValue
text	Recursively concatenates multiple TEXT and CDATA nodes in a specified subtree and returns the combined result.

Note: White space consists of space, tab, and newline characters (the XML specification also counts a carriage return as white space).

The nodeValue property always returns what is in the original document, independent of how the document is loaded and of the current xml:space scope.
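As a rough illustration of the three properties in the table above, the following Python sketch uses the standard library's xml.dom.minidom rather than MSXML. minidom has no text property and no preserveWhiteSpace switch, so text_of is a hypothetical helper that mimics only the simplest case, where every character is kept. The sample document is an assumption chosen for illustration.

```python
from xml.dom import minidom

# A small sample document; \n and \t are real newline and tab characters here.
SOURCE = "<name>\n\t<first>    Jane</first>\n\t<last>Smith     </last>\n</name>"


def text_of(node):
    """Hypothetical stand-in for the text property: join all TEXT and CDATA descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


doc = minidom.parseString(SOURCE)
name = doc.documentElement
first_text = name.getElementsByTagName("first")[0].firstChild

print(repr(name.nodeValue))                     # ELEMENT node: nodeValue is None
print(repr(first_text.nodeValue))               # TEXT node: original text, white space included
print(first_text.data == first_text.nodeValue)  # data mirrors nodeValue
print(repr(text_of(name)))                      # every TEXT node in the subtree, concatenated
```

The helper recurses through the subtree, which is why it can return text for an ELEMENT node even though nodeValue cannot.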

The text property concatenates all text in the specified subtree and expands entities. The result is dependent upon how the document is loaded, the current state of the preserveWhiteSpace switch, and the current xml:space scope, as follows:

preserveWhiteSpace = true when the document is loaded

preserveWhiteSpace=true	preserveWhiteSpace=true	preserveWhiteSpace=false	preserveWhiteSpace=false
xml:space=preserve	xml:space=default	xml:space=preserve	xml:space=default
preserved	preserved	preserved	preserved and trimmed

preserveWhiteSpace = false when the document is loaded

preserveWhiteSpace=true	preserveWhiteSpace=true	preserveWhiteSpace=false	preserveWhiteSpace=false
xml:space=preserve	xml:space=default	xml:space=preserve	xml:space=default
half preserved	half preserved and trimmed	half preserved	half preserved and trimmed

Where preserved means the exact original text content as found in the original XML document, trimmed means the leading and trailing spaces have been removed, and half preserved means that "significant white space" is preserved and "insignificant white space" is normalized. Significant white space is white space inside of text content. Insignificant white space is white space between tags as follows:

<name>\n
\t<first>    Jane</first>\n
\t<last>Smith     </last>\n
</name>

In this example, the newline and tab characters between the tags (shown as \n and \t) are insignificant white space and can be ignored, while the spaces before "Jane" and after "Smith" are significant white space because they are part of the text content and therefore cannot be ignored. So in this example, the text property returns the following results:

state	returned value
preserved	"\n\t    Jane\n\tSmith     \n"
preserved and trimmed	"Jane\n\tSmith"
half preserved	" Jane Smith "
half preserved and trimmed	"Jane Smith"

Notice that "half preserved" normalizes insignificant white space: for example, the newline and tab characters are collapsed into a single space character. If you change the xml:space attributes or the preserveWhiteSpace switch, the text property returns a different value accordingly.
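The four states can be approximated with a toy model in Python. This is a sketch under stated assumptions, not the MSXML implementation: it assumes that a white-space-only TEXT node between tags counts as insignificant, and that any run of white space touching such a node collapses to one space. The helper names (text_chunks, preserved, half_preserved) are hypothetical, and xml.dom.minidom is used only to obtain the TEXT nodes.

```python
import re
from xml.dom import minidom

# The <name> sample from this section, with \n and \t as real characters.
SOURCE = "<name>\n\t<first>    Jane</first>\n\t<last>Smith     </last>\n</name>"
WS = " \t\r\n"
MARK = "\x00"  # stands in for one white-space-only TEXT node between tags


def text_chunks(node):
    """Yield the data of every TEXT node in the subtree, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child.data
        else:
            yield from text_chunks(child)


def preserved(node):
    """Exact original text content."""
    return "".join(text_chunks(node))


def half_preserved(node):
    """Assumed model: insignificant white space, plus any white space next to it, becomes one space."""
    marked = "".join(MARK if chunk.strip(WS) == "" else chunk for chunk in text_chunks(node))
    return re.sub(r"[ \t\r\n]*\x00[ \t\r\n\x00]*", " ", marked)


name = minidom.parseString(SOURCE).documentElement
states = {
    "preserved": preserved(name),
    "preserved and trimmed": preserved(name).strip(WS),
    "half preserved": half_preserved(name),
    "half preserved and trimmed": half_preserved(name).strip(WS),
}
for state, value in states.items():
    print(f"{state}: {value!r}")
```

Under these assumptions the model yields " Jane Smith " and "Jane Smith" for the two half preserved states. Other documents may behave differently in MSXML, so treat the model as an aid for reading the table rather than a specification.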

CDATA and xml:space="preserve" subtree boundaries

The contents of a CDATA node, or of a node in the scope of xml:space="preserve", are concatenated exactly as they are and do not participate in the insignificant white space normalization. For example:

<name>\n
\t<first> Jane </first>\n
\t<last><![CDATA[     Smith     ]]></last>\n
</name>

In this case, the white space inside the CDATA node is never "merged" with "insignificant" white space and is never trimmed. Therefore, the "half preserved and trimmed" case will return the following:

"Jane      Smith     "

Here, the insignificant white space between the </first> and <last> tags is included independent of the contents of the CDATA node. The same result is returned if the CDATA is replaced with the following:

<last xml:space="preserve">     Smith     </last>
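The CDATA boundary rule can be sketched in Python as follows. As before, this is a toy model built on xml.dom.minidom, not MSXML itself. It assumes that CDATA content is copied verbatim, that white space runs are normalized only within the TEXT nodes between CDATA sections, and that trimming applies only to the outermost non-CDATA text. All helper names are hypothetical.

```python
import re
from xml.dom import minidom

SOURCE = "<name>\n\t<first> Jane </first>\n\t<last><![CDATA[     Smith     ]]></last>\n</name>"
WS = " \t\r\n"


def segments(node):
    """Yield (data, is_cdata) for every TEXT and CDATA node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.CDATA_SECTION_NODE:
            yield child.data, True
        elif child.nodeType == child.TEXT_NODE:
            yield child.data, False
        else:
            yield from segments(child)


def normalize(run):
    """Collapse white space around white-space-only TEXT nodes to a single space."""
    marked = "".join("\x00" if chunk.strip(WS) == "" else chunk for chunk in run)
    return re.sub(r"[ \t\r\n]*\x00[ \t\r\n\x00]*", " ", marked)


def half_preserved_and_trimmed(node):
    parts, run = [], []  # parts alternates: text, CDATA, text, ..., text
    for data, is_cdata in segments(node):
        if is_cdata:
            parts += [normalize(run), data]  # CDATA content is copied verbatim
            run = []
        else:
            run.append(data)
    parts.append(normalize(run))
    # Only the outermost non-CDATA text is trimmed; CDATA is never trimmed.
    parts[0] = parts[0].lstrip(WS)
    parts[-1] = parts[-1].rstrip(WS)
    return "".join(parts)


name = minidom.parseString(SOURCE).documentElement
result = half_preserved_and_trimmed(name)
print(repr(result))
```

In this model the single space after "Jane" merges with the insignificant white space that follows it, while the five spaces on each side of "Smith" survive untouched because they sit inside the CDATA section.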

Entities are special

Entities are loaded and parsed as part of the DTD and appear under the DOCTYPE node. They do not necessarily have any xml:space scope. For example:

<!DOCTYPE foo [
<!ENTITY Jane "<employee>\n
\t<name> Jane </name>\n
\t<title>Software Design Engineer</title>\n
</employee>">
]>
<foo xml:space="preserve">&Jane;</foo>

Assuming that preserveWhiteSpace=false (in the scope of the DOCTYPE tag), the insignificant white space is lost when the entity is parsed. The entity will NOT have white space nodes. The tree will look like this:

DOCTYPE foo
    ENTITY: Jane
        ELEMENT: employee
            ELEMENT: name
                TEXT: Jane 
            ELEMENT: title
                TEXT: Software Design Engineer
ELEMENT: foo
    ATTRIBUTE: xml:space="preserve"
    ENTITYREF: Jane

Notice that the DOM tree exposed under the ENTITY node inside the DOCTYPE does NOT contain any WHITESPACE nodes. This means that the children of the ENTITYREF node will also have no WHITESPACE nodes even though the entity reference is in the scope of xml:space="preserve".

Every instance of an ENTITY referenced in a given document always has the identical tree.

If an entity absolutely must preserve white space, either the entity must specify its own xml:space attribute inside itself, or the document's preserveWhiteSpace switch must be set to true.

White space in attributes

There are several ways of accessing an attribute value. The IXMLDOMAttribute interface has a nodeValue property, a value property, which is equivalent to nodeValue, and a text property, which is a Microsoft extension. These properties return the following:

Property	Text returned
attrNode.nodeValue, attrNode.value, getAttribute("name")	Returns exact content (with entities expanded) as found in the original document.
attrNode.nodeTypedValue	Null
attrNode.text	Same as nodeValue except the leading and trailing white space is trimmed.

The XML Language specification defines the following behavior for XML Applications:

Attribute type	Text returned
CDATA	half normalized
ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, enumeration	fully normalized

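Where half normalized means that newlines and tab characters are converted to spaces, but multiple spaces are not collapsed into one space. Fully normalized means that, in addition, leading and trailing spaces are removed and each run of spaces is collapsed into a single space.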
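The two normalization levels can be written out as a short Python sketch. The function names half_normalize and fully_normalize are hypothetical, and the sketch assumes that line ends have already been normalized to a single newline. The last lines parse a document without a DTD using xml.dom.minidom, where every attribute is treated as CDATA, to show the half normalized result coming from a real parser.

```python
import re
from xml.dom import minidom


def half_normalize(value):
    """Turn each tab, carriage return, and newline into a space; keep runs of spaces."""
    return re.sub(r"[\t\r\n]", " ", value)


def fully_normalize(value):
    """Half normalize, then collapse runs of spaces and drop leading and trailing spaces."""
    return re.sub(r" +", " ", half_normalize(value)).strip(" ")


raw = " a\n\tb  c "
print(repr(half_normalize(raw)))   # behavior for CDATA attributes
print(repr(fully_normalize(raw)))  # behavior for ID, IDREF, NOTATION, and similar types

# With no DTD, every attribute is CDATA, so the parser half normalizes the value.
element = minidom.parseString('<item note=" a\n\tb  c "/>').documentElement
print(repr(element.getAttribute("note")))
```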

White space handling with the XML object model

Sometimes the XML Object Model shows TEXT nodes that contain only white space characters. This can be confusing, because most of the time white space is stripped. For example, the following XML document:

<?xml version="1.0" ?>
<!DOCTYPE person [
  <!ELEMENT person (#PCDATA|lastname|firstname)*>
  <!ELEMENT lastname (#PCDATA)>
  <!ELEMENT firstname (#PCDATA)>
]>
<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>

Generates the following tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
    TEXT: 
    ELEMENT: lastname
    TEXT: 
    ELEMENT: firstname
    TEXT:

The first name and last name are surrounded by TEXT nodes containing only white space because the content model for the "person" element is MIXED; it contains the #PCDATA keyword. A MIXED content model indicates that the elements can have text interspersed between them. Therefore, the following is also valid:

<person>
My last name is <lastname>Smith</lastname> and my first name is
<firstname>John</firstname>
</person>

And this results in the following similar looking tree:

ELEMENT: person
    TEXT: My last name is
    ELEMENT: lastname
    TEXT: and my first name is
    ELEMENT: firstname
    TEXT:

Without the white space after the word "is" and before <lastname>, and the white space after </lastname> and before the word "and", the sentence would be unintelligible. So, for MIXED content models, the combination of text, white space, and elements is relevant. For non-MIXED content models this is not the case.

To make the white-space-only TEXT nodes go away, remove the #PCDATA keyword from the "person" element declaration:

<!ELEMENT person (lastname,firstname)>

which results in the following clean tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
    ELEMENT: lastname
    ELEMENT: firstname
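The difference between the two trees can be reproduced by hand in Python. Note the assumptions: xml.dom.minidom does not consult the DTD and keeps every white-space-only TEXT node regardless of the content model, so strip_white_space_only_text is a hypothetical helper that does manually what this section describes for an element-only content model. The DOCTYPE is omitted from the sample for brevity.

```python
from xml.dom import minidom

SOURCE = "<person>\n  <lastname>Smith</lastname>\n  <firstname>John</firstname>\n</person>"


def describe(node):
    """Return one label per child node, in the style of the trees shown above."""
    labels = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            labels.append(f"ELEMENT: {child.tagName}")
        elif child.nodeType == child.TEXT_NODE:
            labels.append(f"TEXT: {child.data!r}")
    return labels


def strip_white_space_only_text(node):
    """Hypothetical helper: remove child TEXT nodes that contain only white space."""
    for child in list(node.childNodes):  # copy, because the list changes while removing
        if child.nodeType == child.TEXT_NODE and child.data.strip(" \t\r\n") == "":
            node.removeChild(child)


person = minidom.parseString(SOURCE).documentElement
before = describe(person)  # three white-space-only TEXT nodes around two elements
strip_white_space_only_text(person)
after = describe(person)   # only the two elements remain

print(before)
print(after)
```

Applying the helper to a MIXED content element would be a mistake, because there the white space between words and elements carries meaning.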
