Microsoft XML 2.5 SDK


 

Authoring Well-Formed HTML

[This is preliminary documentation and subject to change.]

Well-formed HTML simply means HTML that conforms to the rules of XML. This means that the same HTML tags are available, but the stricter XML syntax is required. An XSL style sheet is itself XML and thus any HTML within it must be well-formed.

Beyond its use within an XSL style sheet, well-formed HTML is worth authoring for its own sake. Much of the industry is moving toward well-formedness as a way to increase the robustness of the Web, while simplifying and accelerating the processing of well-formed documents and data. Well-formedness has great advantages for authoring tools and can benefit hand authoring by ensuring that the markup is unambiguous. The industry expectation is that a future HTML standard will be an XML application.

The price for these benefits is that a less-forgiving syntax must be used.

Writing well-formed HTML is really quite simple. Here are the main points you should watch for as you author or convert to well-formed HTML.

All Tags Must Be Closed

HTML allows certain end tags to be optional, the most common being <P>, <LI>, <TR>, and <TD>. XML requires all tags to be closed explicitly.

HTML:

<P>This is an HTML paragraph.
<P>or two.

Well-formed HTML:

<P>This is an HTML paragraph.</P>
<P>or two.</P>

Leaf nodes (elements that have no content and no end tag in HTML) must also be closed, by placing a forward slash (/) just before the closing angle bracket of the tag. The most common examples are <BR>, <HR>, <INPUT>, and <IMG>.

HTML:

<IMG src="sample.gif"
     width="10" height="20">

Well-formed HTML:

<IMG src="sample.gif"
     width="10" height="20" />
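
One way to confirm that markup follows these closing rules is to run it through an XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module as the parser. The is_well_formed helper and the sample strings are illustrative and are not part of the SDK.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style markup: end tags for <P> are omitted and <BR> is left open.
html_style = "<BODY><P>This is an HTML paragraph.<P>or two.<BR></BODY>"

# Well-formed markup: every element is closed, and the empty <BR /> closes itself.
xml_style = "<BODY><P>This is an HTML paragraph.</P><P>or two.</P><BR /></BODY>"

print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```

Any conforming XML parser should reject the first string, so the same check could be made with whichever parser is already at hand.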

No Overlapping Tags Are Allowed

XML does not allow start and end tags to overlap; it enforces a strict hierarchy within the document.

HTML:

<B>Well <I>Hello</B> Dolly!</I>

Well-formed HTML:

<B>Well</B> <I><B>Hello</B> Dolly!</I>

Case Matters

Use the same case for the start tag and end tag of each element. Which case to use isn't specified at this point, as long as it is consistent between start and end tags. The examples generally use uppercase for HTML elements.

HTML:

<B><i>Hello Dolly!</I></b>

Well-formed HTML:

<B><I>Hello Dolly!</I></B>
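
Overlapping tags and case mismatches look different to an author, but to an XML parser both are simply an end tag that does not match the most recently opened start tag. The sketch below illustrates this, assuming Python's standard xml.etree.ElementTree module as the parser. The parse_error helper is hypothetical, and each sample is wrapped in a <P> element so that it has a single root.

```python
import xml.etree.ElementTree as ET


def parse_error(markup: str):
    """Return the parser's error message, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as err:
        return str(err)


# Each sample is wrapped in <P>...</P> so the only problem is the one under test.
overlapping = "<P><B>Well <I>Hello</B> Dolly!</I></P>"
nested = "<P><B>Well</B> <I><B>Hello</B> Dolly!</I></P>"
mixed_case = "<P><B><i>Hello Dolly!</I></b></P>"
same_case = "<P><B><I>Hello Dolly!</I></B></P>"

for label, markup in [
    ("overlapping", overlapping),
    ("nested", nested),
    ("mixed case", mixed_case),
    ("same case", same_case),
]:
    # Prints the error text for the rejected samples and None for the others.
    print(label, "->", parse_error(markup))
```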

Quote Your Attributes

All attributes must be surrounded by quotation marks, either single or double.

HTML:

<IMG src=sample.gif
     width=10 height=20 >

Well-formed HTML:

<IMG src='sample.gif'
     width="10" height="20" />
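
As a rough illustration of the quoting rule, the sketch below parses both forms with Python's standard xml.etree.ElementTree module (an assumption; any XML parser would serve). The unquoted sample is given a closing slash so that quoting is the only difference between the two. Single and double quotation marks yield the same attribute values.

```python
import xml.etree.ElementTree as ET

# Unquoted attribute values; the tag is otherwise closed correctly.
unquoted = "<IMG src=sample.gif width=10 height=20 />"

# Quoted attribute values, mixing single and double quotation marks.
quoted = """<IMG src='sample.gif'
     width="10" height="20" />"""

try:
    ET.fromstring(unquoted)
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False

# The quoted form parses, and the quotation marks are not part of the values.
attributes = ET.fromstring(quoted).attrib

print(unquoted_ok)  # False
print(attributes)   # {'src': 'sample.gif', 'width': '10', 'height': '20'}
```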

Use a Single Root

Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed.

HTML:

<TITLE>Funky markup</TITLE>
<BODY>
  <P>Amazing that this HTML works.</P>
</BODY>

Well-formed HTML:

<HTML>
  <HEAD>
    <TITLE>Clean markup</TITLE>
  </HEAD>
  <BODY>
    <P>Not nearly so amazing that 
    this well-formed HTML works.</P>
  </BODY>
</HTML>

Fewer Built-in Entities

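XML defines only a minimal set of built-in character entities:

&lt;    (<)
&amp;   (&)
&gt;    (>)
&quot;  (")
&apos;  (')

Other named HTML entities, such as &nbsp;, are not built in. Numeric character entities are supported. The list of numeric values for entities can be found at HTML Character Sets.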
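
When converting existing HTML, named entities other than XML's predefined ones (lt, gt, amp, quot, and apos) can be rewritten as numeric character entities. The following is a minimal sketch of such a conversion in Python. It assumes the HTML entity table in the standard html.entities module as the source of numeric values, and to_numeric_entities is a hypothetical helper name, not an SDK function.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML predefines; these are left untouched.
XML_BUILT_INS = {"lt", "gt", "amp", "quot", "apos"}


def to_numeric_entities(markup: str) -> str:
    """Rewrite named HTML entities (for example &nbsp;) as numeric ones (&#160;)."""

    def replace(match):
        name = match.group(1)
        if name in XML_BUILT_INS or name not in name2codepoint:
            return match.group(0)  # built-in or unknown name: leave as written
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


source = "<P>Caf&eacute;&nbsp;&amp; bar &copy; 1999</P>"
converted = to_numeric_entities(source)

print(converted)
# <P>Caf&#233;&#160;&amp; bar &#169; 1999</P>

# The converted markup now parses as XML without any entity declarations.
paragraph = ET.fromstring(converted)
```

A simple pattern like this one does not distinguish entities inside CDATA sections or comments, so it is best treated as a starting point for a conversion pass and not as a complete tool.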

Escape Script Blocks

Script blocks in HTML can contain unparseable characters, namely < and &. These need to be escaped in well-formed HTML by using character entities, or by enclosing the script block in a CDATA section.

In addition, JScript® (compatible with the ECMA 262 language specification) comments terminate at the end of the line, so preserving the white space within script blocks that contain comments is important. Under the default value of the xml:space attribute, white space is normalized by compressing adjacent white space characters into a single space. This destroys the new line that terminates the JScript comment. Any JScript following the comment is then treated as part of the comment and ignored, often resulting in script errors. Enclosing the script in a CDATA section also ensures that the white space is preserved.

The following HTML script block contains both an unparseable character (<) and JScript comments. The well-formed script block uses CDATA to encapsulate the script.

HTML:

<SCRIPT>
  // checks a number against 7
  function lessThanSeven(n) {
    return n < 7;
  }
</SCRIPT>

Well-formed HTML:

<SCRIPT><![CDATA[
  // checks a number against 7
  function lessThanSeven(n) {
    return n < 7;
  }
]]></SCRIPT>
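
The two escaping options described above can be compared with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the XML parser and xml.sax.saxutils.escape for the character-entity approach. The function is named lessThanSeven here because hyphens are not valid in JScript identifiers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

script = """
  // checks a number against 7
  function lessThanSeven(n) {
    return n < 7;
  }
"""

# Option 0: the script is embedded as-is; the bare < makes the markup ill-formed.
raw = "<SCRIPT>" + script + "</SCRIPT>"

# Option 1: wrap the script in a CDATA section (it must not contain "]]>").
cdata = "<SCRIPT><![CDATA[" + script + "]]></SCRIPT>"

# Option 2: replace &, <, and > with character entities.
escaped = "<SCRIPT>" + escape(script) + "</SCRIPT>"

try:
    ET.fromstring(raw)
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# Both escaped forms hand the parser's client the original script text,
# including the line breaks that terminate the // comment.
cdata_text = ET.fromstring(cdata).text
escaped_text = ET.fromstring(escaped).text

print(raw_ok)                # False
print(cdata_text == script)  # True
print(escaped_text == script)  # True
```

Of the two, the CDATA form keeps the script readable in the source, which is one reason to prefer it for hand-authored pages.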

While not all scripts fail when left unescaped, escaping them in this way as a matter of habit is highly recommended. This ensures not only that the script works if it currently contains characters that need escaping or comments, but also that it continues to work if they are added in the future.