Well-formed HTML simply means HTML that conforms to the rules of XML. This means that the same HTML tags are available, but the stricter XML syntax is required. An XSL style sheet is itself XML and thus any HTML within it must be well-formed.
Beyond its use within an XSL style sheet, well-formed HTML is worth authoring for its own sake. Much of the industry is moving toward well-formedness as a way to increase the robustness of the Web while simplifying and accelerating the processing of documents and data. Well-formedness has great advantages for authoring tools and can benefit hand authoring by ensuring that the markup is unambiguous. The industry expectation is that a future HTML standard will be an XML application.
The price for these benefits is that a less-forgiving syntax must be used.
Writing well-formed HTML is straightforward. Here are the main points to watch for as you author or convert to well-formed HTML.
HTML allows the end tags of certain elements to be omitted, the most common being <P>, <LI>, <TR>, and <TD>. XML requires every start tag to have an explicit matching end tag.
HTML | Well-formed HTML |
|
|
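As a rough illustration of the rule on closing tags, the sketch below hands two versions of a list to an XML parser. It assumes Python's standard-library xml.etree.ElementTree as a stand-in for any XML processor; the helper name is_well_formed and the sample markup are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: the </LI> end tags are omitted (illustrative markup).
html_form = "<UL><LI>First item<LI>Second item</UL>"
# Well-formed style: every start tag has a matching end tag.
well_formed = "<UL><LI>First item</LI><LI>Second item</LI></UL>"

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True
```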
Empty elements (leaf nodes with no content) must also be closed, by placing a forward slash (/) at the end of the tag, as in <BR/>. The most common examples are <BR>, <HR>, <INPUT>, and <IMG>.
HTML | Well-formed HTML |
|
|
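The same kind of check can illustrate empty elements. This is a minimal sketch under the same assumptions as before: Python's xml.etree.ElementTree stands in for an XML processor, and the helper name, markup, and file name logo.gif are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: <BR> and <IMG> are never closed, so </P> does not match.
html_form = '<P>Line one<BR>Line two<IMG SRC="logo.gif"></P>'
# Well-formed style: the trailing slash closes each empty element.
well_formed = '<P>Line one<BR/>Line two<IMG SRC="logo.gif"/></P>'

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True

# The parser sees both empty elements as children of <P>.
child_tags = [child.tag for child in ET.fromstring(well_formed)]
print(child_tags)                   # ['BR', 'IMG']
```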
XML does not allow start and end tags to overlap; it enforces a strict hierarchy in which each element is closed before its parent is closed.
HTML | Well-formed HTML |
|
|
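Overlapping tags can be checked the same way. The sketch below is illustrative only: it assumes Python's xml.etree.ElementTree as the parser, and the helper name and markup are hypothetical. One possible repair is shown, which closes the inner element and reopens it outside the outer one.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: <B> is closed while <I> is still open, so the tags overlap.
html_form = "<P><B>bold <I>bold italic</B> italic</I></P>"
# Well-formed style: <I> is closed inside <B>, then reopened after it.
well_formed = "<P><B>bold <I>bold italic</I></B><I> italic</I></P>"

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True
```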
Choose a consistent case for start and end tags. XML is case-sensitive, so <P> and </p> do not match. Which case to use isn't specified at this point, as long as it is consistent between start and end tags. The examples generally use upper case for HTML elements.
HTML | Well-formed HTML |
|
|
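Case sensitivity is easy to demonstrate with a parser. The following sketch assumes Python's xml.etree.ElementTree as the XML processor; the helper name and markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: start and end tags differ in case.
html_form = "<P>Some <B>bold</b> text</p>"
# Well-formed style: either case works if start and end tags agree.
upper_case = "<P>Some <B>bold</B> text</P>"
lower_case = "<p>Some <b>bold</b> text</p>"

print(is_well_formed(html_form))   # False
print(is_well_formed(upper_case))  # True
print(is_well_formed(lower_case))  # True

# The parser keeps the case it was given; it does not fold names.
print(ET.fromstring(upper_case).tag)  # P
```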
All attribute values must be enclosed in quotation marks, either single or double.
HTML | Well-formed HTML |
|
|
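To illustrate attribute quoting, the sketch below parses a table tag with and without quotation marks. It assumes Python's xml.etree.ElementTree as the parser; the helper name, markup, and attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: attribute values are not quoted.
html_form = "<TABLE BORDER=1 WIDTH=100%><TR><TD>Cell</TD></TR></TABLE>"
# Well-formed style: double and single quotation marks are both accepted.
well_formed = """<TABLE BORDER="1" WIDTH='100%'><TR><TD>Cell</TD></TR></TABLE>"""

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True

# Both quoting styles yield plain string values.
attributes = ET.fromstring(well_formed).attrib
print(attributes["BORDER"], attributes["WIDTH"])  # 1 100%
```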
Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed. A well-formed document has exactly one root element that contains all other elements.
HTML | Well-formed HTML |
|
|
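The single-root requirement can be sketched as follows, again assuming Python's xml.etree.ElementTree as the XML processor. The helper name and the page content are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: <HTML> and <HEAD> are left out, leaving two top-level elements.
html_form = "<TITLE>Example</TITLE><BODY><P>Some text</P></BODY>"
# Well-formed style: a single <HTML> element contains everything else.
well_formed = (
    "<HTML><HEAD><TITLE>Example</TITLE></HEAD>"
    "<BODY><P>Some text</P></BODY></HTML>"
)

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True
```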
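XML defines only a minimal set of built-in character entities: &lt; (<), &gt; (>), &amp; (&), &quot; ("), and &apos; (').
Numeric character entities are supported, for example &#160; for a non-breaking space. The list of numeric values for entities can be found at HTML Character Sets.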
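As an illustration of the entity rules, the sketch below parses a paragraph that uses the HTML-only entity &nbsp; and then the numeric form &#160;. It assumes Python's xml.etree.ElementTree as the parser and no DTD that would declare extra entities; the helper name and sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: &nbsp; is not one of the entities built into XML.
html_form = "<P>Fish&nbsp;&amp;&nbsp;chips</P>"
# Well-formed style: the numeric form &#160; replaces &nbsp;,
# while &amp; is built in and needs no change.
well_formed = "<P>Fish&#160;&amp;&#160;chips</P>"

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True

# The parser decodes both kinds of reference into plain characters.
text = ET.fromstring(well_formed).text
print(text == "Fish\u00a0&\u00a0chips")  # True
```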
Script blocks in HTML can contain characters that an XML parser cannot accept as literal text, namely < and &. These need to be escaped in well-formed HTML by using character entities, or by enclosing the script block in a CDATA section.
In addition, single-line (//) comments in JScript® (compatible with the ECMA 262 language specification) terminate at the end of the line, so preserving the white space within script blocks that contain comments is important. The default value of the xml:space attribute normalizes white space by compressing adjacent white space characters into a single space. This destroys the new line that terminates the JScript comment, so any JScript following the comment is treated as part of the comment and ignored, often resulting in script errors. Enclosing the script in a CDATA section also ensures that the white space is preserved.
The following HTML script block contains both an unparseable character (<) and JScript comments. The well-formed script block uses CDATA to encapsulate the script.
HTML | Well-formed HTML |
|
|
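The effect of CDATA on a script block can be sketched with a parser as well. The example assumes Python's xml.etree.ElementTree as the XML processor; the helper name and the small script (including its variable names) are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: the bare < in the loop condition is rejected by an XML parser.
html_form = """<SCRIPT>
// Add up the numbers below the limit.
for (var i = 0; i < 10; i++) { total += i; }
</SCRIPT>"""

# Well-formed style: the CDATA section hides < from the parser and
# keeps the line break that ends the // comment.
well_formed = """<SCRIPT><![CDATA[
// Add up the numbers below the limit.
for (var i = 0; i < 10; i++) { total += i; }
]]></SCRIPT>"""

print(is_well_formed(html_form))    # False
print(is_well_formed(well_formed))  # True

# The script text comes back unchanged, line breaks included.
script_text = ET.fromstring(well_formed).text
print("i < 10" in script_text)      # True
print(script_text.count("\n"))      # 3
```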
While not all scripts will fail if they are not escaped in this way, it is highly recommended that you do so as a matter of habit. This ensures not only that the script works if it currently contains comments or characters that need escaping, but also that it continues to work if such characters are added in the future.