Making XML Data Self-Describing

[This is preliminary documentation and subject to change.]

Today with XML, a Document Type Definition (DTD) can accompany a document and define its rules, such as which elements may appear and how those elements relate to one another structurally. DTDs help to validate the data when the receiving application does not have a built-in description of the incoming data. With XML, however, DTDs are optional.

Data that is sent along with a DTD and conforms to it is known as valid XML. In this case, an XML parser can check incoming data against the rules defined in the DTD to make sure the data is structured correctly. Data sent without a DTD is known as well-formed XML, provided that it follows the basic syntax rules of XML. Here an XML-based document instance, such as the hierarchically structured weather data shown in Writing Well-Formed XML Documents, implicitly describes itself.
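The difference between data that is well-formed and data that is not can be sketched with a short Python example. The weather element names below are hypothetical and are not taken from Writing Well-Formed XML Documents. The standard-library parser used here checks well-formedness only; it does not validate against a DTD, so checking validity would require a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather data; the element names are illustrative only.
WELL_FORMED = """<weather-report>
  <city>Seattle</city>
  <temperature units="F">52</temperature>
</weather-report>"""

# The same kind of data with a mismatched closing tag, so it is not well-formed.
MALFORMED = """<weather-report>
  <city>Seattle</temperature>
</weather-report>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))
print(is_well_formed(MALFORMED))
```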

With both valid and well-formed XML, XML-encoded data is self-describing because descriptive tags are intermingled with the data. The open and flexible format used by XML allows it to be employed anywhere information needs to be exchanged or transferred, which is the source of much of its power.
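As a minimal sketch of what self-describing means in practice, the following Python code walks a document and reports every element it finds, using nothing but the tags in the data. The sample document and the helper name describe are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the receiving code has no built-in knowledge of it.
SAMPLE = """<weather-report>
  <city>Seattle</city>
  <temperature units="F">52</temperature>
</weather-report>"""


def describe(xml_text):
    """Return (path, attributes, text) for each element, discovered from the tags alone."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        text = (elem.text or "").strip()
        rows.append((here, dict(elem.attrib), text))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return rows


for row in describe(SAMPLE):
    print(row)
```

The function never mentions city or temperature by name. The structure it reports comes entirely from the document.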

For instance, XML can be used to describe information about HTML pages, or it can be used to describe data contained in business rules or objects in an electronic-commerce transaction, such as invoices, purchase orders, and order forms. Because XML is independent of HTML, XML can also be embedded in HTML documents. The World Wide Web Consortium (W3C) has defined a format by which XML-based data, or XML data islands, can be encapsulated in HTML pages. By embedding XML data inside an HTML page, multiple views can be generated from the delivered data, using the semantic information contained in the XML. XML can also be used for applications such as distributed printing and database searches.
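The idea of generating multiple views from the same delivered data can be sketched as follows. This example does not show the data island embedding mechanism itself. It only assumes that some XML order data has been delivered, and the element names, company names, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical purchase-order data; the names and values are illustrative only.
DATA = """<orders>
  <order id="1001"><customer>Contoso</customer><total>250.00</total></order>
  <order id="1002"><customer>Fabrikam</customer><total>75.50</total></order>
</orders>"""


def as_html_table(xml_text):
    """View 1: render every order as a row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = "".join(
        f"<tr><td>{escape(o.get('id', ''))}</td>"
        f"<td>{escape(o.findtext('customer', ''))}</td>"
        f"<td>{escape(o.findtext('total', ''))}</td></tr>"
        for o in root.findall("order")
    )
    return f"<table>{rows}</table>"


def as_summary(xml_text):
    """View 2: a one-line summary computed from the same data."""
    root = ET.fromstring(xml_text)
    totals = [float(o.findtext("total", "0")) for o in root.findall("order")]
    return f"{len(totals)} orders, total {sum(totals):.2f}"


print(as_html_table(DATA))
print(as_summary(DATA))
```

Both views rely on the element names to locate the values they need, which is the semantic information that plain HTML markup would not carry.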

A schema is a formal specification of the rules of an XML document: it names the elements and indicates which elements are allowed in a document and in what combinations. New schema languages, such as those defined in the XML-Data and Document Content Description (DCD) proposals submitted to the W3C, provide the same functionality as a DTD. However, because these schema languages are extensible, developers can augment them with additional information, such as data types, inheritance, and presentation rules. This makes these new schema languages far more powerful than DTDs.

With XML-Data and DCD, Microsoft and others have proposed vocabularies for expressing the schema for an XML document using XML itself. This allows XML data to describe its own structure. Expressing schemata within XML adds great power to the XML format, because it is then possible for software examining certain data to understand its structure without having any prior built-in descriptions of the data's structure.

Using a schema, an author can define precisely which element names are permitted in a document and, within each element, which subelements, attributes, and relations are allowed. An author can import fragments from other schemata, and extend types through inheritance. This allows complex relationships between elements, while retaining the simplicity of a lexical tree structure.

Authors can invent their own schemata, or they can share ones created by other authors. Readers can check the schema references to verify that the document they have received is of the correct type. They can also use the information in the schema to validate the structure of the document automatically.
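The following Python sketch illustrates the general idea of a schema that is itself written in XML and then used to validate a document automatically. The schema vocabulary here (schema, elementType, child, attribute) is invented for this example and is not the XML-Data or DCD syntax. The check is also deliberately narrow: it covers only which elements and attributes are permitted, not ordering, cardinality, or data types.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema vocabulary, invented for this sketch (not XML-Data or DCD).
SCHEMA = """<schema root="weather-report">
  <elementType name="weather-report">
    <child name="city"/>
    <child name="temperature"/>
  </elementType>
  <elementType name="city"/>
  <elementType name="temperature">
    <attribute name="units"/>
  </elementType>
</schema>"""


def load_schema(schema_text):
    """Read the schema with the same XML parser that reads the documents."""
    root = ET.fromstring(schema_text)
    rules = {}
    for et in root.findall("elementType"):
        rules[et.get("name")] = {
            "children": {c.get("name") for c in et.findall("child")},
            "attributes": {a.get("name") for a in et.findall("attribute")},
        }
    return root.get("root"), rules


def validate(doc_text, schema_text):
    """Return a list of error strings; an empty list means the document conforms."""
    root_name, rules = load_schema(schema_text)
    doc = ET.fromstring(doc_text)
    errors = []
    if doc.tag != root_name:
        errors.append(f"root element <{doc.tag}> is not <{root_name}>")

    def check(elem):
        rule = rules.get(elem.tag)
        if rule is None:
            errors.append(f"element <{elem.tag}> is not declared")
            return
        for attr in elem.attrib:
            if attr not in rule["attributes"]:
                errors.append(f"attribute '{attr}' not allowed on <{elem.tag}>")
        for child in elem:
            if child.tag not in rule["children"]:
                errors.append(f"<{child.tag}> not allowed inside <{elem.tag}>")
            else:
                check(child)

    check(doc)
    return errors


GOOD = '<weather-report><city>Seattle</city><temperature units="F">52</temperature></weather-report>'
BAD = '<weather-report><city>Seattle</city><humidity>80</humidity></weather-report>'

print(validate(GOOD, SCHEMA))
print(validate(BAD, SCHEMA))
```

Because the schema is ordinary XML, the validating code needs no built-in description of weather reports. Supplying a different schema text changes what it accepts.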