A Technical Perspective of XML

Microsoft Corporation

June 23, 1997
Revised: February 20, 1998

Introduction

The Web has placed in our hands the potential to communicate with anyone, anywhere. Fully realizing its potential depends on widespread use of standards, because, as with the telephone, this communication depends on numerous layers of interoperating technology. One such important layer is visual display and user interface, exemplified by standards such as Hypertext Markup Language (HTML), Graphics Interchange Format (GIF), and Microsoft® JScript™. These standards allow a page to be created once, yet displayed at different times by many receivers.

Although visual and user-interface standards are a necessary layer, they are not sufficient for representing and managing data. Today, the Internet is merely an access medium to text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard. It must set an information understanding standard: a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot do this, because HTML is a format that describes how a Web page should look, and does not represent data. For example, HTML does not:

Provide a standard way for a doctor to send a prescription to a pharmacist.
Enable a medical laboratory to publish statistical information in a format that any receiver can analyze.
Describe an electronic payment in a form that any recipient can decode and process.
Provide a standard way to search legal libraries, for example, to find all litigation documents about a certain topic.
Specify how information in a company catalog can be transmitted, such that a salesman can work offline, show the catalog to clients, take orders, then upload those orders in a standard format.

In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage data as data.

A standard for data representation will expand the Internet in much the same way that the HTML standard for display did a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Medical histories, pharmaceutical research data, semiconductor part sheets, and purchase orders all will be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. The data standard is XML and XML extensions.

This paper shows how XML can be used as a standard format for data, and is based on proposals currently before the World Wide Web Consortium (W3C) standards organization, including recent proposals from Microsoft Corporation.

XML Support in Internet Explorer 4.0

Microsoft Internet Explorer 4.0 comes equipped with a C++ parser for the XML language, and supports the XML object model. The XML object model allows developers to interact with and manipulate each XML element, as elements are exposed as objects. In addition, Microsoft has made available an XML Data Source Object (DSO), which uses the data-binding facility in Dynamic HTML to display XML as HTML.

To help developers get started using XML as a data format, Microsoft has posted XML resources in the Extensible Markup Language (XML) section of the Microsoft Site Builder Network Web site at http://www.microsoft.com/xml/.

Other XML resources, including the Technology Preview Release of MSXSL, the Microsoft Extensible Style Sheet (XSL) processor, can be found at http://www.microsoft.com/xml/xsl/. The XSL Processor allows developers to transform XML data to HTML, through a style sheet that defines presentation rules.

XML: A Standard Format for Data

XML provides a data standard that can encode the content, semantics, and schemata for a wide variety of cases ranging from simple to complex. XML can be used to mark up the following:

An ordinary document.
A structured record, such as an appointment record or purchase order.
An object, with data and methods (for example, the persistent form of a Java object or an ActiveX® control).
A data record, such as the result set of a query.
Metacontent about a Web site (such as Channel Definition Format (CDF) information).
Graphical presentation (such as an application's user interface).
Standard schema entities and types.
All the links between information and people on the Web.

The flexibility of a single data-representation format allows any software to determine the semantics of a data element. Information can then be reused for new purposes and in novel contexts. For example, a record from a database of restaurants and a record from a client contact database might both be reused in the context of an appointment—for example, in setting a lunch date with a client. The relationships between the restaurant and contact data do not reside in the schema data described by either database individually, but are extensions defined by the instance of the appointment.

XML provides a structural representation of data that can be implemented broadly and developed easily. Industrial implementations in the SGML community and elsewhere demonstrate the intrinsic quality and industrial strength of the XML tree-structured data format.

Benefits of XML

XML provides a powerful, flexible format for expressing data—whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk.

As a universal data format, XML data delivered to the desktop can be manipulated locally through script using the XML Object Model, and can be made available for delivery to local applications. Also, structured data from multiple sources can be integrated both on a middle-tier server and locally on the client.

To enable this integration between databases and applications, structured data in XML can be made self-describing with either Document Type Definitions (DTDs) or the more powerful XML-Data schemas. Schemas and DTDs both describe the structure of the data when the data does not already carry a built-in description.

Since the data is now separate from the presentation, the same XML data could be presented in multiple ways on the desktop. For instance, an XML-based purchase order could display a highly detailed view of a transaction to the purchasing agent and a much simpler view to the consumer.

For Web sites, XML offers a mechanism for adding metadata or metacontent to HTML. For example, the proposed Channel Definition Format (CDF) is an application of XML that specifies collections of Web pages. This allows a Web site to publish existing HTML content as a channel for "push" clients.

XML also provides a means for embedding data within HTML, extending the possibilities for Web-based applications based on HTML and scripts. Once embedded within an HTML page, XML data can be updated without having to refresh the entire page. Through this process of granular updating, XML allows the content of HTML pages to be more efficient and more dynamic.

For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet. For example, because XML enables publishers to supplement their Web sites with metadata such as CDF, users can receive "pushed" content as structured channels.

As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these intranet and extranet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web, because it enables such a wide array of business applications to be implemented on the Internet.

XML Syntax

XML is a text-based format, similar to HTML in many respects, designed especially to store and transmit data. An XML source is made up of XML elements, each of which consists of a "start tag" (such as <title>), an "end tag" (such as </title>), and the information between the two tags (referred to as the contents). Like HTML, an XML document holds text annotated by tags. However, unlike HTML, XML allows an unlimited set of tags, each indicating not how something should look, but what something means. For example, an XML element might be tagged as a price, an order number, or a name. It is up to each document's author to determine what kind of data to use and which tag names are most descriptive.

Let's look at some XML:

<order>
  <sold-to>
    <person>
      <lastname>Layman</lastname>
      <firstname>Andrew</firstname>
    </person>
  </sold-to>
  <sold-on>19970317</sold-on>
  <item>
    <price>5.95</price>
    <book>
      <title>Number, the Language of Science</title>
      <author>Dantzig, Tobias</author>
    </book>
  </item>
  <item>
    <price>12.95</price>
    <book>
      <title>Introduction to Objectivist Epistemology</title>
      <author>Rand, Ayn</author>
      <isbn>0-452-01030-6</isbn>
    </book>
  </item>
  <item>
    <price>12.95</price>
    <record>
      <title><composer>Tchaikovsky</composer>'s First Piano Concerto</title>
      <artist>Janos</artist>
    </record>
  </item>
  <item>
    <price>1.50</price>
    <coffee>
      <size>small</size>
      <style>cafe macchiato</style>
    </coffee>
  </item>
</order>

Rather than describing the order and fashion in which the data should be displayed, the tags indicate what each item of data means (whether it is a <title> element, an <author> element, and so forth). Any receiver of this data can then decode the document and use it for its own purposes. For example, the bookstore might use it to fill the order, a market analyst might use many similar orders to discover which books are most popular, and an individual might file it as a record of purchases.
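As a minimal sketch of how two receivers might put the same order to different uses, the following Python fragment reads an abbreviated copy of the order with the standard library's xml.etree.ElementTree parser. The helper names order_total and book_titles are illustrative assumptions, not part of any proposal discussed in this paper.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Abbreviated copy of the order shown above (two of the four items).
ORDER_XML = """
<order>
  <sold-on>19970317</sold-on>
  <item>
    <price>5.95</price>
    <book>
      <title>Number, the Language of Science</title>
      <author>Dantzig, Tobias</author>
    </book>
  </item>
  <item>
    <price>1.50</price>
    <coffee>
      <size>small</size>
      <style>cafe macchiato</style>
    </coffee>
  </item>
</order>
"""


def order_total(order):
    """Bookstore view: add up every <price> in the order."""
    return sum(Decimal(p.text) for p in order.iter("price"))


def book_titles(order):
    """Analyst view: collect only the titles of books."""
    return [book.findtext("title") for book in order.iter("book")]


order = ET.fromstring(ORDER_XML)
print(order_total(order))   # 7.45
print(book_titles(order))   # ['Number, the Language of Science']
```

Neither function needs to know how the order will be displayed; each simply selects the elements whose names carry the meaning it cares about.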

XML also supports text markup, in which an element's text contains tags in the middle of the text flow. These usually indicate something special about the text's meaning. For example:

<title><composer>Tchaikovsky</composer>'s First Piano Concerto</title>

Here, the purpose of the <composer> element was not to separate "Tchaikovsky" from the rest of the record's title, but to indicate that, in addition to being part of the title, it is also the composer's name. Anyone looking for composers would search on such tags, while anyone processing the data (looking for the record's title) would skip those tags to arrive at the complete title.
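The two readings described above can be sketched in a few lines of Python. This is an illustration using the standard xml.etree.ElementTree module, not a prescribed way of processing text markup.

```python
import xml.etree.ElementTree as ET

title = ET.fromstring(
    "<title><composer>Tchaikovsky</composer>'s First Piano Concerto</title>"
)

# A reader looking for composers searches on the tag.
composer = title.findtext("composer")

# A reader looking for the record's title skips the tags and joins all text.
full_title = "".join(title.itertext())

print(composer)     # Tchaikovsky
print(full_title)   # Tchaikovsky's First Piano Concerto
```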

Because this is an order from a bookstore, the element names reflect bookstore terminology. However, if you looked at an XML document containing medical research data, you would find experiments, temperatures, dosages, results, and so forth. Each kind of document has terms, and therefore elements, specific to its needs.

Schemata in XML: Making XML Data Self-Describing

A schema is a formal specification of the rules for an XML document: it names the elements and indicates which elements are allowed in a document and in what combinations. Schemas, as defined in the XML-Data proposal submitted to the W3C, provide the same functionality as a DTD. However, because schemas are written in XML and, through the capabilities defined in the XML-Data specification, are extensible, developers can augment schemas with additional information, such as data types, inheritance, and presentation rules. This makes schemas far more powerful than DTDs.

Using a schema, an author can define precisely which element names are permitted in the document and, within each element, the subelements, attributes, and relations that are allowed. An author can import fragments from other schemata, and extend types through inheritance. This allows complex relationships between elements, while retaining the simplicity of a lexical tree structure.

Authors can invent their own schemata, or they can share ones created by other authors. Readers can check the schema references to verify that the document they have received is the correct type. They can also use the information in the schema to validate the structure of the document automatically.
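To make the idea of automatic structural validation concrete, here is a deliberately small Python sketch. It does not implement XML-Data or DTD validation; the rule table ALLOWED_CHILDREN and the function structure_errors are hypothetical stand-ins for what a real schema processor would derive from a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> names allowed as direct children.
ALLOWED_CHILDREN = {
    "order": {"sold-to", "sold-on", "item"},
    "sold-to": {"person"},
    "person": {"lastname", "firstname"},
    "item": {"price", "book"},
    "book": {"title", "author", "isbn"},
}


def structure_errors(element, path=""):
    """Return messages for every child the rule table does not allow."""
    here = f"{path}/{element.tag}"
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    errors = []
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> is not allowed inside {here}")
        errors.extend(structure_errors(child, here))
    return errors


good = ET.fromstring("<order><sold-on>19970317</sold-on></order>")
bad = ET.fromstring("<order><price>5.95</price></order>")
print(structure_errors(good))  # []
print(structure_errors(bad))   # ['<price> is not allowed inside /order']
```

A real schema would also constrain attributes, ordering, and data types; the sketch checks only which elements may appear inside which.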

With XML-Data, Microsoft and others have proposed a syntax for expressing the schema for an XML document directly within XML itself. This would allow XML data to describe its own structure. Expressing schemata within XML adds great power to the XML format, because it is then possible for software examining certain data to understand its structure without having any prior built-in descriptions of the data's structure.

Note: Schemas are not supported at this time. However, they will be supported by future versions of Microsoft Internet Explorer.

XML and HTML Complement Each Other

HTML is about user interface; XML is about data. Dynamic HTML describes display and user interaction; XML describes information. This leads to a natural relationship between HTML and XML, for XML can add information to an HTML document and HTML can display information expressed in XML format.

Displaying XML Data in HTML

An XML document does not by itself specify whether or how its information should be displayed. The XML data merely contains the facts (such as who ordered which books at which prices). HTML is an ideal display language for presenting this data to an end user. For example, an employee of an online bookstore may visit a Web page to find a list of order entries. On the back end, the individual data records are expressed in XML. However, on the front end, they are presented to the employee as an HTML page. In order to construct this Web page, either the Web server or the Web browser will need to convert the XML data records into an HTML presentation, such as a table.
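A server-side conversion of this kind might look like the following Python sketch. The function name order_to_html_table and the choice of columns are assumptions made for illustration; the sample records are taken from the order shown earlier.

```python
import xml.etree.ElementTree as ET
from html import escape

ORDER_XML = """
<order>
  <item><price>5.95</price>
    <book><title>Number, the Language of Science</title>
      <author>Dantzig, Tobias</author></book></item>
  <item><price>12.95</price>
    <book><title>Introduction to Objectivist Epistemology</title>
      <author>Rand, Ayn</author></book></item>
</order>
"""


def order_to_html_table(order):
    """Render each book item of an order as one row of an HTML table."""
    rows = ["<tr><th>Title</th><th>Author</th><th>Price</th></tr>"]
    for item in order.iter("item"):
        book = item.find("book")
        if book is None:
            continue  # skip items that are not books
        cells = (
            book.findtext("title", ""),
            book.findtext("author", ""),
            item.findtext("price", ""),
        )
        rows.append(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


html_table = order_to_html_table(ET.fromstring(ORDER_XML))
print(html_table)
```

The XML records stay free of presentation details; only this conversion step knows that the result should be a table.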

The mechanisms of data binding and style can be used to arrange XML data into a visual presentation, and to add interactivity. Data binding is an aspect of Dynamic HTML that moves individual items of data from an information source (such as an XML document) into an HTML display, allowing HTML to be used as a template for displaying XML data. This is similar to a "mail merge" in word processing. Microsoft currently ships an XML Data Source Object (XML DSO) as part of Internet Explorer 4.0. That XML DSO can be invoked declaratively through the <applet> tag.

XSL (Extensible Style Sheet Language) can add even greater power to this process. An XSL style sheet is a collection of rules for how to pull information out of an XML document and transform it into another format, such as HTML. The transformation is expressed declaratively, which often makes it easier and more accessible than scripting. In addition, XSL uses XML as its syntax, freeing XML authors from having to learn another markup language. CSS can still be used for simply structured XML data, and we anticipate that it will be useful in such situations. However, CSS cannot produce a display structure that deviates from the structure of the data source. With XSL, it is possible to generate presentation structures (in HTML, for instance) that are very different from the original XML data structures.

For example, an XSL style sheet could specify that a bookstore order should contain the <sold-to> name and <sold-on> date in bold letters at the top of the page, followed by a table consisting of columns for <title>, <author>, and <price> elements. Different style sheets applied to the same XML data source can produce different displays, such as an HTML table, an HTML bulleted list, or a PostScript page. Microsoft has recently made available a technology preview release of its MSXSL processor on the Extensible Markup Language (XML) section of the Microsoft Site Builder Network Web site (http://www.microsoft.com/xml/xsl/msxsl.htm).
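The principle that different style sheets yield different displays of the same data can be imitated in Python. The sketch below is not XSL and does not use the MSXSL processor; TABLE_STYLE, LIST_STYLE, and render are hypothetical names, and each "style" is simply a set of templates applied to the same parsed order.

```python
import xml.etree.ElementTree as ET

ORDER = ET.fromstring(
    "<order><sold-on>19970317</sold-on>"
    "<item><price>5.95</price>"
    "<book><title>Number, the Language of Science</title></book></item>"
    "<item><price>12.95</price>"
    "<book><title>Introduction to Objectivist Epistemology</title></book></item>"
    "</order>"
)

# Each hypothetical "style" is plain data: wrapper markup plus a row template.
TABLE_STYLE = {
    "open": "<table>",
    "row": "<tr><td>{title}</td><td>{price}</td></tr>",
    "close": "</table>",
}
LIST_STYLE = {
    "open": "<ul>",
    "row": "<li>{title} ({price})</li>",
    "close": "</ul>",
}


def render(order, style):
    """Apply one style to the order: bold date first, then one row per item."""
    lines = [f"<b>{order.findtext('sold-on')}</b>", style["open"]]
    for item in order.iter("item"):
        lines.append(
            style["row"].format(
                title=item.findtext("book/title"),
                price=item.findtext("price"),
            )
        )
    lines.append(style["close"])
    return "\n".join(lines)


print(render(ORDER, TABLE_STYLE))
print(render(ORDER, LIST_STYLE))
```

The sample values contain no reserved characters, so the sketch omits escaping; production code would need to escape text before placing it in HTML.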

XML as Metadata: Information about HTML Pages

XML provides a standard way to describe data, such as an HTML page. Because XML is self-describing and textual, it can be accessed by various applications, without those applications having any prior built-in description of the data's structure. This makes XML an ideal candidate for authoring metadata. Using XML as metadata to describe an HTML page (as CDF does), enables universal access to the description of that HTML page. This allows interoperability between applications that are concerned with HTML documents. For example, a database could access the XML metadata about a Web page, process that metadata, and, depending on the results of that processing, return certain information to a scheduling application.

Data Islands: Adding XML Data to HTML Pages

Adding semantic information to HTML pages is not easy. Historically, various programs have attempted to deal with this problem by using nonstandard "tricks," such as hiding data inside HTML comments. However, these comments are awkward and, unlike XML, are not exposed to the object model.

To solve this, Microsoft is working with the W3C to define a format for putting XML-based data (data islands) inside HTML pages. Extending HTML through the use of data islands will allow a wide range of applications to use HTML as the primary document or display format and also use XML embedded within these documents to hold data.

An HTML page could therefore include, among other things, specific data about the subject of the page. For instance, if the page displayed an advertisement for an author's most recent novel, the page could also contain XML data concerning that book, such as its ISBN, publisher, or suggested retail price. It is not important that this information be displayed, but it is important that it be accessible and understandable as data.
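Because the data island format was still being defined when this paper was written, the following Python sketch rests on an assumption: that an island is an <xml id="..."> container element inside the page. The page content, the container syntax, and the function data_islands are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page; the <xml id="..."> container is assumed for illustration.
PAGE = """<html><body>
<h1>A New Novel</h1>
<p>Available now at your local bookstore.</p>
<xml id="bookdata">
  <book>
    <isbn>0-452-01030-6</isbn>
    <price>12.95</price>
  </book>
</xml>
</body></html>"""

ISLAND = re.compile(r'<xml\s+id="([^"]+)"\s*>(.*?)</xml>', re.DOTALL | re.IGNORECASE)


def data_islands(page):
    """Yield (id, parsed root element) for each embedded XML island."""
    for island_id, body in ISLAND.findall(page):
        yield island_id, ET.fromstring(body.strip())


islands = dict(data_islands(PAGE))
print(islands["bookdata"].findtext("isbn"))   # 0-452-01030-6
print(islands["bookdata"].findtext("price"))  # 12.95
```

The reader of the page never sees the ISBN or the price, yet a program can retrieve both as structured data rather than by scraping display text.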

Summary

XML is a standard, extensible, universal format for Web-based data. It is flexible enough to handle an incredibly wide variety of information, and also allows such information to be self-describing, so that it may be manipulated by software that has not been previously exposed to a description of the underlying meaning behind the data. With its powerful expressiveness and flexibility, XML promises to add structure to data on the Internet, bringing the Web one step closer to realizing the potential for universal communication with anyone, anywhere.

Appendix: Various Technical Details

Programmatic Access to XML—the Document Object Model

In addition to providing a file format for representing data, XML needs a standard application programming interface (API) for programmatic manipulation of data. Microsoft is working with the W3C to define a standard set of properties, methods, and events for programmers and script authors to use. This set of standards, the object model, provides a simple means of reading and writing data to and from an XML tree structure. These methods enable programmers everywhere to treat XML as a universal data type for encapsulating and transferring data. Because the object model for XML matches the Document Object Model for HTML, scriptwriters can easily master XML programming. For information about the XML object model, read "The XML Object Model in Microsoft Internet Explorer 4.0."

Up-to-date object model information is available at the W3C Web site (http://www.w3.org/MarkUp/DOM/).
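As a rough illustration of reading and writing an XML tree through an object model, the following sketch uses Python's xml.dom.minidom module, which implements W3C DOM interfaces. Treat it as an analogy only: the method names are those of the W3C DOM as exposed by Python, and we assume rather than claim that they correspond to the Internet Explorer 4.0 object model.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<order><sold-on>19970317</sold-on>"
    "<item><price>5.95</price></item></order>"
)

# Reading: locate an element and read the text node it contains.
price = doc.getElementsByTagName("price")[0]
print(price.firstChild.data)   # 5.95

# Writing: build a new <item> and attach it to the root element.
item = doc.createElement("item")
new_price = doc.createElement("price")
new_price.appendChild(doc.createTextNode("1.50"))
item.appendChild(new_price)
doc.documentElement.appendChild(item)

prices = [p.firstChild.data for p in doc.getElementsByTagName("price")]
print(prices)                  # ['5.95', '1.50']
```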

Namespaces in XML

XML can provide a mechanism for authors to invent new element names and also publish those names so that a community can easily agree on standard terms for representing common data elements. The Layman-Bray proposal for namespaces makes every element name subordinate to a Universal Resource Identifier (URI), which ensures that even if two authors choose the same name, they remain unambiguous. In the same way that anyone can publish his own Web pages or view pages from others, the namespace facility allows anyone to define his own dictionary of terms or to use a public namespace of common terms.

<xml>
  <xml:schema>
    <namespaceDcl href="http://www.company.com"
      name="co"/>
    <namespaceDcl href="http://www.dsig.org"
      name="dsig"/>
  </xml:schema>
  <xml:data>
    <order>
      <sold-to>
        <person>
          <lastname>Layman</lastname>
          <firstname>Andrew</firstname>
        </person>
      </sold-to>
      <sold-on>19970317</sold-on>
      <dsig:digital-signature>1234567890</dsig:digital-signature>
    </order>
  </xml:data>
</xml>

The above code tells any reader that if a name begins with "dsig" its meaning is defined by whoever owns the "http://www.dsig.org" namespace.

Names used within the <co:item> element are presumed to come from the same namespace and, if so, do not need further qualification. Namespaces ensure that element names do not conflict, and clarify who defined which term. They do not give instructions on how to process the elements. Readers still need to know what the elements mean and decide how to process them. Namespaces simply keep the names straight.
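How a reader might keep the names straight can be sketched as follows. The code parses only the declaration portion of the example above and resolves a prefixed name to the URI that owns it. The functions read_declarations and qualify are hypothetical, and the namespaceDcl syntax is that of the proposal described here, which general-purpose parsers do not interpret on their own.

```python
import xml.etree.ElementTree as ET

# Declaration portion of the example above.
SCHEMA_XML = """
<xml:schema>
  <namespaceDcl href="http://www.company.com" name="co"/>
  <namespaceDcl href="http://www.dsig.org" name="dsig"/>
</xml:schema>
"""


def read_declarations(schema_xml):
    """Map each declared prefix to the URI that owns it."""
    schema = ET.fromstring(schema_xml)
    return {d.get("name"): d.get("href") for d in schema.iter("namespaceDcl")}


def qualify(name, declarations):
    """Split 'prefix:local' into (owner URI or None, local name)."""
    prefix, sep, local = name.partition(":")
    if not sep:
        return None, name          # unprefixed name: no explicit owner
    return declarations[prefix], local


declarations = read_declarations(SCHEMA_XML)
print(qualify("dsig:digital-signature", declarations))
print(qualify("order", declarations))
```

Resolving the name says only who defined it. Deciding what to do with a digital-signature element remains the reader's responsibility.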

An author can specify an element's data type (a number, a date, and so forth) and the format of the string's contents. One can use a lextype attribute for this purpose:

<sold-on lextype="DATE-ISO8601">19970317</sold-on>

Here, "DATE-ISO8601" specifies that the <sold-on> element's contents are a date in the format specified by the international standard ISO 8601. As with element names, authors can design their own data types, and also use types shared publicly. Microsoft is working with the W3C to define a set of standard types, and will publish a public list that anyone can freely use.
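A receiver could use the declared type to convert an element's text into a native value. In the Python sketch below, the converter table and the function typed_value are hypothetical, and the type label is treated as an opaque key that selects the ISO 8601 basic date format (YYYYMMDD).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical converter table keyed by the lextype label.
CONVERTERS = {
    "DATE-ISO8601": lambda s: datetime.strptime(s, "%Y%m%d").date(),
}


def typed_value(element):
    """Convert the element's text using its lextype, defaulting to a string."""
    convert = CONVERTERS.get(element.get("lextype"), str)
    return convert(element.text)


sold_on = ET.fromstring('<sold-on lextype="DATE-ISO8601">19970317</sold-on>')
untyped = ET.fromstring("<sold-on>19970317</sold-on>")

print(typed_value(sold_on))   # 1997-03-17
print(typed_value(untyped))   # 19970317
```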

Note: Namespaces are not supported at this time. However, they will be supported by future versions of Internet Explorer.

Character Set and Encoding

All information in XML is Unicode text. This includes the contents of elements and element names themselves. As a result, XML supports representation of all international character sets.

Unicode can be transmitted directly as 16-bit characters, but more commonly is transferred using an encoding that is more convenient or compact for certain languages. XML supports a range of encodings (the default is UTF-8), subject only to the restriction that an entire document must share the same encoding.
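The following Python sketch shows the same document carried in three encodings and read back to identical Unicode text. The sample element and the function encode_document are illustrative; the set of encodings a given parser accepts varies, and the three used here are ones that Python's standard parser is known to handle.

```python
import xml.etree.ElementTree as ET

TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><style>café macchiato</style>'


def encode_document(encoding):
    """Serialize the sample document to bytes in the named encoding."""
    return TEMPLATE.format(enc=encoding).encode(encoding)


for enc in ("UTF-8", "UTF-16", "ISO-8859-1"):
    data = encode_document(enc)
    text = ET.fromstring(data).text
    print(enc, len(data), text)
```

The byte counts differ from one encoding to the next, but the parser reports the same characters each time, which is what allows the choice of encoding to be made for convenience or compactness.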

White Space

Unlike HTML, which ignores white space (spaces, tabs, new lines, and so forth), XML is for data, and thus retains all white space. For example, the following are not equivalent:

<title><composer>Tchaikovsky</composer>'s
            First Piano Concerto</title>

<title><composer>Tchaikovsky</composer>'s
            First
            Piano Concerto</title>
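A parser that retains white space makes the difference visible. The Python sketch below uses shortened titles; the final comparison shows one way an application could choose to normalize white space itself, which is an application decision rather than something XML does for it.

```python
import xml.etree.ElementTree as ET

one = ET.fromstring("<title>First Piano Concerto</title>")
two = ET.fromstring("<title>First\n    Piano Concerto</title>")

print(repr(one.text))
print(repr(two.text))

# The parser hands back the text exactly as written, so the two differ.
print(one.text == two.text)   # False

# An application may collapse runs of white space if it considers them equal.
print(" ".join(one.text.split()) == " ".join(two.text.split()))   # True
```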

Strictly a Tree

XML elements can contain text and other elements, with the exact rules for a specific document type given in its schema. However, elements must be strictly nested: Each start tag must have a corresponding end tag, and elements cannot overlap partially. The examples shown so far have all adhered to proper XML syntax. The following example does not.

<title>Evolution of Culture <sub>in Animals
      </title> by John T. Bonner</sub>
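A conforming parser refuses overlapping elements outright. The sketch below, in which is_well_formed is a hypothetical helper, shows Python's standard parser rejecting the example above and accepting a properly nested variant.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


overlapping = (
    "<title>Evolution of Culture <sub>in Animals"
    "</title> by John T. Bonner</sub>"
)
nested = "<title>Evolution of Culture <sub>in Animals</sub></title>"

print(is_well_formed(overlapping))   # False
print(is_well_formed(nested))        # True
```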

Empty Tags

XML has a shorthand for an empty element: Ending a tag with a "/>" signals that the element has no contents, and does not have an end tag. For example, the following two lines are equivalent:

<title/>


<title></title>
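The equivalence can be checked with a parser. In this Python sketch, both forms produce an element with no text and no children, and both serialize identically.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<title/>")
long_form = ET.fromstring("<title></title>")

print(short_form.text, long_form.text)     # None None
print(len(short_form), len(long_form))     # 0 0
print(ET.tostring(short_form) == ET.tostring(long_form))   # True
```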

Reserved Characters

Several characters are part of the syntactic structure of XML and will not be interpreted as themselves if simply placed within an XML data source. You need to substitute a special character sequence (called an "entity" by XML). Note that case matters.

Table 1. Reserved Characters

Character	Entity
<	&lt;
&	&amp;
>	&gt;

For example, "Melons cost < $1 at the A&P" would be encoded as "Melons cost &lt; $1 at the A&amp;P".
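Most XML libraries perform this substitution automatically. As an illustration, Python's standard xml.sax.saxutils module provides escape and unescape functions, and a parser restores the original characters when it reads the encoded text back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "Melons cost < $1 at the A&P"

encoded = escape(raw)
print(encoded)                  # Melons cost &lt; $1 at the A&amp;P
print(unescape(encoded) == raw) # True

# A parser turns the entities back into the characters they stand for.
note = ET.fromstring(f"<note>{encoded}</note>")
print(note.text)                # Melons cost < $1 at the A&P
```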

Compression

Although simple, robust, and extensible, XML is a verbose format compared to binary schemes. Consequently, we expect that HTTP 1.1 compression will improve the efficiency of XML data transfer. Microsoft is working to popularize standard, efficient compression systems for XML.
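The effect of general-purpose compression on repetitive markup is easy to observe. The sketch below uses gzip, a content coding commonly applied to HTTP responses; the sample document is synthetic, and the actual savings depend entirely on the data.

```python
import gzip

# Synthetic order with 200 identical items; real documents will vary.
item = "<item><price>12.95</price><book><title>Sample</title></book></item>"
document = ("<order>" + item * 200 + "</order>").encode("utf-8")

compressed = gzip.compress(document)
restored = gzip.decompress(compressed)

print(len(document), len(compressed))
print(restored == document)   # True
```

Because the same tag names recur throughout an XML document, the verbosity that makes XML readable is largely the kind of redundancy that compression removes.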

Security

The high degree of structure in an XML document makes it easier to add digital signatures or encryption to individual parts of a document as well as to a whole document. Microsoft is working with the W3C Digital Signature Initiative to define standard, XML-based security and authentication for XML data.