Why XML

This paper discusses the use of Extensible Markup Language (XML) as a standard format for data. It provides an overview of what XML is, why it came about, and why it is an extremely valuable and useful technology for representing and exchanging data.

Why Use XML?

The Web has enabled us to communicate with anyone, anywhere. Widely accepted standards, crucial to using the Web to its full potential, allow Web communication on numerous layers of interoperating technology. One important layer is the visual display and user interface, exemplified by current standards like HTML, GIF, and JScript™. These standards allow a page to be created once and be displayed at different times by many receivers.

Although visual and user interface standards are a necessary layer, they are insufficient for representing and managing data. Today, the Internet is merely an access medium to text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard. It must set an information understanding standard, a common way of representing data so software can better search, move, display, and manipulate information hidden in contextual obscurity. HTML cannot do this because it is a format that describes how a Web page should look; it does not represent data. For example, HTML does not:

provide a standard way for a doctor to send a prescription to a pharmacist.
enable a medical laboratory to publish statistical information in a format that any receiver can analyze.
describe an electronic payment in a form that any recipient can decode and process.
provide a standard way to search law libraries for all litigation documents about a certain topic.
specify how information in a company catalog can be transmitted in a way that allows a salesman to work offline, show the catalog to clients, take orders, and upload those orders in a standard format.

In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage data.

A standard for data representation will expand the Internet in much the same way that the HTML standard for display did a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Medical histories, pharmaceutical research data, semiconductor part sheets, and purchase orders will all be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. The data standard is XML and XML extensions.

What Is XML?

XML is a meta-markup language that provides a format for describing structured data. This facilitates more precise declarations of content and more meaningful search results across multiple platforms. In addition, XML will enable a new generation of Web-based data viewing and manipulation applications.

Structural Representation of Data

XML provides a structural representation of data that can be implemented broadly and is easy to deploy. XML is a subset of SGML optimized for delivery over the Web. Because it is defined by the World Wide Web Consortium (W3C), XML ensures that structured data will be uniform and independent of applications or vendors. The resulting interoperability is kick-starting a new generation of business and electronic-commerce Web applications.

XML, which provides a data standard that can encode the content, semantics, and schemata for a wide variety of cases ranging from simple to complex, can be used to mark up the following:

An ordinary document.
A structured record, such as an appointment record or purchase order.
An object with data and methods, such as the persistent form of a Java object or ActiveX control.
A data record, such as the result set of a query.
Meta-content about a Web site, such as Channel Definition Format (CDF).
Graphical presentation, such as an application's user interface.
Standard schema entities and types.
All links between information and people on the Web.

Once the data is on the client desktop, it can be manipulated, edited, and presented in multiple views, without return trips to the server. Servers can now become more scalable, due to lower computational and bandwidth loads. Also, since data is exchanged in the XML format, it can be easily merged from different sources.

XML is valuable to the Internet, as well as to large corporate intranet environments, because it provides interoperability using a flexible, open, standards-based format, with new ways of accessing legacy databases and delivering data to Web clients. Applications can be built more quickly, are easier to maintain, and can easily provide multiple views on the structured data.

XML Documents

XML is a text-based format, similar to HTML in many respects, designed specifically to store and transmit data. An XML source is made up of XML elements, each of which consists of a start tag (<title>), an end tag (</title>), and the information between the two tags (referred to as the content). Like HTML, an XML document holds text annotated by tags. However, unlike HTML, XML allows an unlimited set of tags, each indicating not how something should look, but what something means. For example, an XML element might be tagged as a price, an order number, or a name. It is up to each document's author to determine what kind of data to use and which tag names fit best.

XML documents are easy to create. If you are familiar with HTML, you can quickly learn to author in XML. In this example, XML is used to describe a weather report. The file can be saved with an .xml extension, for example Weather.xml.

<weather-report>
<date>March 25, 1998</date>
<time>08:00</time>
<area>
   <city>Seattle</city>
   <state>WA</state>
   <region>West Coast</region>
   <country>USA</country>
</area>
<measurements>
   <skies>partly cloudy</skies>
   <temperature>46</temperature>
   <wind>
      <direction>SW</direction>
      <windspeed>6</windspeed>
   </wind>
   <h-index>51</h-index>
   <humidity>87</humidity>
   <visibility>10</visibility>
   <uv-index>1</uv-index>
</measurements>
</weather-report>

Rather than describing the order and fashion in which the data should be displayed, the tags indicate what each item of data means (whether it is a <date> element, an <area> element, and so forth). Any receiver of this data can then decode the document, using it for their own purposes. For example, an individual might use it to make plans for the day, while a weather researcher might use it as data in a historical record of Seattle.
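As a minimal sketch of how a receiver might decode such a document, the following Python fragment uses the standard library's xml.etree.ElementTree module to look up elements by name in a trimmed copy of the weather report above. The summarize function and the wording of its output are illustrative assumptions, not part of any standard.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the weather report shown above.
WEATHER_XML = """
<weather-report>
  <date>March 25, 1998</date>
  <time>08:00</time>
  <area>
    <city>Seattle</city>
    <state>WA</state>
  </area>
  <measurements>
    <skies>partly cloudy</skies>
    <temperature>46</temperature>
    <wind>
      <direction>SW</direction>
      <windspeed>6</windspeed>
    </wind>
    <humidity>87</humidity>
  </measurements>
</weather-report>
"""

def summarize(xml_text):
    """Return a one-line summary built from elements looked up by name."""
    report = ET.fromstring(xml_text)
    city = report.findtext("area/city")
    skies = report.findtext("measurements/skies")
    temperature = int(report.findtext("measurements/temperature"))
    return f"{city}: {skies}, {temperature} degrees"

summary = summarize(WEATHER_XML)
print(summary)
```

Because the code asks for elements by name rather than by position, a different receiver could pull out a different subset of the same document without any change to the data.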

Extensible

In XML you can define an unlimited set of tags. While HTML tags can be used to display a word in bold or italic, XML provides a framework for tagging structured data. An XML element can declare its associated data to be a retail price, a sales tax, a book title, the amount of precipitation, or any other desired data element. As XML tags are adopted throughout an organization, and by others across the Internet, there will be a corresponding ability to search for and manipulate data regardless of the applications within which it is found. Once data has been located, it can be delivered over the wire and presented in a browser in any number of ways, or it can be handed off to other applications for further processing and viewing.

A tag represents a piece of data. Often it will correspond to a field in a table, but this is not at all necessary. The tag may instead hold a calculated value (price times quantity, for example). Nor is there any reason to expect that an XML file represents data in one table. Just as often, the XML will represent the results of a query involving multiple tables. As long as the receiving application can make sense of the data in the XML, it is immaterial where the data comes from and how it found its way into the XML file.
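The point about calculated values and multi-table queries can be sketched as follows. The rows, the element names, and the rows_to_xml helper are all hypothetical; they stand in for whatever a real query might return.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, as a query joining an orders table and a products table
# might return them: (order number, product name, unit price, quantity).
ROWS = [
    ("1001", "widget", 2.50, 4),
    ("1002", "gadget", 10.00, 3),
]

def rows_to_xml(rows):
    """Build an <orders> document; <total> is calculated, not stored."""
    root = ET.Element("orders")
    for number, product, price, quantity in rows:
        order = ET.SubElement(root, "order")
        ET.SubElement(order, "order-number").text = number
        ET.SubElement(order, "product").text = product
        ET.SubElement(order, "price").text = f"{price:.2f}"
        ET.SubElement(order, "quantity").text = str(quantity)
        ET.SubElement(order, "total").text = f"{price * quantity:.2f}"
    return root

orders = rows_to_xml(ROWS)
orders_xml = ET.tostring(orders, encoding="unicode")
print(orders_xml)
```

The receiving application sees only the <total> element. Whether that value came from a stored column or from a calculation is invisible to it.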

Data Is Separated From the Presentation and the Process

The power and beauty of XML is that it maintains the separation of the user interface from the structured data. HTML specifies how to display data in a browser; XML defines the content. In HTML you use tags to tell the browser to display data as bold or italic; with XML you use tags only to describe data, such as city name, temperature, and barometric pressure. To present XML data in a browser, you use style sheets written in Extensible Stylesheet Language (XSL) or Cascading Style Sheets (CSS). XML separates the data from the presentation and the process, enabling you to display and process the data as you wish by applying different style sheets and applications.

This separation of data from presentation enables the seamless integration of data from many sources. Customer information, purchase orders, research results, bill payments, medical records, catalog data, and other sources can be converted to XML on the middle tier, allowing data to be exchanged online as easily as HTML pages display data today. Data encoded in XML can then be delivered over the Web to the desktop. No retrofitting is necessary for legacy information stored in mainframe databases or documents, and because HTTP is used to deliver XML over the wire, no changes are required for this function.

Making XML Data Self-Describing

With XML, Document Type Definitions (DTDs) can accompany a document, essentially defining the rules of the document, such as which elements are present and the structural relationship between the elements. DTDs help to validate the data when the receiving application does not have a built-in description of the incoming data. With XML, however, DTDs are optional.

Data that is sent along with a DTD and conforms to it is known as valid XML. In this case, an XML parser can check incoming data against the rules defined in the DTD to make sure the data is structured correctly. Data sent without a DTD is known as well-formed XML: it need only follow the basic syntax rules, such as matching and properly nested start and end tags. Here an XML-based document instance, such as the hierarchically structured weather data shown above, can be used to implicitly describe itself.
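The well-formedness check can be illustrated with a short sketch. Python's standard xml.etree.ElementTree parser is non-validating, so this example tests only whether the tags are syntactically correct; checking a document against a DTD would require a validating parser, which is not shown here. The is_well_formed helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = "<wind><direction>SW</direction><windspeed>6</windspeed></wind>"
# The <direction> element is never closed, so the tags do not nest properly.
bad = "<wind><direction>SW<windspeed>6</windspeed></wind>"

print(is_well_formed(good))
print(is_well_formed(bad))
```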

With both valid and well-formed XML, XML-encoded data is self-describing because descriptive tags are intermixed with the data. The open and flexible format used by XML allows it to be employed anywhere a need exists for the exchange and transfer of information. This makes it extremely powerful.

For instance, XML can be used to describe information about HTML pages, or it can be used to describe data contained in business rules or objects in an electronic-commerce transaction, such as invoices, purchase orders, and order forms. Because XML is separate from HTML, XML can be added inside HTML documents. The W3C has defined a format by which XML-based data, or XML data islands, can be encapsulated in HTML pages. By embedding XML data inside an HTML page, multiple views can be generated from the delivered data, using the semantic information contained in the XML. Also, XML can be used for compelling applications like distributed printing, database searches, and others.

Schemas

A schema is a formal specification of the rules of an XML document: it names the elements that are allowed in a document and indicates in what combinations they may appear. New schema languages, as defined in the XML-Data Working Group XML-Data and Document Content Description (DCD) proposals submitted to the W3C, provide the same functionality as a DTD. However, because these schema languages are extensible, developers can augment them with additional information, such as data types, inheritance, and presentation rules. This makes these new schema languages far more powerful than DTDs.

With XML-Data and DCD, Microsoft and others have proposed vocabularies for expressing the schema for an XML document using XML itself. This allows XML data to describe its own structure. Expressing schemata within XML adds great power to the XML format, because it is then possible for software examining certain data to understand its structure without having any prior built-in descriptions of the data's structure.

Using a schema, an author can define precisely which element names are permitted in a document and, within each element, which subelements, attributes, and relations are allowed. An author can import fragments from other schemata, and extend types through inheritance. This allows complex relationships between elements, while retaining the simplicity of a lexical tree structure.

Authors can invent their own schemata, or they can share ones created by other authors. Readers can check the schema references to verify that the document they have received is the correct type. They can also use the information in the schema to validate the structure of the document automatically.
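To make the idea of automatic structural validation concrete, the sketch below checks a document against a deliberately simplified, hand-written schema. The SCHEMA dictionary and the violations function are hypothetical and do not use XML-Data or DCD syntax; they only illustrate the kind of rule (which subelements each element may contain) that a schema expresses.

```python
import xml.etree.ElementTree as ET

# A hypothetical, hand-written schema: each element name maps to the set of
# child element names it may contain. This is not XML-Data or DCD syntax.
SCHEMA = {
    "wind": {"direction", "windspeed"},
    "direction": set(),
    "windspeed": set(),
}

def violations(element, schema):
    """Return a list of messages for names or nestings the schema forbids."""
    if element.tag not in schema:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in schema[element.tag]:
            problems.append(f"<{child.tag}> not allowed in <{element.tag}>")
        else:
            problems.extend(violations(child, schema))
    return problems

ok = ET.fromstring("<wind><direction>SW</direction><windspeed>6</windspeed></wind>")
bad = ET.fromstring("<wind><direction>SW</direction><gusts>9</gusts></wind>")

print(violations(ok, SCHEMA))
print(violations(bad, SCHEMA))
```

A real schema language would also cover attributes, data types, and ordering, which this sketch leaves out.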

Companies that want to use XML need a simple way to find the information about the schemata, documents, and business processes that other businesses and applications support. Imagine the tremendous cost to consumers and businesses alike if every business were left to define its own way of publishing this information. Even with the Web, the costs associated with setting up and maintaining a Web site are beyond the abilities of some businesses. With no limit to the number of businesses that could publish this information, the lack of standards that define how to publish it in a safe and controlled way would lead to thousands and thousands of different implementations, navigation approaches, and depths of content. The cost burden of allowing this "wild" environment to propagate would spread to the consumers.

Microsoft has chosen to minimize this problem by creating and managing www.biztalk.org. This site will grow into a portal for locating, managing, learning about, and publishing XML, XSL, and the information models used in thousands of applications. A fully-functional online repository of schemata is scheduled for delivery in early fall of 1999.

Open Standards

XML is based on proven standards-based technology optimized for the Web. Microsoft is working with other leading companies and working groups at the W3C to help ensure interoperability and support for developers, authors, and users on multiple systems and browsers, and to evolve the XML standard.

The XML initiative consists of a set of related standards:

Extensible Markup Language (XML) is a Recommendation, the final stage in the W3C approval process. This means that the standard is stable and can be fully embraced by Web and tools developers.
XML Namespaces is a Recommendation, describing namespace syntax and support for namespace-aware XML parsers.
The Document Object Model (DOM) Level 1 is a Recommendation, providing a standard for programmatic access to structured data through scripting, so developers can consistently interact with and compute on XML-based data.
Extensible Stylesheet Language (XSL) is currently a working draft. XSL has two modular sections: the XSL Transformation Language and the XSL Formatting Objects. The transformation language can be used to transform XML for display. Since the two parts of XSL are modular, the transformation language can be used independently for general-purpose transformations, including converting XML to well-formed HTML. CSS can be applied to simply structured XML data but cannot present information in an order different from how it was received.
XML Linking Language (XLL) and its companion XML Pointer Language (XPointer) are currently working drafts. XLL is an XML linking language that provides links in XML similar to those in HTML but offers more power. With XLL, linking could be multidirectional, and links could exist at an object level rather than just at a page level. Internet Explorer 5 has no inherent support for XLL.

XML structural schemata such as those described by XML-Data Note and Document Content Description for XML (DCD) are subjects of the W3C XML-Data Working Group as well.
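Of the standards listed above, namespaces are perhaps the easiest to illustrate briefly. In the sketch below, two vocabularies both define a title element, and a namespace-aware parser keeps them apart. The namespace URIs, prefixes, and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; they only need to be unique identifiers.
DOC = """
<order xmlns:book="http://example.com/book"
       xmlns:person="http://example.com/person">
  <book:title>XML Basics</book:title>
  <person:title>Dr.</person:title>
</order>
"""

# Prefix-to-URI mapping used for lookups; prefixes here need not match
# the ones in the document, only the URIs must.
NS = {
    "book": "http://example.com/book",
    "person": "http://example.com/person",
}

order = ET.fromstring(DOC)
book_title = order.findtext("book:title", namespaces=NS)
person_title = order.findtext("person:title", namespaces=NS)
print(book_title, person_title)
```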

Benefiting from XML

Because XML brings so much power and flexibility to Web-based applications, it provides a number of compelling benefits to developers and users:

More meaningful searches
Development of flexible Web applications
Data integration from disparate sources
Local computation and manipulation of data
Multiple views of the data
Granular updates

Development of Flexible Web Applications

Once data has been found, XML can be delivered to other applications, objects, and middle-tier servers for further processing, or it can be delivered to the desktop for viewing in a browser. XML, together with HTML for display, scripting for logic, and a common object model for interacting with the data and display, provides the technologies needed for flexible three-tier Web application development.

Data Integration From Disparate Sources

Searching multiple, incompatible databases is virtually impossible today. XML enables structured data from different sources to be easily combined. Software agents can be used to integrate data on a middle-tier server from back-end databases and other applications. This data can then be delivered to clients or other servers for further aggregation, processing, and distribution.

The extensibility and flexibility of XML allow it to describe data contained in a wide variety of heterogeneous applications, from describing collections of Web pages to data records. Again, since XML-based data is self-describing, data can be exchanged and processed without having a built-in description of the incoming data.
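A middle-tier merge of this kind might look like the following sketch. The two source documents, their element names, and the merge_by_name helper are assumptions made for illustration; a real integration would also need rules for conflicting values.

```python
import xml.etree.ElementTree as ET

# Hypothetical output from two unrelated back-end systems.
WAREHOUSE_XML = "<items><item><name>widget</name><stock>12</stock></item></items>"
CATALOG_XML = "<items><item><name>widget</name><price>2.50</price></item></items>"

def merge_by_name(*sources):
    """Combine <item> elements from several documents, keyed by <name>."""
    merged = {}
    for xml_text in sources:
        for item in ET.fromstring(xml_text).iter("item"):
            record = merged.setdefault(item.findtext("name"), {})
            for field in item:
                record[field.tag] = field.text
    root = ET.Element("items")
    for record in merged.values():
        item = ET.SubElement(root, "item")
        for tag, text in record.items():
            ET.SubElement(item, tag).text = text
    return root

combined = merge_by_name(WAREHOUSE_XML, CATALOG_XML)
combined_xml = ET.tostring(combined, encoding="unicode")
print(combined_xml)
```

Because each field carries its own tag, the merge needs no prior knowledge of which source supplies which field.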

Local Computation and Manipulation

After being delivered to the client, data in XML format can be parsed and locally edited and manipulated, with computations performed by client applications. Users can manipulate data in various ways, rather than merely viewing it. The XML Document Object Model (DOM) also allows data to be manipulated with scripting or other programming languages. Data computations can be performed without additional return trips to the server. Separating the user interface that views data from the data itself allows powerful applications, formerly found only on high-end databases, to be created naturally for the Web using a simple, flexible, open format.
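The following sketch shows local computation through DOM interfaces, using Python's xml.dom.minidom as a stand-in for a browser's DOM implementation. Treating the temperature as degrees Fahrenheit and the temperature-celsius element name are both assumptions; the weather report itself states no unit.

```python
from xml.dom.minidom import parseString

doc = parseString("<measurements><temperature>46</temperature></measurements>")

# Read the text node inside <temperature> through the DOM interfaces.
temperature = doc.getElementsByTagName("temperature")[0]
fahrenheit = int(temperature.firstChild.data)

# Compute a derived value locally and attach it as a new element.
# Treating 46 as degrees Fahrenheit is an assumption; the report gives no unit.
celsius = doc.createElement("temperature-celsius")
celsius.appendChild(doc.createTextNode(str(round((fahrenheit - 32) * 5 / 9))))
doc.documentElement.appendChild(celsius)

result = doc.documentElement.toxml()
print(result)
```

The conversion runs entirely on the client; no request goes back to the server that supplied the data.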

Multiple Views of Data

Once data has been delivered to the desktop, it can be viewed in different ways. By describing structured data in a simple, open, and extensible manner, XML complements HTML, which is widely used to describe user interfaces. Again, while HTML describes the appearance of data, XML describes data itself. Since display is now separate from data, having this data defined in XML allows different views to be specified, resulting in data being presented appropriately. Local data can be presented dynamically in a manner determined by client configuration, user preference, or other criteria. CSS and XSL provide declarative mechanisms for describing a particular view of the data.
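As a rough illustration of multiple views, the sketch below renders one document in two ways. It uses ordinary Python functions in place of CSS or XSL style sheets, so it shows the principle rather than the declarative mechanism itself. The function names and output layouts are hypothetical.

```python
import xml.etree.ElementTree as ET

REPORT = ET.fromstring(
    "<weather-report><area><city>Seattle</city></area>"
    "<measurements><skies>partly cloudy</skies>"
    "<temperature>46</temperature><humidity>87</humidity>"
    "</measurements></weather-report>"
)

def headline_view(report):
    """A short, one-line view for a small display."""
    return f"{report.findtext('area/city')}: {report.findtext('measurements/skies')}"

def table_view(report):
    """A name/value listing of every measurement, one per line."""
    rows = [f"{m.tag:<12}{m.text}" for m in report.find("measurements")]
    return "\n".join(rows)

headline = headline_view(REPORT)
table = table_view(REPORT)
print(headline)
print(table)
```

Both views read the same unchanged data, so a new view can be added without touching the document or the server that produced it.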

Granular Updates

Data can be granularly updated with XML, eliminating the need to resend an entire structured data set each time a portion of the data changes. Only the changed element must be sent from the server to the client, and the changed data can be displayed without refreshing the entire user interface. Presently, an entire page must be rebuilt if even one item of data changes, even when the view remains constant. This severely limits server scalability.

Also, XML allows other data to be added, such as predicted high and low temperatures, expected precipitation, and probability (in percent). This additional information can stream into the user's existing view without the browser having to request a new view. If additional information such as barometric pressure is requested, it can be sent without rebuilding the page.
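A granular update can be sketched as a small document that carries only the changed and added elements, which the client folds into the data it already holds. The update format, the pressure element and its value, and the apply_update helper are hypothetical; no particular update protocol is implied.

```python
import xml.etree.ElementTree as ET

# The document the client already holds.
current = ET.fromstring(
    "<measurements><temperature>46</temperature>"
    "<humidity>87</humidity></measurements>"
)

# A hypothetical update message carrying only what changed or was added.
update = ET.fromstring(
    "<measurements><temperature>48</temperature>"
    "<pressure>30.1</pressure></measurements>"
)

def apply_update(target, changes):
    """Overwrite matching child elements in place and append new ones."""
    for changed in changes:
        existing = target.find(changed.tag)
        if existing is None:
            target.append(changed)
        else:
            existing.text = changed.text

apply_update(current, update)
updated_xml = ET.tostring(current, encoding="unicode")
print(updated_xml)
```

The humidity element was never resent, yet it is still present after the update.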

Futures

As an industry standard for expressing structured data, XML offers many advantages to organizations, software developers, Web sites, and end users. The opportunities will expand further as more vertical market data formats are created for key markets such as advanced database searching, online banking, medical, legal, electronic commerce, and other fields. And when sites dispense data rather than just views on data, extraordinary opportunities result.

Customer services are now migrating to Web sites from call centers and physical locations and will therefore benefit from the robust functionality of XML. And, because most of these business applications involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so on, XML will revolutionize end-user possibilities on the Internet by allowing a rich array of business applications to be implemented. In addition, information already on Web sites, whether stored in documents or databases, can be marked up using XML-based, intranet-oriented vocabularies. These vocabularies also help small- and medium-sized corporations that need to exchange information between customers and suppliers.

A vital untapped market is development tools that make it easy for end users to build their own collaborative Web sites, including tools for generating XML data from legacy database information and from existing user interfaces. In addition, standard schemata could be developed for describing data such as portfolios, which could then use the layout, graphs, and other functions of Excel or other existing spreadsheets. Declarative and visual tools for describing XML generated from legacy databases are a powerful opportunity. Custom tools for viewing XML data can be written in the Visual Basic® development system, Java, and C++.

XML will require powerful new tools for presenting rich, complex XML data within a document. This is done by mapping a user-friendly display layer on top of a complex set of hierarchical data that can change dynamically. Possible layouts to use for XML data include collapsing outlines, PivotTable dynamic views, and a simple sheet for each portfolio.

Web sites can offer stock quotes, news articles, or real-time traffic data, which can be obtained by filtering from Web broadcasts or by intelligent polling of a tree of servers replicating these sites. Information overload can be avoided with XML by writing custom rules for the aging of information, as is done with e-mail. XML-based tools for users to construct these rules and server and client software to execute them are a huge opportunity. A Standard Object Model could enable these functions, typically written in script, to filter incoming messages, examine stored messages, create outgoing messages, access databases, and so on. These agents can be written to run anywhere automatically.
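One such aging rule might be sketched as follows. The feed structure, the ISO date format inside each date element, and the drop_older_than helper are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical feed; the element names and the <date> format are assumptions.
FEED = ET.fromstring(
    "<items>"
    "<item><title>Traffic alert</title><date>1998-03-25</date></item>"
    "<item><title>Stock summary</title><date>1998-03-01</date></item>"
    "</items>"
)

def drop_older_than(feed, today, max_age_days):
    """Remove <item> elements whose <date> is more than max_age_days old."""
    for item in list(feed):  # copy the list so removal is safe while looping
        item_date = date.fromisoformat(item.findtext("date"))
        if (today - item_date).days > max_age_days:
            feed.remove(item)

drop_older_than(FEED, today=date(1998, 3, 26), max_age_days=7)
remaining = [item.findtext("title") for item in FEED]
print(remaining)
```

Because the date is tagged as data rather than buried in display markup, the rule can be applied by any agent that understands the vocabulary.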