XML: Data the Way You Want It

Michael Edwards
Developer Technology Engineer
Microsoft Corporation

October 31, 1997

Contents

Introduction
Is HTML Broken?
What Is XML and Where Did It Come From?
What Can You Do with XML?
Summary
For More Information

Introduction

The World Wide Web Consortium (W3C) is defining a new standard for data formats called Extensible Markup Language (XML). If you've been around the Web very long, you know that HTML is pretty good at displaying Web-page content, but pretty bad at describing Web-page content as data so that it can be successfully manipulated over the Internet. Fortunately, that is what XML is all about—providing a standard for defining your own markup tags and data structure so that data can be easily exchanged online.

XML provides the ability to augment HTML text with XML tags to make "smarter" Web pages and intranets. The XML tags let the browser know about the information that it's presenting, which enables all kinds of new capabilities:

Since co-founding the XML Working Group of the W3C over a year ago, Microsoft has been actively involved in advancing XML through draft specifications to a final recommended specification. And, just as Microsoft's Dynamic HTML offers fully compliant implementations of the W3C HTML 4.0 specification (see the W3C Web site documentation; http://www.w3.org/TR/WD-html40/) and Document Object Model interface (http://www.w3.org/DOM/), Microsoft® Internet Explorer version 4.0 includes compliant support for XML data formats. For example, if you are creating Active Channel™ sites for Internet Explorer 4.0, you are using Channel Definition Format (CDF), an XML data format for implementing push technology on the Web.

This is the first in a series of Site Builder Network articles explaining how you can use the XML features in Internet Explorer 4.0. This article explains how XML complements HTML and what XML markup looks like. You'll also learn how XML data formats are being used today, and about the tools and technologies available for authoring and working with XML. You will be able to begin working with XML today using Internet Explorer 4.0, and take advantage of future developments.

Is HTML Broken?

I've always believed in the common wisdom, "If it ain't broke, don't fix it" (unless I can find my hammer, in which case I can "fix" anything). But really, XML doesn't replace HTML; in fact, XML complements HTML by solving some key HTML limitations:

In short, HTML is a markup language for presentation, and XML is a language for creating markup languages that describe structured data.

What Is XML and Where Did It Come From?

XML is the universal language for describing and exchanging data on the Web. XML was born from an existing international standard called Standard Generalized Markup Language (SGML). See the Summer Institute of Linguistics Web site (http://www.sil.org/sgml/) for more information about SGML. SGML was created about 30 years ago in an effort to define a markup language for representing textual information. XML leverages the great work that went into creating SGML (and related standards) by identifying a subset specifically targeted for the Web. Hence, XML is a much smaller and simpler language that is often referred to as containing "20 percent of the complexity and 80 percent of the functionality" of SGML. Thus, like SGML, XML is a facility for creating new markups that provides a file format and data structure for representing data.

What Does XML Look Like?

XML looks a lot like HTML. If you can read HTML, you will have little trouble reading XML. XML uses the same symbols to delineate markup. However, the actual XML elements might seem unfamiliar. The main thing to remember is that XML does not define any elements of its own, except the ones that are used to define new elements. Thus, XML is self-describing: A small set of built-in XML elements is used to define any new set of elements (and their hierarchical structure) that are contained in an XML document. This self-describing nature makes XML a wonderful transport mechanism for data because you can define a new set of elements that describe a certain data set, and include that description with the marked-up data.

In addition to creating stand-alone XML documents, you can integrate XML data with HTML using the data-binding features of Internet Explorer 4.0. But I am getting ahead of myself; let's first take a walk down XML lane and look at an XML example.

An XML Example: Channel Definition Format

Creating Active Channels

CDF-specific tags enable a publisher to offer automatic delivery of information (called channels) from their Web servers to Web users. As you might infer from CDF element names, the data can include a schedule for automatic downloads, indications for newly updated content and branded logos, and a topic hierarchy.

The XML sample, "Creating Active Channels," included in the Internet Client SDK contains extensive information and sample code for creating active channels (see the MSDN™ Library, Internet Client SDK; or MSDN Online at http://www.microsoft.com/msdn/sdk/inetsdk/help/delivery/authoring/channels/development.htm#sec_channel_cdf_create).

The online version of this article (available in the Extensible Markup Language (XML) section at the Microsoft Site Builder Network Web site at http://www.microsoft.com/xml/xmldata.htm) also contains a detailed example of creating active channels.
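To make those element names a little more concrete, here is a minimal sketch of what a small channel file might look like, wrapped in a few lines of Python so the markup can be run through a parser. This is not the SDK sample: every URL, title, and schedule value is a placeholder invented for illustration, and the element and attribute names should be checked against the CDF documentation before you rely on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical channel; every URL and title below is a placeholder.
CDF_SAMPLE = """\
<CHANNEL HREF="http://www.example.com/channel/index.htm">
  <TITLE>Example Channel</TITLE>
  <ABSTRACT>Daily headlines from an imaginary site.</ABSTRACT>
  <LOGO HREF="http://www.example.com/channel/logo.gif" STYLE="IMAGE"/>
  <SCHEDULE>
    <INTERVALTIME DAY="1"/>
  </SCHEDULE>
  <ITEM HREF="http://www.example.com/channel/news.htm">
    <TITLE>News</TITLE>
  </ITEM>
  <ITEM HREF="http://www.example.com/channel/sports.htm">
    <TITLE>Sports</TITLE>
  </ITEM>
</CHANNEL>
"""

# Parse the markup and walk the topic hierarchy under the root element.
channel = ET.fromstring(CDF_SAMPLE)
item_titles = [item.findtext("TITLE") for item in channel.findall("ITEM")]
print(channel.tag, item_titles)
```

Notice that the single root element, CHANNEL, contains everything else: the schedule, the logo, and the hierarchy of items.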

Defining XML Data Structure

The W3C has identified design goals for the XML standard. One of those goals is to define a precise structure for XML markup so it is easy to parse and straightforward to understand. As a result, the rules for constructing start tags, end tags, attribute (name=value) pairs, and so on, are spelled out very clearly in the (draft) XML syntax specification. But the basic XML syntax rules only determine whether XML markup is well-formed, that is, whether a piece of XML markup conforms to the basic rules for how all XML markup must be constructed. To determine whether an instance of XML markup is valid, you need to determine whether it conforms to the rules for that data type; each XML data type has rules for what elements are allowed and exactly how they are pieced together. To that end, XML documents have two parts: the prolog declares the element names, attributes, and construction rules of valid markup for that data type, and the document instance contains the actual markup. Thus, the XML markup contained in the document instance is valid only if it conforms to the rules laid out in the document prolog.
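The well-formedness half of that distinction is easy to try out. The sketch below assumes Python's standard ElementTree module, whose parser is non-validating: it reports whether markup obeys the basic syntax rules, but it does not check the markup against a DTD, so checking validity would need a validating parser instead. The helper name is_well_formed is my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup obeys the basic XML syntax rules."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

good = "<CHANNEL><TITLE>News</TITLE></CHANNEL>"
bad = "<CHANNEL><TITLE>News</CHANNEL>"  # the TITLE start tag is never closed

print(is_well_formed(good), is_well_formed(bad))
```

A document can pass this check and still be invalid, for example by using an element that its DTD never declares.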

Prolog

The prolog section identifies an XML document or code fragment, and includes the information needed to parse the data in the file. The prolog section includes several kinds of declarations, statements that define the construction rules for that data type. The XML declarations use processing instructions. The DOCTYPE declarations together describe what is called the Document Type Definition (DTD), the rules by which the data must abide to be valid. Element names and their valid hierarchical structure, together with their allowed attributes and entities (the ELEMENT, ATTLIST, and ENTITY declarations), are declared in the DOCTYPE declaration and define the complete set of XML markup that authors can use in the document. As a group, the declarations are referred to as internal or external, depending on whether they are in the same file or in another file. The syntax for building DTDs from these declarations enables authors to describe documents with a high degree of structural complexity. Let's look at each kind of declaration and how to use it.

XML declaration

The first line of the prolog is an XML declaration:

<?XML Version Encoding RequiredMarkupDeclaration?>

The <? and ?> delimiters indicate a processing instruction for an XML parser. The XML processing instruction indicates an XML document, and includes:

Version

The XML standard version (currently 1.0)

Encoding

Specifies the Unicode character set that is used in the document (very important for localization issues)

RequiredMarkupDeclaration (RMD)

Indicates whether the browser needs to read the document type declarations in the prolog before it can parse the markup data
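If you want to see a parser read the declaration, here is a small sketch using Python's standard expat bindings. One assumption to flag loudly: expat follows the XML 1.0 syntax as it was later finalized, not the draft form shown above, so the declaration is written in lowercase and a standalone attribute takes roughly the place of the RMD. The handler name on_xml_decl is my own.

```python
import xml.parsers.expat

seen = {}

def on_xml_decl(version, encoding, standalone):
    # standalone arrives as 1 for "yes", 0 for "no", and -1 when omitted.
    seen["version"] = version
    seen["encoding"] = encoding
    seen["standalone"] = standalone

parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_xml_decl
parser.Parse(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CHANNEL/>',
    True,
)
print(seen)
```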

DOCTYPE declaration

Immediately following the XML declaration are the document type declarations that define this type of XML data. Sometimes an application can determine the data type from context (such as a filename extension), in which case document type declarations are optional because they are built in to the application or the application knows where to find the DTD information. In our CDF example file, the DOCTYPE declarations are not necessary. If declarations are present, the following format is observed:

<!DOCTYPE Name SYSTEM ExternalDeclarations [InternalDeclarations]>

The <! characters are the opening delimiter for a document type declaration, and the DOCTYPE declaration contains these parts:

Name

A value that by convention matches the name of the root element for this document type. In a CDF file, for example, Name = CHANNEL. Because a valid XML document must have a single root element with all other elements as children, the name of that great-granddaddy root element specifies the overall document type. In other words, the value of Name defines the XML document type.

ExternalDeclarations

A URL to a file containing any external declarations that specify the document's markup tags and their structure.

InternalDeclarations

Inline declarations specifying the document's markup tags and their structure.
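Here is a sketch of how the three parts show up to a parser, again assuming Python's expat bindings. The file name channel.dtd and the tiny internal subset are made up for illustration; expat does not fetch the external file by default, it simply reports the name it found.

```python
import xml.parsers.expat

# Hypothetical document: channel.dtd is a placeholder, not a real DTD file.
DOC = b"""<!DOCTYPE CHANNEL SYSTEM "channel.dtd" [
  <!ELEMENT CHANNEL (TITLE)>
  <!ELEMENT TITLE (#PCDATA)>
]>
<CHANNEL><TITLE>News</TITLE></CHANNEL>"""

doctype = {}

def on_doctype(name, system_id, public_id, has_internal_subset):
    # name is the Name part, system_id the external declarations, and
    # has_internal_subset says whether inline declarations were present.
    doctype.update(
        name=name,
        system_id=system_id,
        has_internal_subset=bool(has_internal_subset),
    )

parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.Parse(DOC, True)
print(doctype)
```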

Element declaration

To construct the elements that comprise a DTD you use an element declaration:

<!ELEMENT Name Content>

Name, of course, is the element's name, and is analogous to a unique data type. Content indicates other elements and character data that can be contained by the Name element type, and is constructed using its own grammar, that is, rules for how the content elements are structured. The grammar for the content in an element declaration can, for example, require child elements in a fixed sequence, offer a choice among alternatives, mark content as optional or repeatable, or allow character data.
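As a sketch of what such content grammar looks like in practice, the declarations below describe a channel that must have a title, may have an abstract, and can hold any number of items. The content models are simplified for illustration and are not the real CDF DTD. Python's expat bindings hand each declaration to a handler as a nested tuple of (type, quantifier, name, children); the handler name on_element_decl is my own.

```python
import xml.parsers.expat
from xml.parsers.expat import model

# Simplified, hypothetical content models (not the actual CDF DTD).
DOC = b"""<!DOCTYPE CHANNEL [
  <!ELEMENT CHANNEL (TITLE, ABSTRACT?, ITEM*)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT ABSTRACT (#PCDATA)>
  <!ELEMENT ITEM (TITLE)>
]>
<CHANNEL><TITLE>News</TITLE></CHANNEL>"""

content_models = {}

def on_element_decl(name, content):
    # content is a nested tuple: (type, quantifier, name, children)
    content_models[name] = content

parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

kind, _, _, children = content_models["CHANNEL"]
child_names = [child[2] for child in children]
quantifiers = {child[2]: child[1] for child in children}
print(kind == model.XML_CTYPE_SEQ, child_names)
```

The comma means "in this order," the question mark means "optional," and the asterisk means "zero or more."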

Attribute list declaration

Because you're familiar with HTML you know that most elements have attributes. In XML, you can build your own attribute list in a DTD by using the attribute list declaration:

<!ATTLIST ElementName AttributeName AttributeType AttributeDefault>

ElementName is the name of the element to which the attribute information applies. Only one ElementName is used, but because any number of attributes may be applied to it, multiple attribute definitions can occur in an ATTLIST declaration. You can give each attribute a name, declare the type of value that may be associated with it, and whether the attribute must be present or can be implied from a default value supplied in the declaration. You can also indicate the attribute as a flag that should not include a value at all. The grammar for indicating the attribute's type is a little complicated, but lets you specify whether the attribute value is, for example, a character string, a unique identifier, a reference to an identifier, or one of an enumerated list of values.
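The following sketch shows one required attribute and one attribute with a default, using Python's expat bindings. The PRIORITY attribute is invented for this example and is not part of CDF. The point to notice is that the parser fills in the default value even though the document instance never mentions it.

```python
import xml.parsers.expat

# PRIORITY is a hypothetical attribute used only to show a default value.
DOC = b"""<!DOCTYPE ITEM [
  <!ELEMENT ITEM EMPTY>
  <!ATTLIST ITEM
      HREF     CDATA              #REQUIRED
      PRIORITY (low|normal|high)  "normal">
]>
<ITEM HREF="news.htm"/>"""

declared = []
item_attributes = {}

def on_attlist_decl(element, attribute, attr_type, default, required):
    # Called once per attribute definition in the ATTLIST declaration.
    declared.append((element, attribute, default, bool(required)))

def on_start_element(name, attributes):
    # attributes includes values supplied from declared defaults.
    item_attributes.update(attributes)

parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist_decl
parser.StartElementHandler = on_start_element
parser.Parse(DOC, True)
print(declared)
print(item_attributes)
```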

Entity declaration

You build the entities for a DTD using the entity declaration:

<!ENTITY Name Value>

Entity declarations define replacement text for an entity reference, which works like an escape sequence. The escape character used, as in HTML, is the ampersand, and the reference ends with a semicolon. Entities can come in handy when you want to standardize boilerplate types of material. For example, if Name is "MyTitle" and Value is "Master of my Destiny", all occurrences of "&MyTitle;" in character data would be replaced by "Master of my Destiny". You can also use a character reference to insert characters that cannot be typed on the keyboard of the authoring platform (you are probably familiar with character references from HTML).
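Here is the MyTitle example as a runnable sketch, assuming Python's standard ElementTree module, which expands entities declared in the internal declarations. The CARD, TITLE, and NOTE element names are invented for illustration; the character reference in NOTE inserts the copyright sign by its numeric code.

```python
import xml.etree.ElementTree as ET

# CARD, TITLE, and NOTE are hypothetical element names.
DOC = """<!DOCTYPE CARD [
  <!ENTITY MyTitle "Master of my Destiny">
]>
<CARD><TITLE>&MyTitle;</TITLE><NOTE>&#169; 1997</NOTE></CARD>"""

card = ET.fromstring(DOC)
title = card.findtext("TITLE")  # entity reference replaced by its value
note = card.findtext("NOTE")    # character reference replaced by the character
print(title, note)
```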

Document instance

After the prolog section comes the document instance, the actual character data that is marked up with the elements declared in the DTD. Just as in HTML, this part of the document consists of character data that is delimited by various start and end tags (along with the appropriate attributes) that adhere to the order and structure of the DTD.

Now you might ask what XML markup means. XML enables you to define (in a standard way) the names of elements and how they are ordered and structured, but there is nothing in the standard about what the elements themselves mean. That is, we read <AUTHOR> and infer a person who wrote something, whereas XML only interprets <AUTHOR> in the context of how it's declared in the DTD. The meaning of <AUTHOR> is established in the documentation that people write, read, and interpret.

What Can You Do with XML?

So where does the rubber meet the road when it comes to actually doing something productive with XML? XML's current usability is somewhat limited by its "newness"; it simply hasn't been around long enough. Even though the XML W3C Working Group is moving very quickly, new standards take some time to be defined before they're officially announced as a standard and broadly adopted. However, as you might have already surmised, XML is being used in a number of ways today.

XML markup provides a way to identify, exchange, and process any kind of data. The world has many terabytes of information distributed across a vast sea of incompatible information repositories. XML provides the mechanism for exchanging chunks of data among these repositories in a mutually understood fashion. Because DTDs describe distinct collections of XML markup for different data sets, XML markup isn't a single conglomeration of elements where every new element adds to an ever-expanding list. Instead, DTDs enable the creation of an infinite number of distinct vocabularies.

XML Vocabularies

An XML vocabulary is defined by a specific DTD. Think of an XML vocabulary as the set of elements (words) and the rules for valid constructions of those elements (grammar) as defined by a particular DTD. The idea is that an inventor applies XML's declarative syntax to construct a DTD, and other folks make use of that DTD to create the indicated type of XML document (or record). Such a DTD has traditionally been called an XML application; for example, a DTD to specify drug allergies to include in medical records would be an XML application. But the XML application terminology is confusing because most of the online world thinks of an application as some type of computer program. Hence folks are starting to use the term XML vocabulary instead.

For example, when Microsoft created CDF, we invented a new XML vocabulary. Because the CDF vocabulary was created using a standards-based markup language, other people can easily use it. They will know how to construct and parse documents using the vocabulary, and they can read the documentation to interpret the meaning of the elements and their attributes. But the vocabulary itself is not a standard simply by having been defined in XML. It is usually advantageous for folks to agree on file formats, but with XML you don't have to. That is, no official standards body automatically states that the vocabulary embodies the definitive way of describing a particular data set. In essence, XML levels the playing field by providing a standards-based way to create file formats for any kind of data. If you find this subject particularly interesting, you may want to pursue the debate about XML namespaces on the searchable xml-dev mailing list archive (http://www.lists.ic.ac.uk/hypermail/xml-dev/), and read about Open Software Description (OSD) extensibility using XML namespaces in Note 6, "OSD Extensibility using XML Namespaces," submitted to the W3C and available at the W3C Web site (http://www.w3.org/TR/NOTE-OSD.html#6).

Let's take a closer look at two XML vocabularies proposed by Microsoft that are being used today:

Because SGML came first, many data formats are currently based on SGML. For example, the banking industry has standardized an Open Financial Exchange (OFX) (see details at the Microsoft Financial Services Web site [http://www.microsoft.com/finserv/ofxdnld.htm]) for exchanging financial data and instructions among financial institutions and client software. OFX enables banks to expose financial data in a single format that, as indicated in the "Online Banking & Brokerage News" press release (http://www.microsoft.com/finserv/news.htm), is supported by many companies, including CheckFree, Intuit, and Microsoft. OFX is a great vocabulary for developers who write software that manages electronic financial transactions, because they have to learn just one file format. The mathematical community has also created an SGML vocabulary called Mathematical Markup Language (MML) (see the "Mathematical Markup Language W3C Working Draft" at the W3C Web site [http://www.w3.org/pub/WWW/TR/WD-math/]), and chemistry professionals have created the Chemical Markup Language (CML) (see "An Introduction to Structured Documents," at the Venus Internet Web site [http://www.venus.co.uk/omf/cml/doc/tutorial/xml.html]). Authors for many of the existing SGML vocabularies are now in the process of creating XML versions of their DTDs so their markup can be used by XML-aware browsers and development tools.

In the coming months, you can expect to see a multitude of new vocabularies appear for broad areas such as searching, filtering, and electronic commerce.

XML Development Tools

With the promulgation of the XML standard, tool developers are now able to develop broad-based tools that span multiple data communities. Without a standard, it's the old chicken-or-the-egg problem—software vendors don't want to spend a ton of money developing tools for a niche market, but it is very difficult for a market to become established without great tools and broad support. Standards encourage a proliferation of tools by dramatically expanding the pool of potential users—in this way, acceptable standards create a call to action.

So what kinds of tools do you need to develop in XML? As with the Web, XML tools fall into two main categories: tools for programmers and tools for authors. The programming tools generally take the form of visualization tools and software code libraries that can be used to create and manipulate XML content. Generally, the software libraries come first. For example, Internet Explorer 4.0 includes an XML object model (that can be used from C, Java, or script) that tools developers can use as a foundation for creating their high-level XML visualization tools, or other XML development tools.

Parsing XML

The tool that jump-starts all XML software development is an XML parser. That is because every XML application relies on a parser to process an XML document. Parsers take the form of a code library that exposes software interfaces to developers using higher-level languages such as C++ or Java. Using these interfaces, developers can access the structure of an XML document, enumerate its elements and their attributes, and play with stuff in the document prolog. A simple example would be an XML parser utility that checks for well-formed or valid documents, and serves as the XML equivalent of an HTML syntax (lint) checker.
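To give a feel for what "enumerate its elements and their attributes" means, here is a minimal sketch. It uses Python's standard ElementTree library as a stand-in, not the Internet Explorer 4.0 object model or MSXML, whose interfaces are documented in the SDK. The document and the outline helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical channel fragment; the file names are placeholders.
DOC = """<CHANNEL HREF="index.htm">
  <TITLE>Example Channel</TITLE>
  <ITEM HREF="news.htm" LASTMOD="1997-10-31T00:00">
    <TITLE>News</TITLE>
  </ITEM>
</CHANNEL>"""

def outline(element, depth=0):
    """Yield (depth, tag, attributes) for element and all its descendants."""
    yield depth, element.tag, dict(element.attrib)
    for child in element:
        yield from outline(child, depth + 1)

rows = list(outline(ET.fromstring(DOC)))
for depth, tag, attributes in rows:
    print("  " * depth + tag, attributes)
```

The parser does the hard part, turning a stream of characters into a tree; the code that walks that tree is only a few lines.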

Every XML development tool has an XML parser at its core, and the parsers are in turn based on some notion of an object model for an XML document. Currently, a group that includes Microsoft is working with the W3C to develop an XML object model standard. Other informal efforts are under way as well. For example, the xml-dev mailing list (the archive is available from the E-mail Lists through the Imperial College Web site at http://www.lists.ic.ac.uk/hypermail/xml-dev/) is working to define a Java-based application programming interface (API) called XAPI-J. John Tigue, an independent developer, has been maintaining a Web page for this effort at the Datachannel Web site (http://www.datachannel.com/xml/dev/). You can also download the Microsoft XML parser in Java (MSXML) from the Extensible Markup Language (XML) section of the Microsoft Site Builder Network Web site (http://www.microsoft.com/xml/parser/xmlparse.htm), and the Internet Client SDK includes the article "XML Object Model," on the XML object model used in Internet Explorer 4.0 for manipulating CDF and OSD files (MSDN Library, Internet Client SDK; or see the MSDN Web site at http://www.microsoft.com/msdn/sdk/inetsdk/help/itt/xml/xmlobj.htm#bk_xml). There are also articles about the Microsoft CDF Generator tool for authoring CDF files; see the section, "Delivering Content for the Web, Tools," in the Internet Client SDK (MSDN Library, Internet Client SDK; or see the MSDN Web site at http://www.microsoft.com/msdn/sdk/inetsdk/help/cdfgen/cdfgen.htm). If you install the Internet Client SDK you'll find CDF and OSD lint tools in the INetSDK/bin/cdftest folder, and an XML lint utility that will work with any DTD in the INetSDK/bin/xmllint folder. (If you use a lint checker, you'll avoid most of the problems in creating XML markup that novices run into.)

Authoring XML

Equipped with the ability to parse XML documents, programmers can start building high-level tools that enable authors (and users) to create, edit, browse, and search XML documents. These tools range from general-purpose editors conversant in any XML vocabulary to vocabulary-specific applications.

Because they have been around a lot longer, authoring tools that support SGML (see the summary page at the Summer Institute of Linguistics Web site, http://www.sil.org/sgml/gen-apps.html) are more plentiful than those currently available for XML. In fact, as far as I am aware, high-level Web-authoring tools have just begun to incorporate XML support (check out DynaBase, at the Inso Corporation Web site, http://www.inso.com/frames/consumer/db/index.htm, and ADEPT, at the ArborText Web site, http://www.arbortext.com/70rlease2.html for two examples). Given the nascent nature of XML, this shouldn't be too surprising. It makes sense that authoring tools first exploit specific, existing XML vocabularies (such as CDF and OSD). As more and more XML vocabularies are developed, it will become economically feasible for tool vendors to expand their offerings to support any XML DTD. Given the current excitement and broad industry support for XML (as shown by the diversity of participants in the W3C XML Working Group), I am confident that XML will quickly become as ubiquitous as HTML.

There are already a number of vertical tools available for working with CDF. The Microsoft Channel Wizard (http://www.microsoft.com/workshop/prog/ie4/cdfwiz/) walks you through the steps of building a channel (no understanding of CDF syntax, or any other technical details, is necessary). More advanced tools for generating CDF include the Microsoft CDF Generator discussed earlier, the Cold Fusion CDF Wizard (at the Cold Fusion Web site, http://www.coldfusion.com/), Bluestone Software's Sapphire/Web 4.0 (at the Bluestone Software Web site, http://www.bluestone.com/), and iNet Developer 3.0 by Pictorius (at the Pictorius Web site, http://www.pictorius.com/). Another offering, Microsoft Internet Information Server version 4.0 (http://www.microsoft.com/iis/default.asp), includes support for dynamically generating CDF code using server-side scripting. As a result, when browsers retrieve files with the .CDX extension, the default MIME type returned by the Web server will be the CDF MIME type (instead of HTML). And naturally the Microsoft FrontPage® 98 Beta (http://www.microsoft.com/frontpage/) also supports CDF. True to form, many other Internet-related Microsoft products are busily incorporating CDF capabilities.

In the future, many application categories (such as databases, messaging, collaboration, and productivity applications) will incorporate support for other XML vocabularies as they are defined. This will enable interoperability on the common data types used within an application category, as well as across application categories. For example, address information in a customer database can be easily shared with a Personal Information Manager (PIM) application or an e-mail client.

Presenting and transforming XML

Capabilities for specifying the formatting of XML will be implemented as more advanced features of tools. The separation of content from presentation is a core design principle for XML. Because XML completely separates the notion of the markup from its intended visual presentation, authors can embed in structured data procedural descriptions of how to produce different data "views." This is an incredibly powerful mechanism for offloading as much user interaction as possible to the client computer, and also serves to reduce server traffic and to speed browser response times.

The ability to specify how XML should be visually presented to the user addresses several author needs. First, authors have to be able to specify how XML data should look when it is presented in, say, a browser. Second, authors need to be able to specify alternate structures for XML data, that is, different ways the markup "tree" might be organized, depending on who the viewer is, or what they want to look at. Finally, there is a general need to translate between competing or overlapping XML vocabularies, and even non-XML proprietary file formats. For example, you might look to word-processing applications to read and write popular XML formats, or you might expect editors to provide a means of associating HTML tags with XML markup, and to embed XML into HTML.
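As a rough sketch of what "associating HTML tags with XML markup" could mean, the Python below renames each element of a small, made-up vocabulary to an HTML tag through a lookup table. This is only a stand-in to show the idea, not a style language: the CATALOG vocabulary, the TAG_MAP table, and the to_html helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from vocabulary elements to HTML tags.
TAG_MAP = {"CATALOG": "ul", "BOOK": "li", "AUTHOR": "b", "NAME": "i"}

def to_html(element):
    """Return a copy of element with each tag renamed through TAG_MAP."""
    html = ET.Element(TAG_MAP.get(element.tag, "span"))
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(to_html(child))
    return html

catalog = ET.fromstring(
    "<CATALOG><BOOK><NAME>XML Basics</NAME> by "
    "<AUTHOR>A. Writer</AUTHOR></BOOK></CATALOG>"
)
html_text = ET.tostring(to_html(catalog), encoding="unicode")
print(html_text)
```

The data stays the same; only the table decides how it looks. Swap in a different table and the same catalog gets a different presentation, which is the separation of content from presentation at work.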

For this to happen, a standard style language for XML needs to be recommended, or at least proposed. As you may already know, recommending a standard style-sheet language for XML is one of three deliverables specified by the W3C's XML Activity Statement (see "SGML, XML, and Structured Document Interchange" at the W3C Web site, http://www.w3.org/XML/Activity.html). In August 1997, Microsoft and others made a proposal for Extensible Style Language (XSL) to the W3C (see the W3C Web site, http://www.w3.org/Submission/1997/13/Overview.html). The XSL proposal is also announced on the Specs & Standards page of the Microsoft Site Builder Network Web site (http://www.microsoft.com/standards/default.asp). XSL is derived from an existing international standard complementing SGML: the Document Style Semantics and Specification Language (DSSSL). So, just as XML is a subset of SGML to support Web-based information, the new style language for XML is a subset of DSSSL (I found this nice description of DSSSL in the article "An Introduction to DSSSL," [http://itrc.uwaterloo.ca/~papresco/dsssl/tutorial.html] by Paul Prescod). XSL is a great fit with XML because it's compatible with Cascading Style Sheets (CSS) and the script languages that many Web authors already know. So, if you're familiar with enhancing HTML presentation using CSS and script, you're smiling now.

Internet Explorer 4.0 includes a built-in generalized XML parser. This means you can write XML vocabularies today that can be utilized in Microsoft's shipping Internet browser. I think general support for displaying XML data directly in the browser will be the catalyst for an explosion of new XML development. This kind of capability leverages the Web development expertise available today, and enables new XML vocabularies to be developed by anybody. And it will not require people to change the way they use the Web! You can get more information about XML support in Internet Explorer 4.0 in the Internet Client SDK (see the MSDN Library, SDK Documentation, Internet Client SDK; or see the MSDN Web site at http://www.microsoft.com/msdn/sdk/inetsdk/help/default.htm). In particular, see the XML data source object and the topic "Data Binding" (MSDN Library, Internet Client SDK; or see the MSDN Web site at http://www.microsoft.com/msdn/sdk/inetsdk/help/dhtml/dhtml.htm#sec_data), and the topic "XML Object Model" (MSDN Library, Internet Client SDK; or see the MSDN Web site at http://www.microsoft.com/msdn/sdk/inetsdk/help/itt/xml/xmlobj.htm#bk_xml).

Summary

HTML revolutionized electronic document distribution and popularized a whole new information arena—the Internet—far faster than most people predicted. It certainly doesn't take a visionary to predict that the Web will continue to change our everyday lives. XML has started the journey toward a world where every conceivable category of information has an XML format that everybody can use and understand. After all, information is most useful when it is easy for everybody to access. And it all got started because the world was able to agree on a standard way to create new data types.

For More Information

If you want more background information that will help you understand XML concepts and how XML came about, start with Robin Cover's "SGML/XML Web Page" (http://www.sil.org/sgml).

If you want to get more information about the current W3C XML activity, check out the "SGML, XML, and Structured Document Interchange" (at the W3C Web site, http://www.w3.org/XML/Activity.html). Or, start with the W3C XML home page (http://www.w3.org/XML/) and read the draft proposals yourself!

If you really want to start getting your fingernails dirty, you can read through the searchable xml-dev mailing list archive (http://www.lists.ic.ac.uk/hypermail/xml-dev/). Although discussions of the xml-wg (the current W3C-appointed decision-making body) and xml-sig (a group of experts who offer advice to xml-wg) mailing lists are confidential to W3C member organizations (and invited experts), anybody can join the xml-dev mailing list to gain perspective on the types of problems that are being addressed by all kinds of XML developers. You'll also find posts on xml-dev regarding various XML tools and technology demos released to the public. If you go down that path, be sure to check out Peter Murray-Rust's XML-DEV Jewels page (http://www.vsms.nottingham.ac.uk/vsms/xml/jewels.html).

If you would like to find out more about what we are doing with XML at Microsoft, check out our XML home page at the Microsoft Site Builder Network Web site (http://www.microsoft.com/xml/). Also, stay tuned to the Site Builder Network because we will be posting more information about how you can use XML today with Microsoft Internet Explorer version 4.0.

This stuff is so cool I can start to imagine the day when the Web will actually reduce information overload instead of adding to it!