Dr. GUI Does Data with XML

Dr. GUI

October 20, 1997

Last week the good doctor explained how to modify your ActiveX controls so they'll run great under Internet Explorer 4.0. This week, Dr. GUI would like to tackle something a bit more theoretical and future-oriented: XML.

If you're like the doc, you've been reading about XML for a couple of months now and are thoroughly confused about what it might be. Is it a replacement for HTML? A format for indexing Web sites? A way to specify push channels on the Web? A floor polish? A dessert topping? Or all of these?

Simply put, XML is a language that allows you to mark up data. So it's not a floor polish, nor a dessert topping, nor a replacement for HTML. But you can use XML to do the other things, and more.

While HTML and Dynamic HTML are getting better and better at describing layouts and defining user interfaces, there has not been a standard way to describe structured data on the Web until XML, formally known as eXtensible Markup Language. Although HTML and XML look fairly similar to one another, there is one crucial difference: while HTML is intended to describe how the document should look, XML describes the relationships among the data. For instance, tags in HTML control positioning, layout, fonts, and color. Tags in XML represent the type of the data enclosed in them. So a "typical" HTML tag might be <B> for bold, while a "typical" XML tag might be <CITY> to indicate that the tagged text is the name of a city.

Dr. GUI knows you're probably asking, "So if you can't control layout with it, why is XML interesting?"

The Network Today vs. the Network with XML (in Internet Explorer 4.0)

Today on the Internet (and other networks), there is a standard way of formatting presentation and UI (HTML) and for clients to send small amounts of data to a server (URLs), but no general standard format for data, especially not for rich data. XML provides a standard format language for rich data that complements HTML formatting for presentation and UI. This format can be used between any pair of computers, not just between server and browser. So, for instance, once a standard <APPOINTMENT> or <INVOICE> is defined in XML, any computer can understand a standard data record generated by any other, giving your data not only cross-platform compatibility but also cross-application compatibility. Dr. GUI is especially excited about exchanging patient records among his various medical practice management programs using XML.

Data and the XML data format

What are the characteristics of data that is moved from site to site or from server to client? First, let's assume that data has been "assembled" into complex structures such as a P.O. or a Bug Report or a Portfolio. This assembly is expensive and is typically done by middle-tier servers. In other words, we don't want to re-do this assembly at each stop the data makes. Next, any data format needs to support extensible annotation for common agreement, personalization, and search support in a Web community.

XML is an ideal standard for meeting these requirements.

So What Is XML?

XML, then, is just a standard for encoding data. Like HTML, XML is based on SGML. But XML is far simpler than SGML; as a result, you can build an XML parser in days, not months (whew!). Syntactically, it's like HTML, but with two key exceptions:

The set of tags is unlimited.

Containing tags may not overlap each other.

Here's an example of legal XML. Note that each tag is completely contained within its container; none of the tags overlap. This is technically a requirement of HTML as well, but since in practice most HTML is written with overlapping tags, no browser parsers enforce this requirement.

<Person>
<Name>Adam Bosworth</Name> <Title>General Manager</Title> <Age>42</Age>
</Person>
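To make the point concrete, here is a minimal sketch that reads the record above with Python's standard xml.etree.ElementTree module. The choice of Python and the helper name read_person are assumptions made for illustration; the column's own tools are the Java and C++ parsers described later.

```python
import xml.etree.ElementTree as ET

# The sample record from the text, held in a string.
PERSON_XML = """
<Person>
<Name>Adam Bosworth</Name> <Title>General Manager</Title> <Age>42</Age>
</Person>
"""


def read_person(xml_text):
    """Parse a <Person> record and return its fields as a dict (hypothetical helper)."""
    person = ET.fromstring(xml_text.strip())
    # Each child tag names the type of the data it encloses.
    return {child.tag: child.text for child in person}


person = read_person(PERSON_XML)
print(person["Name"], "-", person["Title"], "-", int(person["Age"]))
```

Because the tags say what the data is rather than how it looks, the receiving program can pick fields out by name without any knowledge of layout.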

XML can intermix tags and text:

<Person><Name>Adam Bosworth</Name> is an <role>advocate</role> for <technology>XML</technology>

</Person>
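As a rough illustration of mixed content, the following Python sketch (standard library only; the variable names are assumptions) recovers both the plain sentence and the tagged values from the example above.

```python
import xml.etree.ElementTree as ET

# The mixed-content sample from the text.
MIXED_XML = (
    "<Person><Name>Adam Bosworth</Name> is an <role>advocate</role> "
    "for <technology>XML</technology></Person>"
)

person = ET.fromstring(MIXED_XML)

# itertext() walks the element in document order and yields every piece of
# text, whether it sits inside a child tag or between two child tags.
sentence = "".join(person.itertext())

# The tagged pieces are still available by name.
tagged = {child.tag: child.text for child in person}

print(sentence)
print(tagged)
```

The same document therefore serves a human reader (the sentence) and a program (the tagged values).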


Tags can't overlap. The following is illegal in XML because the <Person> and <KeyPoint> tags overlap:

<Person><Name>Adam</Name>
<KeyPoint><Heading>XML provides a data bus</Heading> </Person><More> . . . </More></KeyPoint>
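A parser enforces this rule for you. The sketch below assumes Python's standard xml.etree.ElementTree parser and a hypothetical helper named is_well_formed. It feeds the parser a properly nested document and an overlapping one modeled on the example above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# <KeyPoint> opens and closes entirely inside <Person>.
NESTED = (
    "<Person><Name>Adam</Name>"
    "<KeyPoint><Heading>XML provides a data bus</Heading></KeyPoint>"
    "</Person>"
)

# <Person> closes while <KeyPoint> is still open, so the two overlap.
OVERLAPPING = (
    "<Person><Name>Adam</Name>"
    "<KeyPoint><Heading>XML provides a data bus</Heading>"
    "</Person></KeyPoint>"
)

print(is_well_formed(NESTED))
print(is_well_formed(OVERLAPPING))
```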

Tags may be ended in one of three ways. As in HTML, <TAG> is ended by </TAG>. Since XML is strict about proper nesting, you can also end the innermost tag with </>, which is much simpler. Finally, <TAG></> can be shortened into <TAG/>. This is especially handy when the primary purpose of the tag is to set attributes, not enclose text or other tags, as in <NAME VALUE="Adam" />.
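The sketch below, which assumes Python's standard parser, checks that the long form and the <TAG/> shorthand describe the same element. That parser follows the XML 1.0 syntax, which does not accept the bare </> form, so only the first and third forms appear here. The describe helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same attribute-only element, written with an explicit end tag
# and with the empty-element shorthand.
long_form = ET.fromstring('<NAME VALUE="Adam"></NAME>')
short_form = ET.fromstring('<NAME VALUE="Adam" />')


def describe(element):
    """Summarize an element as (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))


print(describe(long_form))
print(describe(short_form))
```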

XML is normally encoded in Unicode or UTF-8. But you can encode it in any character set you like if you include a first line specifying the character encoding. For instance:

   <?XML Encoding="Windows-1250"?>

would support Eastern Europe.

If no such declaration is encountered, the parser assumes UTF-8, but will recognize Unicode data if it's preceded by a Unicode byte-order mark.
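Here is a small sketch of all three cases, assuming Python's standard parser. Note one assumption loudly: that parser expects the declaration spelled <?xml version="1.0" encoding="..."?>, which differs from the draft-era spelling shown above. The sample city name is also made up for the illustration.

```python
import xml.etree.ElementTree as ET

TEXT = "Łódź"  # sample Eastern European text (illustrative)

# Case 1: a legacy code page, named in the declaration on the first line.
declared = (
    '<?xml version="1.0" encoding="windows-1250"?><City>%s</City>' % TEXT
).encode("windows-1250")

# Case 2: no declaration at all, so the parser assumes UTF-8.
assumed_utf8 = ("<City>%s</City>" % TEXT).encode("utf-8")

# Case 3: no declaration, but Python's "utf-16" codec writes a byte-order
# mark first, which lets the parser recognize 16-bit Unicode.
with_bom = ("<City>%s</City>" % TEXT).encode("utf-16")

cities = [ET.fromstring(data).text for data in (declared, assumed_utf8, with_bom)]
print(cities)
```

All three byte streams differ on the wire, yet the application sees the same text once the parser has decoded them.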

Namespaces

As of the last submission to the W3C, the semantics of tags can be qualified uniquely by using a namespace.

<?XML::Namespace href = "http://ofs/PO.dtd" as = "po"?>
<Order><ShipTo>Adam</ShipTo> <Amount>100</Amount>
<Items> <Item><Qty>6</Qty><Prod>E13</Prod></>
<Item><Qty>9</Qty><Prod>J14</Prod></> </Items>
</Order>

Tags from multiple namespaces may be mixed. This is essential when data defined by one grammar is annotated with tags from another.

Example

<?XML::Namespace href = "http://acme/stocks.dtd" as = "acme"?>
<?XML::Namespace href = "http://www/types.dtd" as = "types"?>
<acme::Stock> <Ticker>MSFT</Ticker><Price>
<types::Double>110.5</types::Double></Price></acme::Stock>
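To experiment with mixed namespaces in running code, here is a hedged sketch using Python's standard parser. It is an adaptation, not the syntax above: that parser understands the xmlns:prefix attribute form with a single colon rather than the processing-instruction form in this column, so the stock example has been respelled accordingly. The NS prefix map is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# The stock example, respelled with xmlns attributes (an adaptation).
STOCK_XML = """
<acme:Stock xmlns:acme="http://acme/stocks.dtd"
            xmlns:types="http://www/types.dtd">
  <Ticker>MSFT</Ticker>
  <Price><types:Double>110.5</types:Double></Price>
</acme:Stock>
"""

# Prefix-to-namespace map used when searching the tree.
NS = {"acme": "http://acme/stocks.dtd", "types": "http://www/types.dtd"}

stock = ET.fromstring(STOCK_XML.strip())

# The parser expands each prefix into its full namespace name.
print(stock.tag)

ticker = stock.find("Ticker", NS).text
price = float(stock.find("Price/types:Double", NS).text)
print(ticker, price)
```

Because each prefix expands to a unique name, a <Double> tag from the types grammar cannot be confused with a <Double> tag that some other grammar might define.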

XML and validation

XML supports validating data formats by specifying a grammar for the entity. There are two ways to define rules for the document: DTDs and schemas.

A DTD defines a grammar for the tags and attributes. Microsoft will support this syntax but considers it deprecated. It uses a special grammar that is not itself XML and looks like the following:

<!doctype RootElement System "URL"[]>

or, for an internal DTD:

<!doctype RootElement [
<!ELEMENT author (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT book (title?, author+)>
]>

A schema is a much richer and more extensible way to describe the rules for the content of a document. It uses XML itself as a grammar. A schema is defined using a particular XML syntax, as follows:

<elementType id="author"> <cdata/>
</elementType>
<elementType id="title"> <cdata/>
</elementType>
<elementType id="book">
<elt href="#title" occurs="OPTIONAL"/>
<elt href="#author" occurs="ONEORMORE"/>
</elementType>
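Both notations state the same rule for book: an optional title followed by one or more authors. Python's standard library does not validate against a DTD or a schema, so the sketch below is only a hand-written check of that one rule, with a hypothetical helper book_is_valid. It does not verify that title and author contain only character data.

```python
import re
import xml.etree.ElementTree as ET

# Hand-coded equivalent of the content model (title?, author+):
# an optional "title," followed by one or more "author," entries.
BOOK_MODEL = re.compile(r"(title,)?(author,)+")


def book_is_valid(xml_text):
    """Check the children of <book> against the content model above."""
    book = ET.fromstring(xml_text)
    if book.tag != "book":
        return False
    # Flatten the child tags into a string such as "title,author,author,".
    sequence = "".join(child.tag + "," for child in book)
    return BOOK_MODEL.fullmatch(sequence) is not None


print(book_is_valid("<book><title>T</title><author>A</author><author>B</author></book>"))
print(book_is_valid("<book><author>A</author></book>"))
print(book_is_valid("<book><title>T</title></book>"))
print(book_is_valid("<book><author>A</author><title>T</title></book>"))
```

A validating parser does this work generically from the grammar, which is why a receiver can reject malformed records before any application code runs.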

XML and Internet Explorer 4.0

XML can be treated as a data source in Internet Explorer 4.0. To do this, first include an applet or ActiveX control that functions as a data source object:

<Applet Class=com.ms.xml.dso.xmldso.class id="mystocks"> <Param name="url" value="http://..."> </Applet>

Then bind to this like any other DSO:

<TABLE DataSrc="#mystocks">
<TR> <TD><SPAN DataFld="Ticker"></SPAN></TD>
<TD><SPAN DataFld="Price"></SPAN></TD></TR>
</TABLE>

XML data also can be provided inline to any applet or object. Example:

<Applet …> <?XML version="1.0"?>
<Stocks>
<Stock><Price>10</Price>   <Ticker>MSFT</Ticker></Stock>
<Stock><Ticker>ORCL</Ticker>   <Price>?</Price></Stock>
</Stocks>
</Applet>
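To get a feel for what a table bound to this data ends up displaying, here is a rough Python sketch that repeats one row per <Stock> record and fills the cells by field name. It only imitates the effect; it is not how the Internet Explorer data source object works internally, and bind_rows is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from html import escape

# The inline stock data from the example above.
STOCKS_XML = """<Stocks>
<Stock><Price>10</Price>   <Ticker>MSFT</Ticker></Stock>
<Stock><Ticker>ORCL</Ticker>   <Price>?</Price></Stock>
</Stocks>"""


def bind_rows(xml_text, fields):
    """Produce one HTML table row per record, one cell per requested field."""
    rows = []
    for record in ET.fromstring(xml_text):
        # Fields are looked up by tag name, so their order inside the
        # record does not matter.
        cells = "".join(
            "<TD>%s</TD>" % escape(record.findtext(field, default=""))
            for field in fields
        )
        rows.append("<TR>%s</TR>" % cells)
    return rows


rows = bind_rows(STOCKS_XML, ["Ticker", "Price"])
print("\n".join(rows))
```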

You would use a set of Java classes to access the tree.

Finally, you can use the lightweight C++ parser object from JavaScript:

Document = new ActiveXObject("msxml");
Document.url = "http://…";
tree = Document.getRoot();
...

XML API

The W3C Document Object Model (DOM) defines the standard object model (in other words, the API) for XML data.

Microsoft has shipped an interim parser on the Web that allows you to load the data into your own data structures or directly use the DOM API as it was when Microsoft cut the code. Microsoft will track the object model as it progresses at http://www.microsoft.com/standards/xml.

Current object model

The root object is the Document. It contains a tree of element objects.

Document.getRoot()

Returns the tree of element objects

Document.load(url)

Points to a new XML document

Document.createElement(type,tagname)

Returns an element of the appropriate type and tagname

Element.getChildren()

Returns the children of that element as a collection

Element.getParent()

Returns the parent element

Element.getTagName()

Returns the tag type (for example, calling it on <foo>7</foo> returns "foo")

Element.getText()

Returns the combined unmarked-up text of all children

Element.getType()

Returns the category (Element, Comment, PI, or Text)

Element.addChild(newchildelement, childbefore)

Warning   This has changed in the DOM to Element.append(newchildelement), Element.prepend(newchildelement), Element.insert(newchildelement,index).

Element.removeChild(childtoberemoved)

Returns the removed element

Warning   This has been renamed in the DOM to Element.remove(childtoberemoved).

Element.setAttribute(name,value)

Sets the value of the attribute as a string or list of strings (the DTD can specify that the attribute value is a list of tokens)

Element.getAttribute(name)

Returns the value of the attribute as a string or list of strings (the DTD can specify that the attribute value is a list of tokens)

Element.removeAttribute(name)

Removes the specified attribute

Element.getAttributes()

Returns all the attributes in a collection (only works in Java currently)
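If you don't have the interim parser handy, you can get a feel for this style of tree API with the sketch below, which uses Python's standard xml.etree.ElementTree module. The pairing of each call with a method from the list above is an approximation made for illustration, not an official mapping, and the sample document is made up. ElementTree has no direct counterpart to Element.getParent(), so that method is left out.

```python
import xml.etree.ElementTree as ET

# Roughly Document.load(url) followed by Document.getRoot(); here the
# document comes from a string rather than a URL.
root = ET.fromstring("<foo><bar>7</bar><baz>8</baz></foo>")

children = list(root)                # roughly Element.getChildren()
tag_name = root.tag                  # roughly Element.getTagName()
text = "".join(root.itertext())      # roughly Element.getText()

new_child = ET.Element("qux")        # roughly Document.createElement(type, tagname)
root.insert(1, new_child)            # roughly Element.addChild(newchildelement, childbefore)

new_child.set("unit", "px")          # roughly Element.setAttribute(name, value)
unit = new_child.get("unit")         # roughly Element.getAttribute(name)
del new_child.attrib["unit"]         # roughly Element.removeAttribute(name)

root.remove(children[0])             # roughly Element.removeChild(childtoberemoved)
remaining = [child.tag for child in root]

print(tag_name, text, unit, remaining)
```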

What's Next?

The good doctor notes that XML is very much a work in progress. Microsoft is working with the W3C XML committee to standardize extensions to support standard namespaces for data and types, updates and synchronization, schemas, and queries. But even though it's changing, you can start getting experience with it now—you can, for instance, format data to be moved around the Web as XML by using the Java source Microsoft has provided free on the Web and provide us with feedback and bug reports. (Microsoft will be working on a C++ implementation of this parser as well.)

Dr. GUI would like to thank Adam Bosworth for giving the PDC talk on which this column is based.