Specification for XML-Data

Last updated: June 26, 1997

Authors: Andrew Layman, Microsoft Corporation
Jean Paoli, Microsoft Corporation
Steve De Rose, Inso Corporation
Henry S. Thompson, University of Edinburgh
Acknowledgements: We thank Paul Grosso (ArborText), Sharon Adler (Inso Corporation), Anders Berglund (Inso Corporation), François Chahuneau (AIS/Berger-Levrault), and Edward Jung (Microsoft) for their help and contributions to this proposal.

Contents

Abstract
1. Introduction
2. Examples of XML-Data
    2.1. Data
    2.2. Data about Other Data
    2.3. PICS-NG Labels
    2.4. Digital Signatures, Security, and Authentication
    2.5. Database Information
    2.6. Graph Structures
    2.7. Discontiguous Information (propertyOf)
    2.8. Schema
    2.9. Type Extension
    2.10. Schema Extension
3. XML-Data Schema
    3.1. schema
    3.2. elementType
    3.3. Relations
    3.4. Attributes
    3.5. intEntityDcl and extEntityDcl
    3.6. extDcls
    3.7. Type Extension
    3.8. Lexical Data Types
    3.9. Basic Semantic Data Types
4. Standard Vocabulary
5. Relations to Other Proposed Standards
6. Conclusion
Appendix A - The XML DTD for a Schema

Abstract

This document provides an initial proposal for a specification (XML-Data) for exchanging structured and networked data on the Web. This specification uses XML, the Extensible Markup Language, for describing data, as well as data about data. We expect XML-Data to be useful for a wide range of applications, such as describing database transfers, digital signatures, or remotely-located Web resources.

1. Introduction

The Internet holds the potential to integrate all information in a global network (with many private but integrated domains). It promises access to information any time and, with wireless technology, anywhere. Today, however, the Internet is merely an access medium to text and pictures. To actualize the Internet's potential, we need to add intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access standard, and must set an information understanding standard, which means: a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity.

XML is an important step in this direction. It offers a standard syntax for textual structure of tagged data, based on extensive industry and theoretical experience. Its lexical format easily depicts a tree structure. A tree is a natural format that is richer than a simple flat list, yet (compared to a generalized graph) also respectful of cognitive and data processing requirements for economy and simplicity.

Looking at this point in more detail, there are several ways of structuring data. One is a flat tagging system. In this system, sets of keywords are applied to data elements. This is a simple form of data structure, but it does not capture any relationships between the keywords.

A more advanced means of structuring information is a tree. A tree allows expression of subsumption, containment, or any other single (contextual) relationship such as "manages." Trees correspond to object-oriented class hierarchies, file system hierarchies, organizational hierarchies, and so forth. Trees are relatively easy to understand and to construct. Trees are efficient to process, and there is a linear (e.g., textual) structure that a program can parse incrementally and determine when it is finished. This makes trees particularly useful as a transmission format for asynchronous, distributed systems such as the Internet, and also for display purposes where the single relationship (usually visual containment) enables incremental display.

A still more elaborate structure is a directed graph. A graph allows expression of arbitrary binary relationships, that is, many relationships between two things. A graph can express subsumption, containment, and any number of other relationships simultaneously. It is therefore a superset of a tree. This makes graphs very expressive for real-world semantics, but it also makes them harder to understand, more difficult to construct, and less efficient to process than trees. There is no efficient linear (e.g., textual) structure of a graph that can be incrementally processed. Therefore, while they are particularly useful for representing (and instrumenting) the complete semantics of a system, they are typically not suitable for transmission, display, or immediate processing.

The tree structure has proved broadly implementable and easy to deploy, not just in theory but also widely in practice. Industrial implementations, in the SGML community and elsewhere, demonstrate its intrinsic quality and industrial strength, e.g., aircraft (ATA), automotive (J2008), banking (OFX), and semiconductors (Pinnacles PCIS).

This specification shows how to add a single convention to XML so that graph arcs are easily added into a lexical tree structure, without requiring decomposition of tree format into a "lowest common denominator" nodes-and-arcs structure.

XML-Data consists of a collection of related technologies. First, it unifies lexical trees with graph structures. Second, it builds on this to define a representation for schemata based on XML instance syntax. It offers a mechanism to organize element types into a hierarchy, and proposes a small set of basic types. Finally, it adds facilities for lexical typing and proposes a small collection of lexical types.

XML-Data can encode the content, semantics, and schemata for a range of cases, from simple and prosaic to complex and sophisticated:

An ordinary document
A structured record, such as an appointment record or purchase order
An object, with data and methods
A data record, such as the result set of a query
Information in a database or a Web site (e.g., CDF)
Graphical presentation (e.g., an application user interface)
Standard schema entities and types
All the links between information and people on the Web

The resulting flexibility of a single homogeneous data representation system allows any reader to uniformly determine the structural semantics of a data element. Information can then be reused for new purposes and in novel contexts. For example, a record from a database of restaurants and a record from a client contact database might be reused in the context of an appointment, say in setting a lunch date with a client. The relationships between the restaurant and contact data do not reside in the schema data described by either database individually, but are extensions defined by the instance of the appointment.

This specification, building on the earlier Web Collections in XML proposal, shows how to use a single syntax for a broad range of data, using that syntax for data and schemata, permitting the expressiveness of graph data when such power is required, but retaining the benefits of lexical trees.

2. Examples of XML-Data

2.1. Data

The following example shows a simple order from a bookstore for several books, a record, and a cup of coffee.

<ORDER>
  <SOLD-TO>
    <PERSON><LASTNAME>Layman</LASTNAME>
            <FIRSTNAME>Andrew</FIRSTNAME>
    </PERSON>
  </SOLD-TO>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD>
      <TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>
      <ARTIST>Janos</ARTIST>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>

XML-Data is flexible enough to encode heterogeneous structures, for example books, records, and coffee all within one sales order. These different kinds of items do not need to all have the same internal parts. For example, books have titles, coffee generally doesn't. XML-Data allows values to be expressed as element content (for example, the book titles shown) or as an attribute (not shown here). XML-Data can appear in separate documents or within other documents (such as HTML pages).
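
As a rough illustration of how a consumer might walk such a heterogeneous order, the following Python sketch uses the standard library's xml.etree.ElementTree. The abbreviated order text and the helper name summarize_order are assumptions made for this example; they are not part of XML-Data.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Abbreviated copy of the order above (two of the four items).
ORDER_XML = """
<ORDER>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>
"""

def summarize_order(xml_text):
    """Return (list of (kind, price) pairs, total) for an ORDER document."""
    order = ET.fromstring(xml_text)
    lines = []
    for item in order.findall("ITEM"):
        price = Decimal(item.findtext("PRICE"))
        # Whatever child is not PRICE is the thing sold; its tag is its kind.
        kinds = [child.tag for child in item if child.tag != "PRICE"]
        lines.append((kinds[0] if kinds else None, price))
    total = sum((price for _, price in lines), Decimal("0"))
    return lines, total

lines, total = summarize_order(ORDER_XML)
for kind, price in lines:
    print(kind, price)
print("total", total)
```

The loop never needs to know in advance which kinds of item exist; a BOOK and a COFFEE are handled by the same code even though their internal parts differ.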

2.2. Data about Other Data

XML-Data is suitable for complex, self-contained data structures such as the book order, and also for information such as the Channel Definition Format, which describes remotely-located Web resources, many of which are themselves data:

<CHANNEL>
  <ITEM HREF="http://www.zoosports.com/intro.htm" level="2" precache="NO">
    <A HREF="http://www.zoosports.com/page1.htm">This is a link to page 1.</A>
    <TITLE>Welcome to ZooSports!</TITLE>
    <ABSTRACT>ZooSports articles, news, and promotional offers</ABSTRACT>
  </ITEM>
  <SCHEDULE ENDDATE="1994-11-05">
    <INTERVALTIME DAY="1"/>
    <EARLIESTTIME HOUR="12"/>
    <LATESTTIME HOUR="18"/>
  </SCHEDULE>
</CHANNEL>

2.3. PICS-NG Labels

XML-Data can express PICS-NG Labels:

(This uses the Layman-Bray proposal for namespaces.)

<xml>
  <xml:schema>
    <namespaceDcl href="http://purl.org/Schemas" name="purl"/>
    <namespaceDcl href="http://www.foo.com" name="foo"/>
  </xml:schema>
  <xml:data>
    <purl:description1 href="http://purl.color.org/document.html">
      <title>Light and Dark: A study of color</title>
      <subject>
        <LCSH><for>Color and Color Palettes</for></LCSH>
      </subject>
      <author>
        <foo:author>
          <name>John Smith</name>
          <affiliation>thedarkside</affiliation>
          <email>john@thedarkside</email>
        </foo:author>
        <foo:author>
          <name>Smith, Jane Q.</name>
          <affiliation>thelightregion</affiliation>
          <email>jane@thelightregion</email>
        </foo:author>
      </author>
    </purl:description1>
  </xml:data>
</xml>

2.4. Digital Signatures, Security, and Authentication

Returning to the bookstore example, this is the same order with a digital signature added. The structured nature of XML-Data makes it easy to sign whole elements or parts of them.

<ORDER>
  <dsig:DSIG>
    <MANIFEST>80183589575795589189518915</MANIFEST>
    <SIG href="http://XYX/Joe@company.com"/>
  </dsig:DSIG>
  <SOLD-TO>
    <PERSON><LASTNAME>Layman</LASTNAME>
            <FIRSTNAME>Andrew</FIRSTNAME>
    </PERSON>
  </SOLD-TO>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD>
      <TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>
      <ARTIST>Janos</ARTIST>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>

2.5. Database Information

While XML-Data can represent complex structures, it can also represent simple ones, for example, a simple list of database records:

<BOOK-MASTER-LIST>
  <BOOK id="book1">
    <TITLE>Number, the Language of Science</TITLE>
    <AUTHOR>Dantzig, Tobias</AUTHOR>
  </BOOK>

  <BOOK id="book2">
    <TITLE>Introduction to Objectivist Epistemology</TITLE>
    <AUTHOR>Rand, Ayn</AUTHOR>
  </BOOK>

  <BOOK id="book3">
    <TITLE>I, The Jury</TITLE>
    <AUTHOR>Spillane, Mickey</AUTHOR>
  </BOOK>

  <BOOK id="book4">
    <TITLE>Half Magic</TITLE>
    <AUTHOR>Eager, Edward</AUTHOR>
  </BOOK>

  <BOOK id="book5">
    <TITLE>QED</TITLE>
    <AUTHOR>Feynman, Richard P.</AUTHOR>
  </BOOK>
</BOOK-MASTER-LIST>

2.6. Graph Structures

An XML-Data element may include links to resources outside the immediate tree. When it meets application needs, this href facility can be used to break up a single structure into multiple parts, with relations among them indicated by Universal Resource Identifier (URI) links. The references can be local or remote. In this example, they are inventory records from the database table we just looked at.

<ORDER id="order1">
   <dsig:DSIG>
     <MANIFEST>80183589575795589189518915</MANIFEST>
     <SIG href="http://XYX/Joe@company.com"/>
   </dsig:DSIG>
   <SOLD-TO>
      <PERSON><LASTNAME>Layman</LASTNAME>
              <FIRSTNAME>Andrew</FIRSTNAME>
      </PERSON>
    </SOLD-TO>
    <SOLD-ON>19970317</SOLD-ON>
    <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book1">
      <PRICE>5.95</PRICE>
    </ITEM>
    <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book2">
      <PRICE>12.95</PRICE>
    </ITEM>
    <ITEM href="http://bigbookstore.com/data/musicmaster?XML-XPTR=cd1">
      <PRICE>12.95</PRICE>
    </ITEM>
    <ITEM>
      <PRICE>1.50</PRICE>
      <COFFEE>
        <SIZE>small</SIZE>
        <STYLE>cafe macchiato</STYLE>
      </COFFEE>
    </ITEM>
</ORDER>

Notice that each of the ITEM elements establishes a relationship between the ORDER and a BOOK, and that the relationship itself has attributes, in this case the price at which the book was sold. Relations can have attributes and can contain elements, and the process can be carried to any needed level of detail.
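
The following Python sketch suggests how an application might follow the href on each ITEM to the record it points at. It assumes that the XML-XPTR query parameter carries a bare id, as in the example URIs above, and that the master list has already been loaded locally rather than fetched over the network. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

MASTER_XML = """
<BOOK-MASTER-LIST>
  <BOOK id="book1">
    <TITLE>Number, the Language of Science</TITLE>
    <AUTHOR>Dantzig, Tobias</AUTHOR>
  </BOOK>
  <BOOK id="book2">
    <TITLE>Introduction to Objectivist Epistemology</TITLE>
    <AUTHOR>Rand, Ayn</AUTHOR>
  </BOOK>
</BOOK-MASTER-LIST>
"""

ORDER_XML = """
<ORDER id="order1">
  <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book1">
    <PRICE>5.95</PRICE>
  </ITEM>
  <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book2">
    <PRICE>12.95</PRICE>
  </ITEM>
</ORDER>
"""

def pointer_id(href):
    """Extract the XML-XPTR value from an href, or None if absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get("XML-XPTR")
    return values[0] if values else None

def index_by_id(root):
    """Map each id attribute value to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def sold_titles(order_xml, master_xml):
    """Pair the title of each linked record with the price on its ITEM."""
    records = index_by_id(ET.fromstring(master_xml))
    result = []
    for item in ET.fromstring(order_xml).findall("ITEM"):
        target = records.get(pointer_id(item.get("href", "")))
        if target is not None:
            result.append((target.findtext("TITLE"), item.findtext("PRICE")))
    return result

for title, price in sold_titles(ORDER_XML, MASTER_XML):
    print(price, title)
```

The price stays on the ITEM relation while the title comes from the referenced BOOK, which mirrors the point that the relation itself carries information.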

2.7. Discontiguous Information (propertyOf)

Information about an element can be contained in the element, but also can sit outside it. For example, the following applies a digital signature to a sales order without actually modifying the order:

<dsig:DSIG>
  <xml:propertyOf href="http://bigbookstore.com/data/orders?XML-XPTR=order1"/>
  <MANIFEST>80183589575795589189518915</MANIFEST>
  <SIG href="http://XYX/Joe@company.com"/>
</dsig:DSIG>

2.8. Schema

Every data object, such as a purchase order, contains certain parts, such as sold-to, sold-on date, items, etc. We can write a formal description of what these parts are and which are allowed where. This is called a "schema" and is written using a form of XML-Data:

<xml:schema ID="BookOrderSchema">
  <!-- This schema is digitally signed. Schemas are a form of data,
       so they, too, can be signed. -->
  <dsig:DSIG>
    <MANIFEST >*(&#&$&@*$&%*&@*$&$*@</MANIFEST>
    <SIG href="http://XYX/Jane@company.com"/>
  </dsig:DSIG>

  <!-- Here are all the element types, their contents,
       attributes and relations. -->
  <elementType id="ORDER">
    <relation href="#SOLD-TO"/>
    <relation href="#SOLD-ON"/>
    <relation href="#ITEM" occurs="STAR"/>
  </elementType>
  <relationType id="SOLD-TO">
    <elt href="#PERSON"/>
  </relationType>
  <relationType id="SOLD-ON">  
    <pcdata/>
    <!-- Date is YYYYMMDD -->
    <attribute name="lextype" default="DATE.ISO8061" presence="fixed"/>
  </relationType>
  <elementType id="PERSON">
    <relation href="#LASTNAME"/>
    <relation href="#FIRSTNAME"/>
  </elementType>
  <relationType id="LASTNAME">
    <pcdata/>
  </relationType>
  <relationType id="FIRSTNAME">
    <pcdata/>
  </relationType>
  <relationType id="PRICE">
    <pcdata/>
  </relationType>
  <relationType id="ITEM">
    <any/>
    <relation href="#PRICE"/>
    <range href="#BOOK"/>
    <range href="#RECORD"/>
    <range href="#COFFEE"/>
  </relationType>
  <elementType id="BOOK">
    <relation href="#TITLE"/>
    <relation href="#AUTHOR"/>
  </elementType>
  <elementType id="RECORD">
    <relation href="#TITLE"/>
    <relation href="#ARTIST"/>
  </elementType>
  <relationType id="SIZE">
    <pcdata/>
  </relationType>
  <relationType id="STYLE">
    <pcdata/>
  </relationType>
  <elementType id="COFFEE">
    <relation href="#SIZE"/>
    <relation href="#STYLE"/>
  </elementType>
  <relationType id="TITLE">
    <mixed><elt href="#COMPOSER"/></mixed>
  </relationType>
  <relationType id="AUTHOR">
    <pcdata/>
  </relationType>
  <relationType id="ARTIST">
    <pcdata/>
  </relationType>
  <relationType id="COMPOSER">
    <pcdata/>
  </relationType>
</xml:schema>

2.9. Type Extension

Sometimes some elements are variants of others, in which case we can organize the element types into a genus-species hierarchy using the extends attribute:

<xml:schema ID="ArtSchema">
  <elementType id="artistic-work">
    <relation href="#TITLE"/>
  </elementType>
  <elementType id="BOOK" extends="#artistic-work">
    <relation href="#AUTHOR"/>
  </elementType>
  <elementType id="RECORD" extends="#artistic-work">
    <relation href="#ARTIST"/>
    <relation href="#COMPOSER" occurs="OPTIONAL"/>
  </elementType>
  <relationType id="AUTHOR">
    <pcdata/>
  </relationType>
  <relationType id="COMPOSER" extends="#AUTHOR"/>
  <relationType id="ARTIST">
    <pcdata/>
  </relationType>
</xml:schema>

Here we see that books and records are both types of artistic work, and that a composer is a type of author.

2.10. Schema Extension

We can also use this ability to customize a schema that has useful features, but which is too general. In this example, we show a general schema for orders, and one that is customized for our bookstore:

<xml:schema ID="GenericOrderSchema">
  <elementType id="ORDER">
    <relation href="#SOLD-TO"/>
    <relation href="#SOLD-ON"/>
  </elementType>
  <relationType id="SOLD-TO">
    <elt href="#PERSON"/>
  </relationType>
  <elementType id="PERSON">
    <relation href="#LASTNAME"/>
    <relation href="#FIRSTNAME"/>
  </elementType>
  <relationType id="LASTNAME">
    <pcdata/>
  </relationType>
  <relationType id="FIRSTNAME">
    <pcdata/>
  </relationType>
</xml:schema>  


<xml:schema id="BookOrderSchema">
  <elementType id="ORDER" extends="http://generic.com/genericOrder?XML-XPTR=ID(ORDER)">
    <relation href="#ITEM" occurs="STAR"/>
  </elementType>

  <relationType id="ITEM">
    <any/>
    <relation href="http://generic.com/genericOrder?XML-XPTR=ID(ORDER)"/>
    <range href="http://art.com/schemata?XML-XPTR=ID(BOOK)"/>
    <range href="http://art.com/schemata?XML-XPTR=ID(RECORD)"/>
    <range href="#COFFEE"/>
  </relationType>

  <relationType id="SIZE">
    <pcdata/>
  </relationType>

  <relationType id="STYLE">
    <pcdata/>
  </relationType>

  <elementType id="COFFEE">
    <relation href="#SIZE"/>
    <relation href="#STYLE"/>
  </elementType>
</xml:schema>

3. XML-Data Schema

The XML-Data schema language defines element types, attributes, and relations, and which of these can be used in which combinations with others. It also provides features for organizing element types into a genus-species hierarchy, a basic set of element types, and a small set of lexical types. The schema contains other features from the XML Document Type Definition (DTD) language, such as entity and notation declarations. The XML-Data schema language covers all the features of XML DTDs and can express the same structural information and constraints, so an XML DTD can be mechanically converted to an XML-Data schema.

Schemata are composed principally of declarations for:

element types, represented by elementType
attributes of elements, represented by attribute
relations among elements, represented by relationType
rules governing the valid combinations of the above, represented by any, mixed, and pcdata; also by elt, group, relation, and range
internal and external entities, represented by intEntityDcl and extEntityDcl
notations, represented by notationDcl

Comments can be interspersed as usual in XML, and there is a provision for using references to external schemata or schema fragments.

3.1. The schema document element type: schema

All schema elements are contained within a schema element, like this:

<?XML version='1.0' rmd='all'?>
<!doctype schema SYSTEM "http://www.w3c.org/pub/sotr/schema.dtd">
<xml:schema id='ExampleSchema'>
  <!-- schema goes here. -->
</xml:schema>

3.2. The element type declaration element type: elementType

Key terms used here: element, elementType, empty, any, mixed, pcdata, content model.

The heart of an XML-Data schema is the elementType declaration, which defines a class of elements, gives them attributes, establishes a grammar of which other element types and character data are allowed in their contents, and defines their allowable relationships to elements of other classes. (The allowable content, including relations, is called the "content model.")

<elementType id="example">  <!-- element example (p*) -->
    <elt href="#p" occurs="STAR"/>
</elementType>
<elementType id="p">       <!-- element p ((#PCDATA|p)*) -->
    <mixed><elt href="#p"/></mixed> 
</elementType>

The name attribute is optional if id is present, in which case the id is used as the name.

Within an elementType, elt indicates that instances are permitted to have only a single element type in their content. The occurs attribute of elt specifies whether this content is optional, and gives its cardinality.

Empty and any content are expressed using predefined elements empty and any. (Empty may be omitted. Any signals that any mixture of elements and parsed character data is legal.) Parsed character data content is similarly expressed with a pcdata item. Mixed content (a mixture of parsed character data and one or more element types) is identified by a mixed element, whose content identifies the element types allowed in addition to parsed character data (see below).

<elementType id="ARTIST">
  <pcdata/>
</elementType>

More complex content models are created using group:

<elementType id="animalFriends">
  <group groupType="OR" occurs="STAR">
    <group groupType="OR" occurs="PLUS">
      <elt href="#cat"/>
      <elt href="#dog"/>
    </group>
    <elt href="#bird"/>
    <elt href="#rabbit"/>
    <elt href="#pig"/>
    <elt href="#fish"/>
  </group>
</elementType>
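
To suggest how such a content model corresponds to DTD notation, here is a minimal Python sketch that renders elt and group elements as a DTD-style content-model string. The occurs values STAR, PLUS, and OPTIONAL and the groupType value OR are taken from the examples in this document; the name SEQ for a sequential group and the function name to_dtd are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Suffixes for the occurs values that appear in this document.
OCCURS_SUFFIX = {"STAR": "*", "PLUS": "+", "OPTIONAL": "?"}

# "OR" appears in the example above; "SEQ" is an assumed name for sequences.
GROUP_SEPARATOR = {"OR": "|", "SEQ": ","}

def to_dtd(node):
    """Render an elt or group element as a DTD content-model fragment."""
    suffix = OCCURS_SUFFIX.get(node.get("occurs"), "")
    if node.tag == "elt":
        return node.get("href").lstrip("#") + suffix
    if node.tag == "group":
        separator = GROUP_SEPARATOR[node.get("groupType")]
        return "(" + separator.join(to_dtd(child) for child in node) + ")" + suffix
    raise ValueError("unsupported content-model element: " + node.tag)

SCHEMA_XML = """
<elementType id="animalFriends">
  <group groupType="OR" occurs="STAR">
    <group groupType="OR" occurs="PLUS">
      <elt href="#cat"/>
      <elt href="#dog"/>
    </group>
    <elt href="#bird"/>
    <elt href="#rabbit"/>
    <elt href="#pig"/>
    <elt href="#fish"/>
  </group>
</elementType>
"""

element_type = ET.fromstring(SCHEMA_XML)
model = to_dtd(element_type[0])
# Prints: <!ELEMENT animalFriends ((cat|dog)+|bird|rabbit|pig|fish)*>
print("<!ELEMENT %s %s>" % (element_type.get("id"), model))
```

The sketch covers only elt and group; any, mixed, pcdata, and relations would need their own cases.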

3.3. Relations

Key terms used here: relationType, relation, XML-Link locator, href.

Relation element types express a relationship between one element (usually the relation's parent) and either another element or an atomic value (such as a simple number, string, or date). Relations use the XML-Link locator without implying navigation. The target of a relation is the element referenced by the href attribute if one is present, or else the element contents. This single convention unifies graphs and trees.
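
The convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ORDERS wrapper element and the function name relation_target are hypothetical, and only same-document references of the form "#id" are resolved.

```python
import xml.etree.ElementTree as ET

# ORDERS is a hypothetical wrapper so both styles fit in one document.
DOC_XML = """
<ORDERS>
  <PERSON id="p1"><LASTNAME>Layman</LASTNAME></PERSON>
  <ORDER>
    <SOLD-TO href="#p1"/>
    <SOLD-ON>19970317</SOLD-ON>
  </ORDER>
  <ORDER>
    <SOLD-TO><PERSON><LASTNAME>Paoli</LASTNAME></PERSON></SOLD-TO>
    <SOLD-ON>19970626</SOLD-ON>
  </ORDER>
</ORDERS>
"""

def relation_target(relation, ids):
    """Return the target of a relation element.

    An href wins if present; otherwise the target is the element's own
    contents: its child elements, or its text for an atomic value.
    """
    href = relation.get("href")
    if href is not None:
        if not href.startswith("#"):
            raise ValueError("only same-document references are handled: " + href)
        return ids[href[1:]]
    children = list(relation)
    return children if children else (relation.text or "").strip()

root = ET.fromstring(DOC_XML)
ids = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
first, second = root.findall("ORDER")

linked = relation_target(first.find("SOLD-TO"), ids)      # graph arc via href
nested = relation_target(second.find("SOLD-TO"), ids)[0]  # tree containment
print(linked.findtext("LASTNAME"), nested.findtext("LASTNAME"))
print(relation_target(first.find("SOLD-ON"), ids))
```

Both SOLD-TO elements yield a PERSON, whether it is reached through a link or through containment, and SOLD-ON yields an atomic value.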

Including a relation in an elementType makes it an implicit part of that element's content model, with the default for occurs being OPTIONAL. Relations must occur (in a valid document instance) after any other content. RelationTypes are elements, and the full content model is as if there were a sequential group containing first the explicitly provided content model, then a starred OR group with all the relations as content.

Two element types are used in the schema to effect a relation: The relationType is a specialized kind of elementType, while relation has the same function as elt (but validates that it refers to a relationType).

If a default attribute is specified for a relation, it becomes the default of the value attribute of the relation elt. The range element, if present, declares a restriction on the valid target of a relation. Each range element references one elementType, and any of the referenced element types is a valid target.

 <relationType id="favoriteFood" ><mixed/></relationType>
 <relationType id="chases" ><any/></relationType>

 <elementType id="dog" >
   <any/>
   <attribute name="name"/>
   <relation href="#favoriteFood"/>
   <relation href="#chases"/>
 </elementType>

3.4. Attributes

Key terms used here: attribute, values, default.

After the content model, attribute declarations may occur, which are divided into attributes with enumerated or notation values, and all other kinds.

<elementType id="p1">       <!-- element p1 ((#PCDATA|p1)*) -->
    <mixed><elt href="#p1"/></mixed> 
    <!-- attlist p1 id  ID      #IMPLIED
                    exm (a|b|c) 'c'
                    x   CDATA   #FIXED 'y' -->
    <attribute name='id' type='ID'/>
    <attribute name='exm' type='ENUMERATION' values='a b c' default='c'/>
    <attribute name='x' presence='FIXED' default='y'/>
</elementType>

An attribute may be given a default value. Whether it is required or optional is signaled by presence. (Presence ordinarily defaults to IMPLIED, but if it is omitted and there is an explicit default, presence is set to SPECIFIED.)

Attributes with enumerated (and notation) values permit a values attribute, a space-separated list of legal values. The values attribute is required when the type is ENUMERATION or NOTATION, else it is forbidden. In these cases, if a default is specified, it must be one of the specified values.
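
The rules in the last two paragraphs can be checked mechanically. The Python sketch below is one possible reading of them, not a normative validator; the function names effective_presence and check_attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

def effective_presence(attr):
    """Explicit presence if given, else SPECIFIED when a default is
    declared, else IMPLIED."""
    presence = attr.get("presence")
    if presence is not None:
        return presence
    return "SPECIFIED" if attr.get("default") is not None else "IMPLIED"

def check_attribute(attr):
    """Return a list of problems found in one attribute declaration."""
    problems = []
    attr_type = attr.get("type")
    values = attr.get("values")
    default = attr.get("default")
    if attr_type in ("ENUMERATION", "NOTATION"):
        if values is None:
            problems.append("values is required for type " + attr_type)
        elif default is not None and default not in values.split():
            problems.append("default is not one of the listed values")
    elif values is not None:
        problems.append("values is forbidden unless type is ENUMERATION or NOTATION")
    return problems

good = ET.fromstring("<attribute name='exm' type='ENUMERATION' values='a b c' default='c'/>")
bad = ET.fromstring("<attribute name='exm' type='ENUMERATION' values='a b c' default='d'/>")
print(check_attribute(good), effective_presence(good))
print(check_attribute(bad))
```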

Similar to the facility of multiple ATTLISTs, we sometimes need to have attributesDcls declared separately from the elementType they refer to. We can do this with the propertyOf element, discussed later.

3.5. The internal and external entity declaration element type: intEntityDcl and extEntityDcl

Key terms used here: entity, internal entity, external entity, notation.

This and the next two declarations cover entities in general. Entities are a powerful shorthand mechanism, similar to macros in a programming language.

<intEntityDcl name="LTG">
    <entityDef>Language Technology Group</entityDef>
</intEntityDcl>

<extEntityDcl name="dilbert">
    <notation href="#gif"/>
    <systemId href="http://www.ltg.ed.ac.uk/~ht/dilb.gif"/>
</extEntityDcl>

Here as elsewhere, following XML, systemId must be a URL, absolute or relative, and publicId, if present, must be a Public Identifier (as defined in ISO/IEC 9070:1991, Information technology -- SGML support facilities -- Registration procedures for public text owner identifiers). If a notation is given, it must be declared (see below) and the entity will be treated as binary, i.e., not substituted directly in place of references.

<notationDcl name="gif">
    <systemId href='http://who.knows.where/'/>
</notationDcl>

3.6. The external declarations element type: extDcls

Key terms used here: external entity with declarations.

Although we allow an external entity with declarations to be included, we recommend a different declaration for schema modularization. The extDcls declaration gives a clean mechanism for importing (fragments of) other schemata. It replaces the common SGML idiom of declaring an external parameter entity and then immediately referring to it, and has the same import, namely, that the text referred to by the combination of systemId and publicId is included in the schema in place of the extDcls element, and that replacement text is then subject to the same validity constraints and interpretation as the rest of the schema.

3.7. Type Extension

Key terms used here: type (class), extension (inheritance, subclassing), implements, extends, typeOf (genus).

Schemata of all types can benefit from a subtyping mechanism: indicating that one class of object is a specialization of another more general class. For example, cat and dog both have the type pet as their more general category. To make more effective use of such classes, we introduce one new schema attribute, extends, which can be used to declare explicitly that an element type is a subclass of another:

<xml:schema>
  <elementType id="animalFriends">
    <elt href="#pet" occurs="PLUS"/>
  </elementType>

  <elementType id="pet">
    <any/>
  </elementType>

  <elementType id="cat" extends="#pet"/>

  <elementType id="dog"  extends="#pet"/>

</xml:schema>

This schema says that the animalFriends element class can contain one or more elements from the pet class, such as a cat or a dog. It also says that each cat and dog instance is a pet (that is, any cat is semantically a pet, and any valid cat is also a valid pet). So the following data is now valid under this schema:

<animalFriends>
  <cat/>
  <dog/>
  <cat/>
</animalFriends>
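
A validator could decide this by following the extends chain. The Python sketch below is a minimal illustration under a few assumptions: it uses an unprefixed schema root element for simplicity, resolves only local "#id" references, and checks only the one-or-more-pets rule of animalFriends. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SCHEMA_XML = """
<schema>
  <elementType id="animalFriends">
    <elt href="#pet" occurs="PLUS"/>
  </elementType>
  <elementType id="pet"><any/></elementType>
  <elementType id="cat" extends="#pet"/>
  <elementType id="dog" extends="#pet"/>
</schema>
"""

def supertypes(schema):
    """Map each elementType id to the id it extends, or None."""
    return {
        et.get("id"): et.get("extends", "")[1:] or None
        for et in schema.iter("elementType")
    }

def is_a(type_id, ancestor_id, parents):
    """True if type_id equals ancestor_id or reaches it through extends."""
    seen = set()
    while type_id is not None and type_id not in seen:
        if type_id == ancestor_id:
            return True
        seen.add(type_id)  # guards against accidental extends cycles
        type_id = parents.get(type_id)
    return False

def all_pets(instance_xml, parents):
    """Check that animalFriends holds one or more pets or subtypes of pet."""
    root = ET.fromstring(instance_xml)
    return len(root) > 0 and all(is_a(child.tag, "pet", parents) for child in root)

parents = supertypes(ET.fromstring(SCHEMA_XML))
print(all_pets("<animalFriends><cat/><dog/><cat/></animalFriends>", parents))
print(all_pets("<animalFriends><cat/><fish/></animalFriends>", parents))
```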

Type Extension

It is frequently necessary to add new attributes to a subclass. This requires no extra machinery, because XML already permits multiple attribute list declarations, which cumulatively add attributes to element types. So each subclass may easily add any new attributes desired, as shown here:

<elementType id="dog" extends="#pet">
  <attribute name="age"/>
</elementType>

If the super type has content models (attributes, etc.), these are inherited; that is, they are also declared implicitly for the derived class. In the following example, we give an owner attribute to pet. It is inherited, so both cat and dog now also have an owner attribute.

<xml:schema>
  <elementType id="animalFriends">
    <elt href="#pet" occurs="PLUS"/>
  </elementType>

  <elementType id="pet">
    <any/>
    <attribute name='name'/>
    <attribute name='owner'/>
  </elementType>

  <elementType id="cat" extends="#pet">
    <elt href='#kittens'/>
    <attribute name='lives' type='NMTOKEN'/>
  </elementType>

  <elementType id="dog" extends="#pet">
    <elt href='#puppies'/>
    <attribute name='breed'/>
  </elementType>
</xml:schema>

This schema says that the animalFriends element class can contain one or more pet elements. Because cat and dog are subtypes of pet, they can occur as well. So the following instance fragment is now valid under this schema:

<animalFriends>
  <cat name="Fluffy" lives='9'/>
  <pet name="Diego"/>
  <dog name="Gromit" owner='Wallace' breed='mutt'/>
</animalFriends>
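
To make the inheritance of attributes concrete, the following Python sketch collects the attributes that apply to a type by walking its extends chain. It assumes attributes are declared with a name attribute as in section 3.4, uses an unprefixed schema root, and resolves only local "#id" references. The function name effective_attributes is hypothetical.

```python
import xml.etree.ElementTree as ET

SCHEMA_XML = """
<schema>
  <elementType id="pet">
    <any/>
    <attribute name="name"/>
    <attribute name="owner"/>
  </elementType>
  <elementType id="cat" extends="#pet">
    <attribute name="lives" type="NMTOKEN"/>
  </elementType>
  <elementType id="dog" extends="#pet">
    <attribute name="breed"/>
  </elementType>
</schema>
"""

def effective_attributes(type_id, schema):
    """Attribute names declared on a type plus those inherited via extends."""
    by_id = {et.get("id"): et for et in schema.iter("elementType")}
    names = []
    seen = set()
    while type_id in by_id and type_id not in seen:
        seen.add(type_id)
        element_type = by_id[type_id]
        for attr in element_type.findall("attribute"):
            # Nearer declarations come first; inherited duplicates are skipped.
            if attr.get("name") not in names:
                names.append(attr.get("name"))
        type_id = element_type.get("extends", "")[1:]
    return names

schema = ET.fromstring(SCHEMA_XML)
print(effective_attributes("dog", schema))
print(effective_attributes("cat", schema))
```

Under this reading, owner='Wallace' on dog is accepted even though dog itself declares only breed.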

Additional relations can also be added, but only if the content model of the superType consists of a single list of optional, repeatable element types.

When defining a derived element class, one can also override existing attributes and relations. The following example adds a height relation and overrides the favoriteFood relation, giving it a default value of "Fish." (We also do something fancy here: making this overridden element itself have favoriteFood as its super type ensures that the derived element is in all other respects identical.)

<relationType id="height">
  <any/>
</relationType>

<relationType id="favoriteCatFood" extends="#favoriteFood"/>

<elementType id="cat" extends="#pet">
  <relation href="#height"/>
  <relation href="#favoriteCatFood" default="Fish"/>
</elementType>

Schema Extension

We can also use subtyping to extend an existing schema without editing it. Suppose that we cannot edit the schema defining pet, cat, or dog, but want to use elements with those names and semantics in our document. The following adds the "eyeColor" property to cat.

<relationType id="eyeColor" extends="http://whereever.org/#eyeColor">
    <pcdata/>
</relationType>

<elementType id="cat" extends="http://whereever.org/#cat">
  <relation href="#eyeColor"/>
</elementType>

The rules for allowable subtyping must enforce certain constraints, which are, in principle, that a subtype can have additional relations and attributes (provided this is consistent with the super type's content model) but never fewer, and can add restrictions but never relax them. In practice, this principle leads to rules such as: a default value can be added if there is none, an existing default can be changed, and a DEFAULT can be converted to FIXED.

Implements

Subtyping as we have described it here is actually a combination of two effects: First, we assert that an element of one type is also of another (as in a cat is a pet). Second, we achieve economies and maintainability in the declarations to make sure that the first is true. That is, the derived element class is automatically provided with all the properties of the super type. Sometimes it is valuable to have the first effect without the second. (This is equivalent to the Java implements facility.) We indicate this by using the implements element:

<relationType id="favoriteFood">
  <mixed/>
</relationType>

<relationType id="weight">
  <mixed/>
</relationType>

<elementType id="cat">
  <implements href="http://whereever.org/#pet"/>
  <attribute name="name"/>
  <relation href="#favoriteFood"/>
  <relation href="#weight"/>
</elementType>

This has no effect on the attributes or relations of instances of cat, but asserts in the schema that every cat is also a pet (that is, any cat is semantically a pet, and any valid cat is also a valid pet).
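
The difference between extends and implements can be sketched in a few lines of Python. This is an illustration under our own assumptions, not part of the proposal: the schema fragment uses local references (#pet) rather than the external URL, the barkVolume relation on dog is hypothetical, and the helper names are ours. Both mechanisms contribute to the "is also a" assertion, but only extends contributes inherited relations.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<schema>
  <elementType id="pet">
    <relation href="#favoriteFood"/>
    <relation href="#weight"/>
  </elementType>
  <elementType id="dog" extends="#pet">
    <relation href="#barkVolume"/>
  </elementType>
  <elementType id="cat">
    <implements href="#pet"/>
    <relation href="#favoriteFood"/>
    <relation href="#weight"/>
  </elementType>
</schema>
"""

types = {t.get("id"): t for t in ET.fromstring(SCHEMA).iter("elementType")}


def local(ref):
    """Strip the leading '#' from a same-document reference."""
    return ref.lstrip("#")


def supertypes(name):
    """Types that `name` is asserted to be, via extends or implements."""
    t = types[name]
    refs = [i.get("href") for i in t.findall("implements")]
    if t.get("extends"):
        refs.append(t.get("extends"))
    found = set()
    for ref in refs:
        found.add(local(ref))
        found |= supertypes(local(ref))
    return found


def relations(name):
    """Declared relations plus those inherited through extends only."""
    t = types[name]
    inherited = relations(local(t.get("extends"))) if t.get("extends") else []
    return inherited + [local(r.get("href")) for r in t.findall("relation")]


print(supertypes("dog"), relations("dog"))
print(supertypes("cat"), relations("cat"))
```

Here dog receives favoriteFood and weight automatically, whereas cat must declare them itself; both are nonetheless reported as pets.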

Relation of Type Extension to Parameter Entities

Sophisticated DTDs often make complex use of parameter entities in an attempt to consolidate common structures in one reusable place. Such parameter entities often represent implicit classes.

The need is real, but the approach often leads to obscurity and reduced maintainability. Further, expansion of entities loses all connection with their source: once expanded, the fact that some set of element types was a co-declared set, re-used in multiple places, is lost.

3.8. Lexical Data Types

Information such as dates and numbers is often expressed in a format that requires some further parsing. For example, the same date can be written "October 22, 1954" or "19541022" (and, from what we have seen, in about 300 other ways). The lextype attribute discriminates among formats. Appearing on instance elements, it describes the format of the remainder of the element. The value of the lextype attribute is always a reference to a URI identifying the parsing rules. XML-Data should define a small number of these. We propose NUMBER, INTEGER, REAL, and DATE.ISO8061.

<birthday lextype="DATE.ISO8061">19541022</birthday>

These are declared in the schema as follows:

<relationType id="birthday">
  <attribute name="lextype" default="DATE.ISO8061" presence="fixed"/>
</relationType>

When giving the lexical type of an attribute in the schema, lextypeIs is used, as in:

<attribute name="price" presence="required" lextypeIs="number"/>

Some patterns will indicate that several properties or attributes should be used in combination to arrive at a value. For example, a custom pattern could indicate a date expressed as the following:

<relationType id="birthday">
  <attribute name="lextype" default="DATE.ATTR-YMD" presence="specified"/>
</relationType>
...
<birthday year="1954" month="10" day="22"/>
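
How a processor might act on lextype can be sketched as follows. This Python is a minimal illustration under our own assumptions: the function name parse_value is hypothetical, NUMBER is treated as a floating-point value, DATE.ISO8061 is handled only in the compact YYYYMMDD form shown above, and the lextype attribute is written explicitly on each instance (a real processor would first apply the defaults declared in the schema).

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime


def parse_value(elem):
    """Interpret an element according to its lextype attribute."""
    lextype = elem.get("lextype")
    text = (elem.text or "").strip()
    if lextype == "INTEGER":
        return int(text)
    if lextype in ("REAL", "NUMBER"):
        return float(text)
    if lextype == "DATE.ISO8061":
        # Only the compact YYYYMMDD form is handled in this sketch.
        return datetime.strptime(text, "%Y%m%d").date()
    if lextype == "DATE.ATTR-YMD":
        # The value is assembled from several attributes, not from content.
        return date(int(elem.get("year")),
                    int(elem.get("month")),
                    int(elem.get("day")))
    return text  # unknown or absent lextype: leave the content unparsed


compact = ET.fromstring('<birthday lextype="DATE.ISO8061">19541022</birthday>')
split = ET.fromstring(
    '<birthday lextype="DATE.ATTR-YMD" year="1954" month="10" day="22"/>')

# Both spellings yield the same date value.
print(parse_value(compact) == parse_value(split))
```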

3.9. Basic Semantic Data Types

We need to define here a small number of basic types and their hierarchy, corresponding to simple data types such as Number and Date. (Dates are a subtype of numbers.)

We also need to define the expression of each of the basic Java and SQL data types in terms of these basic ones, plus additional properties giving units, precision, min, max, default pattern, and other properties. For example, an INTEGER typically is a number with certain min and max property values. Note that units should be an element type with possible structure, so that things like "miles/hours" or "feet/(sec*sec)" can be represented and used for automatic conversions.
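
To suggest why structured units help with automatic conversion, here is a minimal Python sketch. Everything in it is our own assumption rather than part of the proposal: the Unit class, the choice of metres and seconds as base units, and the restriction to two dimensions (length and time). A unit carries a scale factor to the base unit and a tuple of dimension exponents, so that compound units such as "miles/hours" or "feet/(sec*sec)" are built by multiplication and division.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    scale: float  # multiplier that converts to the base unit
    dims: tuple   # exponents of (length, time)

    def __mul__(self, other):
        return Unit(self.scale * other.scale,
                    tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        return Unit(self.scale / other.scale,
                    tuple(a - b for a, b in zip(self.dims, other.dims)))


def convert(value, src, dst):
    """Convert a value between two units of the same dimension."""
    if src.dims != dst.dims:
        raise ValueError("incompatible units")
    return value * src.scale / dst.scale


METER = Unit(1.0, (1, 0))
FOOT = Unit(0.3048, (1, 0))
MILE = Unit(1609.344, (1, 0))
SECOND = Unit(1.0, (0, 1))
HOUR = Unit(3600.0, (0, 1))

miles_per_hour = MILE / HOUR
feet_per_sec2 = FOOT / (SECOND * SECOND)

# 60 miles per hour expressed in metres per second
print(convert(60, miles_per_hour, METER / SECOND))
```

Because the dimension exponents travel with the unit, an attempt to convert a speed into an acceleration is rejected rather than silently producing a number.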

4. Standard Vocabulary

We expect standard libraries of vocabulary to be developed to capture common semantics used in vertical applications and particularly in industry and application domains. Dublin Core and CDF are two examples of such standard libraries.

5. Relations to Other Proposed Standards

(Note that the links below require a W3C password.)

The W3C site at http://www.w3.org/PICS/Member/NG contains links to several related papers, including Ora Lassila's PICS-NG document, Renato Iannella's small PICS extension proposal, CDF, MCF in XML, and the Web Collections using XML proposal. Specific notes on some of these follow.

5.1. XML-LINK

All relations use href in a manner consistent with the XML-LINK working draft dated April 6, 1997 (the most recent at the time of this writing). XML-Links are a type of relation (with extra attributes, elements, and semantics indicating traversal).

5.2. PICS-NG

PICS-NG Metadata Model and Label Syntax describes a set of requirements for structured data to be used on the Internet. XML-Data is an application of XML concepts to those requirements.

5.3. CDF

The Channel Definition Format (CDF) is a natural application of XML-Data and is fully compatible with the syntax and the ideas presented in this document. Its format is a validatable grammar given a proper schema. The existing use of href in CDF is consistent with XML-LINK and XML-Data usage. CDF defines a number of basic element types that would be appropriate for a standard library.

6. Conclusion

Future applications of the Internet will focus on adding user value to information through semantic annotation. Semantics will permit information to be discovered, targeted, reused, and integrated. Not only does this make the content more usable, but it opens up opportunities for software developers to build components that exploit these semantics. Such components could include applications as prosaic as application or user logging, or as futuristic as user agents that assist in finding or organizing contents, World Wide Web "surf buddies" that accompany a user's browsing and add valuable or entertaining comments, or natural language query systems. Semantic annotation turns the Internet into a platform for programming powerful and valuable applications.

This specification lays the foundation for how applications can annotate their information content. It adds powerful, new constructs for representing semantics, and is sufficiently advanced for use in artificial intelligence and natural language systems, yet retains the architecture and investment of existing XML and the efficiency of its representation.

Appendix A - The XML DTD for a Schema


<!ENTITY % nodeattrs 'id ID #IMPLIED'>
<!-- href is as per XML-LINK, but is not required unless there is
      no content -->

<!ENTITY % exattrs   'extends CDATA #IMPLIED'>

<!ENTITY % linkattrs 'id ID #IMPLIED
                      href CDATA #IMPLIED'>

<!-- The shared content model of elementType, linkType and relationType -->
<!-- Omitted element type same as "empty." -->
<!ENTITY % extendedmodel 'implements*,
                          (elt|group|empty|any|pcdata|mixed)?,
                          (relation|attribute)*'>

<!-- The top-level container -->
<!element schema         ((elementType|propertyOf|linkType|
                          relationType|extendType|augmentElementType|
                          intEntityDcl|extEntityDcl|
                          notationDcl|extDcls|namespaceDcl)*)>
<!attlist schema %nodeattrs;>

<!-- Element Type Declarations -->
<!element elementType   (%extendedmodel;)>
<!-- Either name or id must be present - - absent name defaults to id -->
<!attlist elementType %nodeattrs;
                      %exattrs;
                name    CDATA      #IMPLIED>

<!-- Element types allowed in content model -->
<!-- Note this is just short for a model group with only one elt in it -->
<!element elt           EMPTY>
<!-- Elements can have exponents as well as groups -->
<!-- The href is required -->
<!attlist elt   %linkattrs;
                occurs     (required|optional|star|plus) 'required'>

<!-- A group in a content model, sequential or disjunctive -->
<!element group         ((group|elt)+)>
<!attlist group         %nodeattrs;
                groupType (seq|or) 'seq'
                occurs  (required|optional|plus) 'required'>

<!element any           EMPTY>
<!element empty         EMPTY>
<!element pcdata        EMPTY>

<!-- mixed content is just a flat, non-empty list of elts -->
<!-- We don't need to say anything about #pcdata, it's implied -->
<!element mixed         (elt+)>
<!attlist mixed         %nodeattrs;> 

<!-- Attributes -->
<!-- default value must be present if presence is specified or fixed -->
<!-- presence defaults to specified if default is present, else implied -->
<!-- name attribute is locally unique, defaults to id if absent -->
<!element attribute  EMPTY>
<!attlist attribute  %linkattrs;
                name    CDATA #IMPLIED
                type    (id|idref|idrefs|entity|entities|nmtoken|nmtokens|
                         enumeration|notation|cdata) 'cdata'
                default CDATA #IMPLIED
                values NMTOKENS #IMPLIED
                presence (implied|specified|required|fixed) #IMPLIED 
                lextypeIs CDATA #IMPLIED>

<!-- Relations - - relationTypes are pointed to from relations,
            just as elementTypes are pointed to from elts -->
<!element relationType  (%extendedmodel;,
                         range*)>
<!attlist relationType  %nodeattrs;
                        %exattrs;
                        name CDATA #IMPLIED>

<!element range  EMPTY>
<!attlist range %linkattrs;>

<!element relation  EMPTY>
<!attlist relation  %linkattrs;
                    default CDATA #IMPLIED
                    occurs (required|optional|star|plus) 'optional'>

<!-- For adding attributes to existing element types -->
<!element propertyOf    EMPTY>
<!attlist propertyOf    href CDATA #REQUIRED>

<!element augmentElementType ((relation|attribute)*)>
<!attlist augmentElementType %linkattrs;
                             %exattrs;>

<!-- Shorthand for simple XML-LINKs -->
<!element linkType (%extendedmodel;)>
<!attlist linkType %nodeattrs;
                   %exattrs;
                   name CDATA #IMPLIED
                   role CDATA #IMPLIED
                   title CDATA #IMPLIED
                   show (embed|replace|new) #IMPLIED
                   actuate (auto|user) #IMPLIED
                   behaviour CDATA #IMPLIED>
<!element implements EMPTY>
<!attlist implements href CDATA #REQUIRED>

<!-- Entity Declarations -->
<!-- Note: as this is written, only external entities
      can have structure without escaping it -->
<!-- Name defaults to id if absent -->
<!element intEntityDcl     (#PCDATA)>
<!attlist intEntityDcl %nodeattrs;
                name    CDATA #IMPLIED>

<!-- The entity will be treated as binary if a notation is present -->
<!-- systemID and publicId (if present) must have the required syntax -->
<!element extEntityDcl    ( systemId, publicId?)>
<!attlist extEntityDcl %nodeattrs;
                name    CDATA #IMPLIED
                notation CDATA #IMPLIED>

<!-- Pointers for above -->
<!element systemID      EMPTY>
<!attlist systemID      %linkattrs;>
<!-- Must be empty if href is used -->
<!element publicID      (#PCDATA)>
<!attlist publicID      %linkattrs;>

<!-- Notation Declarations -->
<!-- systemID and publicId (if present) must have the required syntax -->
<!element notationDcl        (systemId, publicId?)>
<!attlist notationDcl   %linkattrs;
                name    CDATA #IMPLIED>

<!-- External entity with declarations to be included -->
<!-- systemID and publicId (if present) must have the required syntax -->
<!element extDcls       EMPTY>
<!attlist extDcls
                systemId CDATA #REQUIRED
                publicId CDATA #IMPLIED>

<!-- Namespace Declarations -->
<!-- systemID and publicId (if present) must have the required syntax -->
<!element namespaceDcl  EMPTY>
<!attlist namespaceDcl  %linkattrs;
                name    CDATA #IMPLIED>
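
The correspondence between the elt and group elements declared above and a conventional DTD content model can be sketched in Python. This is an illustration under our own assumptions, not a normative mapping: the function name to_dtd and the element names title, para, and list are hypothetical, and only same-document references are handled. The occurs values map to the usual occurrence indicators, and groupType selects between a sequence and a choice.

```python
import xml.etree.ElementTree as ET

# occurs value -> DTD occurrence indicator
OCCURS = {"required": "", "optional": "?", "star": "*", "plus": "+"}


def to_dtd(node):
    """Render an elt or group element as a DTD content-model fragment."""
    if node.tag == "elt":
        body = node.get("href").lstrip("#")
    elif node.tag == "group":
        # groupType defaults to "seq", as in the attlist for group.
        sep = " | " if node.get("groupType", "seq") == "or" else ", "
        body = "(" + sep.join(to_dtd(child) for child in node) + ")"
    else:
        raise ValueError(f"unsupported content-model element: {node.tag}")
    # occurs defaults to "required" for both elt and group.
    return body + OCCURS[node.get("occurs", "required")]


model = ET.fromstring("""
<group groupType="seq">
  <elt href="#title"/>
  <group groupType="or" occurs="plus">
    <elt href="#para"/>
    <elt href="#list" occurs="optional"/>
  </group>
</group>
""")

print(to_dtd(model))  # (title, (para | list?)+)
```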
