Valid XML and DTDs

Peter Coxhead [1]

Contents

 1 Validity
 2 Document Type Declaration
 3 Element Declarations
 3.1 The Any Content Model
 3.2 The Empty Content Model
 3.3 Element Only Content Models
 3.4 Mixed Content Models
 4 Attribute Declarations
 4.1 Enumerated List Attribute Type
 4.2 Name Token Attribute Type
 4.3 Multi-Name Token Attribute Type
 4.4 String Attribute Type
 4.5 Identifier Attribute Type
 4.6 Identifier Reference Attribute Type
 4.7 Multi-Identifier Reference Attribute Type
 5 Entities
 6 External DTDs
 7 Namespaces
 8 Other Schema Languages
  References/Bibliography

1 Validity

A well-formed XML document is, essentially, one whose syntax is correct according to the XML specification (W3C 2006a). Only well-formed XML documents can be handled by XML processors. However, being well-formed doesn't guarantee that an XML document has the contents it should have. Consider an XML document which lists all employees. We are likely to want every employee in the document to have a first name and family name. Thus although the following XML document is well-formed, it doesn't meet our requirements:

<?xml version="1.0"?>
<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>

We would probably also like to ensure that every employee has an ID, but Joe Bloggs doesn't.

We could rely on an XML processor to carry out the necessary checks. Java has a library which makes reading and processing XML quite easy, so we could write a Java program which read in the XML document (provided it was well-formed of course) and applied a series of checks. Experience shows that this isn't the right approach:

What we need is something like a 'type specification' for an XML document. For example, XHTML is a version of HTML which is written as XML; however, it must be more than well-formed, it must also be interpretable by a web browser, which determines which elements and attributes it contains, how they are ordered, etc. As another example, MathML is a way of writing mathematical formulae in XML; to be acceptable to a MathML processor, the XML document must conform to the MathML specification.

If we can specify the type of an XML document, then XML processors can check not only whether the document is well-formed, but also whether it is valid, i.e. whether it matches the specification for that type. An XML processor which performs such a check is called a validating processor. Note that an XML document can be well-formed but not valid, but can't be valid but not well-formed. An XML type specifies a whole class of XML documents that conform to that type. A document that does conform to the type is called an instance of the type.

The type can include a specification of which elements are allowed in the document, which elements can, should or must be nested inside which other elements, which attributes are allowed, not allowed or required in each element, and much more besides. Many possible errors in XML documents can be caught without the programmer having to write any code, and creators of XML documents have a clear specification of what precisely they must do in order to meet the requirements of the type.

There are a number of different systems for specifying the type of an XML document. They can all be called schema languages (because they describe, in a sense, the shape of an XML document). (However, this can cause some confusion because there is one specific XML schema language called the W3C XML Schema Language.) The simplest system, and the one built into the XML specification (W3C 2006a), is called document type definition, or DTD. Some other schema languages will be briefly described in §8 below.

2 Document Type Declaration

To be valid, an XML document must be associated with a Document Type Definition or DTD. The document type declaration is where this happens. Every valid XML document must have exactly one, which must come after the XML declaration (if there is one) but before the first (root) element in the document.

The simplest (but not usually the best) approach is to embed the DTD into the XML file. In this case the document type declaration actually contains the DTD. Thus we could expand the 'employees' XML document presented above to the following:

<?xml version="1.0"?>
 
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee (firstname, familyname, comment)>
  <!ATTLIST employee id ID #REQUIRED
                     manager IDREF #IMPLIED>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT familyname (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>
 
<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>

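Although this document is well-formed, it is now INVALID, since it does not conform to the DTD, which specifies that every employee element must contain a firstname, a familyname and a comment element, in that order, and must have an id attribute. The first employee has no id attribute, and the second has no firstname element.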
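As a sketch of how the document could be made valid against this DTD, the version below supplies the missing pieces. The id value e1001 and the first name Asha are invented purely for illustration; any unique name and any first name would serve equally well.

```xml
<?xml version="1.0"?>

<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee (firstname, familyname, comment)>
  <!ATTLIST employee id ID #REQUIRED
                     manager IDREF #IMPLIED>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT familyname (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>

<employees>
  <!-- id="e1001" is an invented value; it only has to be a unique name -->
  <employee id="e1001" manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <!-- "Asha" is an invented first name, added only to satisfy the DTD -->
    <firstname>Asha</firstname>
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
```

Note that the empty comment element is still acceptable, since an element declared as (#PCDATA) may contain no character data at all.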

We will now look at the components of DTDs in turn.

3 Element Declarations

Element type declarations are of the form:

<!ELEMENT element-name content-rule >

The element-name is, of course, just the name of the element that the rule applies to, while the content-rule specifies the constraints that apply to it. There are a number of different forms for the content-rule, called content models.

3.1 The Any Content Model

Here the content-rule has the value ANY, and the corresponding element in the XML document can have any mix of character data and elements in any order. If no DTD is provided for an XML document, this is the default assumed for every element. Clearly this content model should be avoided if at all possible, since it effectively avoids specifying the format of the element. The following declaration says that every a element may have any content:

<!ELEMENT a ANY>

3.2 The Empty Content Model

Here the content-rule has the value EMPTY, and the corresponding element in the XML document must contain no content, neither sub-elements nor character data. Thus the following declaration says that every a element must have no content:

<!ELEMENT a EMPTY>

3.3 Element Only Content Models

Here an expression involving element names and some special characters (representing operators) is used to describe patterns of child elements that can occur in the content of the target element. The expressions are built up recursively from a number of components. Expressions must be fully parenthesised in order to exclude any possibility of ambiguity or the need to specify the order of precedence of pattern operators.

A single element name: the content of a is exactly one b element.

<!ELEMENT a (b)>

Technically, an element name can only appear inside sequences or choices, both of which must be enclosed in parentheses. Hence the parentheses around the b in the example above are because it is actually the only entry in a sequence of length 1. Thus the following is INVALID:

<!ELEMENT a b>

(The reason for this restriction is so that there can be elements named ANY and EMPTY without conflicting with the corresponding content models.)

A sequence: the content of a is a b element, then a c element, then a d element, in that order.

<!ELEMENT a (b, c, d)>

Recall that every sequence must be enclosed in parentheses.

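A choice: the content of a is exactly one of b, c or d.

<!ELEMENT a (b | c | d)>

Zero or more: the content of a is any number (including zero) of b elements.

<!ELEMENT a (b*)>

Note that this is exactly the same as writing:

<!ELEMENT a (b)*>

One or more: the content of a is at least one b element.

<!ELEMENT a (b+)>

Optional: the content of a is either a single b element or nothing.

<!ELEMENT a (b?)>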

By combining these descriptions, very complex constraints can be constructed:

<!ELEMENT cv
          (preface, (qualification | experience)+, hobbies?, referee*)>

Here the cv element must start with a preface element. This must be followed by a repeating group that has to occur at least once. Each cycle of this group can contain either a qualification element or an experience element; i.e. there can be a sequence of qualification and experience elements in any order and of any length greater than 0. Next there might or might not be a single hobbies element. Finally there can be a sequence of any number (including zero) of referee elements.
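To make the pattern concrete, here is a minimal sketch of an instance that would satisfy this content model. The declarations for the child elements are an assumption made only for this example (each is taken to hold plain character data), and the text inside them is invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE cv [
  <!ELEMENT cv
            (preface, (qualification | experience)+, hobbies?, referee*)>
  <!-- Assumed for this sketch: every child holds only character data -->
  <!ELEMENT preface (#PCDATA)>
  <!ELEMENT qualification (#PCDATA)>
  <!ELEMENT experience (#PCDATA)>
  <!ELEMENT hobbies (#PCDATA)>
  <!ELEMENT referee (#PCDATA)>
]>
<cv>
  <preface>Curriculum vitae of Joe Bloggs</preface>
  <!-- The repeating group: three cycles, in a freely mixed order -->
  <experience>Warehouse assistant</experience>
  <qualification>Diploma in logistics</qualification>
  <experience>Warehouse manager</experience>
  <!-- hobbies is optional and omitted here -->
  <referee>A. Patel</referee>
  <referee>B. Jones</referee>
</cv>
```

Moving a referee element in front of the preface, or including two hobbies elements, would make the document invalid.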

3.4 Mixed Content Models

Here we can finally specify that an element contains character data, and how it can be mixed with child elements. There are two forms:

<!ELEMENT a (#PCDATA)>

#PCDATA historically comes from 'Parsed Character Data', but is not a good name since it corresponds to what in XML is called just 'character data'.

<!ELEMENT a (#PCDATA | b | c | d | ...)*>

These are the ONLY kinds of mixed content models allowed. There's no way, for example, of specifying that an element starts with character data but then ends with a given element. For this reason, the second form of mixed content model should be avoided unless it really is the case that the child elements can occur anywhere within character data.

Neither form provides any way of specifying what kind of character data can occur, e.g. that it must be an integer or consist of exactly three words.
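As an illustration of the second form, the following sketch declares a paragraph-like element in which character data and two kinds of child element can be freely interleaved. The element names para, em and code are hypothetical and chosen only for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!-- Character data and em/code children, in any order and any number -->
  <!ELEMENT para (#PCDATA | em | code)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT code (#PCDATA)>
]>
<para>Use <code>EMPTY</code> for elements with <em>no</em> content,
and avoid <code>ANY</code> wherever possible.</para>
```

The DTD cannot require, say, that a para ends with a code element or that it contains at least one em element; any interleaving is accepted.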

4 Attribute Declarations

Attribute Type Declarations are of the form:

<!ATTLIST elem-name att-name1 att-type1 att-status1
                    att-name2 att-type2 att-status2
                       ...      ...         ...    >

The elem-name identifies the element to which this list of attribute declarations applies. The att-name identifies which particular attribute of the element is being declared. The values of att-type and att-status then specify the form of the attribute.

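The att-status part of the declaration is the easiest to understand. It specifies whether the attribute is required or optional and, if optional, whether a default value should be assumed if not given in the XML document. There are four forms: #REQUIRED (the attribute must be present); #IMPLIED (the attribute is optional and has no default value); #FIXED "value" (the attribute always has the given value, whether or not it is written in the XML document); and "value" on its own (the attribute is optional and takes the given value by default if omitted).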

The last two forms create a problem. A non-validating XML processor is only required to check that an XML document is well-formed; it is not required to process the DTD fully in order to check whether the document is valid. However, if it doesn't fully process the DTD then it may not be able to add default or fixed attributes, so that if the processor constructs a tree to represent the XML document (see e.g. "Introduction to XML", §9), a validating XML processor may create a different tree from that created by a non-validating XML processor.

Returning to the att-type part of the attribute declaration, there are ten different forms, of which seven fall within the scope of this module. Each will be described in turn.

4.1 Enumerated List Attribute Type

This form is used to specify that the value of the attribute can only be one of a fixed set of values given in a vertical bar separated list of name tokens. The list must be enclosed in parentheses. Note that each value must be a valid name token: like a name (see e.g. "Introduction to XML", §1), but without the restriction on its first character. Some examples:

<!ATTLIST employee status (fulltime | parttime) "fulltime">

This specifies that employee elements may have the attribute status, which can only have one of the two values fulltime or parttime. If the attribute is omitted, then a validating XML processor must add it with the value fulltime. If the overwhelming majority of employees are full time, then an attribute declaration like this avoids the need to include status="fulltime" in most of the employee elements. However, it usually makes the eventual code written to process the XML document more complex, since it will be safer to determine whether an employee is full time or not by testing for either the absence of the status attribute or its having the value fulltime.

<!ATTLIST menu cuisine (French | Italian | Indian | Chinese) #REQUIRED>

This specifies that menu elements must have the attribute cuisine, which can only have one of the values French, Italian, Indian or Chinese. If the attribute is omitted, then a validating XML processor must report an error.
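The effect of a defaulted enumerated attribute can be seen in a small instance. This is only a sketch: the content model chosen for employee (plain character data) is an assumption made to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!-- Assumed for this sketch: an employee holds only character data -->
  <!ELEMENT employee (#PCDATA)>
  <!ATTLIST employee status (fulltime | parttime) "fulltime">
]>
<employees>
  <!-- status omitted: a validating processor supplies status="fulltime" -->
  <employee>Joe Bloggs</employee>
  <employee status="parttime">A. Patel</employee>
  <!-- status="contractor" would be invalid: it is not in the list -->
</employees>
```

A processor that does not read the DTD sees no status attribute at all on the first employee, which is why code that processes such documents is safer treating a missing status as fulltime.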

4.2 Name Token Attribute Type

The keyword NMTOKEN is used to indicate this type, which is used to specify that the value of the attribute can be any arbitrary 'name token'. A name token is like a name, but may start with any name character, including a digit, hyphen or full stop. In the example above, if it had been decided to use a single word to refer to a type of cuisine, but it was required to be able to extend the types of cuisine to more than a fixed list, the appropriate declaration would be:

<!ATTLIST menu cuisine NMTOKEN #REQUIRED>

This would allow the following XML to be valid:

<menu cuisine="Greek"> ... </menu>

but not:

<menu cuisine="Greek and Italian"> ... </menu>

4.3 Multi-Name Token Attribute Type

The keyword NMTOKENS is used to indicate this type, which is used to allow an attribute to have a white space separated list of values, each of type NMTOKEN. Thus if we declare the cuisine attribute of menu via:

<!ATTLIST menu cuisine NMTOKENS #REQUIRED>

then the following XML is valid:

<menu cuisine="Greek Italian"> ... </menu>

(Actually the value "Greek and Italian" would also be valid, but the spirit of this attribute type is to have a list of name tokens each with its own meaning.)

4.4 String Attribute Type

The keyword CDATA is used to indicate this type. As with the keyword #PCDATA for elements, the name is historic, but misleading. Attributes cannot have general character data values. They cannot contain CDATA sections, so it's particularly odd to use CDATA as the keyword. CDATA means that the attribute can have any string value. For example:

<!ATTLIST purchase ordernumber CDATA #REQUIRED
                   customerid  CDATA #IMPLIED
                   seller      CDATA #FIXED "ACME INC."
                   priority    CDATA "normal">

All of the four attributes of a purchase element can have string values (and so, for example, can contain white space). The attribute ordernumber must be present, whereas customerid can be omitted, but has no default value if it is. The value of seller is always the string 'ACME INC.'. The attribute priority may be omitted but if so defaults to normal.

It should be clear that CDATA offers only a very weak specification for an attribute value, and so should be avoided if possible. The order number and customer ID are likely to be numbers in a real application. DTDs don't provide any way of specifying this. However NMTOKEN is likely to be a better choice than CDATA, since it rules out embedded white space. If the attribute priority is to be allowed to have values like rush order then an enumerated list can't be used (because its values can only be name tokens, which cannot contain white space) nor can NMTOKEN, and CDATA is the only choice. However, it would probably be better instead to use values like rushOrder or rush_order, which allow the more restricted NMTOKEN as the attribute type. Only the attribute seller above really needs to be CDATA.
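Following that advice, the purchase declaration might be tightened along the lines sketched below. The declaration of purchase as an EMPTY element and the attribute values in the instance are assumptions made only for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE purchase [
  <!-- Assumed for this sketch: purchase carries all its data in attributes -->
  <!ELEMENT purchase EMPTY>
  <!ATTLIST purchase ordernumber NMTOKEN #REQUIRED
                     customerid  NMTOKEN #IMPLIED
                     seller      CDATA   #FIXED "ACME INC."
                     priority    NMTOKEN "normal">
]>
<!-- Valid: every NMTOKEN value is a single token without white space -->
<purchase ordernumber="20417" customerid="C-88" priority="rushOrder"/>
```

With these declarations priority="rush order" would be reported as invalid, although ordernumber="abc" would still be accepted, since a DTD cannot insist on digits.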

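4.5 Identifier Attribute Type

The keyword ID is used to indicate this type. It means that the attribute is an identifier for the target element and so can be used in references (see e.g. "Introduction to XML", §5). Note that the value of an ID attribute must be a valid name, must be unique among all the ID values in the XML document, and that an element can have at most one attribute of type ID, which must be declared as either #REQUIRED or #IMPLIED.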

Example:

<!ATTLIST book bookid          ID    #REQUIRED
               publicationdate CDATA #IMPLIED >

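4.6 Identifier Reference Attribute Type

The keyword IDREF is used to indicate this type. It means that the attribute is used to refer to an element in the current XML document which has an attribute of type ID, i.e. that the attribute is used to make a reference. Note that the value of an IDREF attribute must be a valid name and must match the value of an ID attribute on some element in the same XML document.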

For example:

<!ATTLIST division divisionID  ID    #REQUIRED>
 
<!ATTLIST employee employeeID  ID    #REQUIRED
                   division    IDREF #REQUIRED
                   manager     IDREF #IMPLIED>

These two declarations ensure that division and employee elements have unique identifiers (IDs), divisionID and employeeID respectively. An employee has to have an ID reference value in a division attribute (e.g. to say that he or she is employed in that division), and may have an ID reference value in a manager attribute (e.g. to say that the employee with that employeeID is his or her manager). Notice that the only requirement enforced by validity is that the value of an attribute of type IDREF is the value of an attribute of type ID in the same XML document. Validity checks don't prevent us from erroneously putting an employee ID value in the division attribute, or a division ID value in the manager attribute.
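The weakness just described can be demonstrated with a small document. This is a sketch only: the root element company, its content model and all the identifier values are invented for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE company [
  <!-- Assumed wrapper element and content models for this sketch -->
  <!ELEMENT company (division*, employee*)>
  <!ELEMENT division EMPTY>
  <!ELEMENT employee EMPTY>
  <!ATTLIST division divisionID  ID    #REQUIRED>
  <!ATTLIST employee employeeID  ID    #REQUIRED
                     division    IDREF #REQUIRED
                     manager     IDREF #IMPLIED>
]>
<company>
  <division divisionID="D01"/>
  <employee employeeID="E8012" division="D01"/>
  <!-- Valid, but almost certainly a mistake: division refers to an
       employee and manager refers to a division -->
  <employee employeeID="E3247" division="E8012" manager="D01"/>
</company>
```

A validating processor accepts this document, because every IDREF value matches some ID value; catching the mix-up requires application code or a more expressive schema language.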

Attributes of type IDREF cannot be used to make references to elements in other XML documents, since the validity check will fail. Generally, the NMTOKEN type must be used for this purpose.

4.7 Multi-Identifier Reference Attribute Type

The keyword IDREFS is used to indicate this type. The value of the attribute is a list of white space separated IDREF values, each with the same constraints as a single identifier reference. For example:

<!ATTLIST employee employeeID   ID     #REQUIRED
                   manager      IDREF  #IMPLIED
                   subordinates IDREFS #IMPLIED>

Every employee may have a single manager and may have multiple subordinates. A corresponding valid XML document might be:

<?xml version="1.0"?>
 
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee EMPTY>
  <!ATTLIST employee employeeID   ID     #REQUIRED
                     manager      IDREF  #IMPLIED
                     subordinates IDREFS #IMPLIED>
]>
 
<employees>
  <employee employeeID="E3247" manager="E8012" />
  <employee employeeID="E3248" manager="E8012" />
  <employee employeeID="E3249" manager="E8012" />
  <employee employeeID="E8012" subordinates="E3247 E3248 E3249" />
</employees>

5 Entities

Entities are a somewhat complicated part of the XML standard, but we will only cover a very small part of them in this module. Essentially, entities allow pieces of text, and even fragments of XML or DTDs, to be associated with entity names. Then whenever an entity reference appears in the XML document, the XML processor substitutes the associated text.

The simplest kind of entity ('an internal general entity') is declared using the form:

<!ENTITY entity-name "entity-value">

The entity is then referred to by writing &entity-name; in character data or in the value of an attribute.

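Two common uses are naming a character that is awkward to type (as in the first declaration below) and abbreviating a piece of text that is used repeatedly (as in the second):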

<!ENTITY Aring "&#xC5;">

Then &Aring; will be expanded by a validating XML processor into the value of &#xC5;, namely the character Å.

Recall that five entities are predefined in XML (see e.g. "Introduction to XML", §8), with the names lt, gt, amp, quot and apos.

<!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">

The reference &companyAddress; will then be expanded to the specified string (provided that the XML processor reads the DTD in which this entity is defined).
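Both entities can be seen at work in a complete document. The following is a minimal sketch; the letter, sender and body elements, and the text of the letter, are hypothetical and serve only to show a reference in an attribute value and references in character data.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- Hypothetical elements, assumed only for this sketch -->
  <!ELEMENT letter (sender, body)>
  <!ELEMENT sender EMPTY>
  <!ATTLIST sender address CDATA #REQUIRED>
  <!ELEMENT body (#PCDATA)>
  <!ENTITY Aring "&#xC5;">
  <!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
]>
<letter>
  <!-- An entity reference inside an attribute value -->
  <sender address="&companyAddress;"/>
  <!-- Entity references inside character data -->
  <body>Write to &companyAddress; or visit our office in &Aring;lesund.</body>
</letter>
```

If the address changes, only the entity declaration needs to be edited; every reference picks up the new text.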

Advanced use of entities, together with another similar advanced feature (conditionals) and the overriding of entity and attribute declarations, allows sophisticated parameterisation of XML documents and DTDs, which can help to make them more robust, flexible and easier to maintain, although at some significant cost in complexity to the reader of the raw XML files.

6 External DTDs

So far we have assumed that the document type declaration embeds the DTD in the XML document. However, this limits its use to one specific document, whereas one of the main uses of a DTD is to define an XML document type, of which there can be many instances. Clearly we don't want to repeat the DTD inside each instance.

The solution is to provide an external document type declaration. The simplest such declaration has the form:

<!DOCTYPE root-element-name SYSTEM "URL-of-DTD-document">

As with an internal document type declaration, the root-element-name must correspond to the name of the root element in the document. The quoted URL-of-DTD-document gives the URL of the external DTD. The URL can use the http: protocol, the file: protocol, or may be a relative reference (in which case it should be resolved relative to the XML document in which it appears). A validating XML processor must retrieve the DTD document and use it to validate the XML document.
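As a sketch of the SYSTEM form, the employees document from §4.7 could be rewritten to use an external DTD. The file name employees.dtd is hypothetical; it is assumed to sit in the same directory as the XML document, so that the relative reference resolves to it.

```xml
<?xml version="1.0"?>
<!-- Assumed contents of the hypothetical file employees.dtd
     (declarations only, with no DOCTYPE wrapper around them):

       <!ELEMENT employees (employee*)>
       <!ELEMENT employee EMPTY>
       <!ATTLIST employee employeeID   ID     #REQUIRED
                          manager      IDREF  #IMPLIED
                          subordinates IDREFS #IMPLIED>
-->
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees>
  <employee employeeID="E3247" manager="E8012" />
  <employee employeeID="E8012" subordinates="E3247" />
</employees>
```

Any number of instance documents can now point at the same DTD file, so the declarations are written and maintained in one place.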

In many cases, the external DTD may not need to be retrieved. For example, a browser processing XHTML (HTML written as valid XML) normally won't actually read the DTD for XHTML, but will have the logic of this DTD embedded in its code. However, since there are different versions of XHTML, it does need to know which DTD the XHTML is based on. The solution is to provide an external public document type declaration:

<!DOCTYPE root-element-name PUBLIC "public-id" "URL-of-DTD-document">

The public-id has no required special syntax, although a rather complicated convention has developed. In practice, all that is really important is that the name chosen should be unlikely to clash with any name for any other public DTD identifier. The URL is provided as a backup, to be used only if the XML processor does not know the public ID or cannot resolve it through some means (how it does this is not part of the XML standard).

As an example, an XHTML document may contain the declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
               "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The public ID is sufficient to tell any processing program that the content of the document corresponds to 'XHTML 1.0 Strict' (the EN refers to English). It's extremely unlikely that such a program will want to read and use the document at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, although all 979 lines of it are available there if needed.

It's also possible to have a mixed external/internal document type declaration. Either a SYSTEM or a PUBLIC external declaration can have an internal DTD added to it, e.g.:

<!DOCTYPE root-element-name PUBLIC "public-id" "URL-of-DTD-document"
[ internal-DTD-subset-content ]>

The ordering rules for processing multiple DTDs ensure that the internal entity and attribute declarations override those in external DTDs (element declarations cannot be overridden). This allows an external DTD to be modified for a specific XML document.
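A sketch of such an override is shown below. It assumes a hypothetical external DTD, letter.dtd, whose contents are described in the comment; the replacement address is likewise invented.

```xml
<?xml version="1.0"?>
<!-- Assumed contents of the hypothetical external DTD letter.dtd:

       <!ELEMENT letter (body)>
       <!ELEMENT body (#PCDATA)>
       <!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
-->
<!DOCTYPE letter SYSTEM "letter.dtd" [
  <!-- Read before the external DTD, so this declaration takes precedence -->
  <!ENTITY companyAddress "Acme Inc., Temporary Office, Never Never Land">
]>
<letter>
  <body>Write to us at &companyAddress;.</body>
</letter>
```

In this document the reference expands to the temporary address, while other documents that use letter.dtd without an internal subset continue to get the original one.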

7 Namespaces

Suppose two or more types of XML document are in use, based on different schemas. For example, the products sold by a company might be described by a 'product XML' schema. The advertising department might then decide to develop a 'brochure XML' to use in representing brochures independently of the eventual medium of distribution. Clearly it would be sensible to be able to use 'product XML' inside 'brochure XML'.

Namespaces in XML (W3C 2006b) were designed to allow this kind of combination. They have two main advantages:

The basic idea is to associate every name with a namespace. A namespace is an abstract concept, which identifies the type of the XML concerned. Since XML documents may be publicly available (e.g. via a web server), a type identifier should be unique in the world. The solution is to use a URI reference as the identifier for a namespace. So for example, a company might use http://www.companyname.com/products as the identifier for the namespace of its 'product XML' type. Note that there doesn't have to be anything at this location. Using a URI is just a practical way of being sure that a name is unique. When used as namespaces, URIs are simply treated as strings and compared for equality character by character.

Every XML name is then 'really' made up of the pair (namespace name, local name). In the XML documents we have used so far, where we didn't declare a namespace, the namespace name was effectively just the empty string.

To declare that all XML names occurring within an element, including the name of the element itself, are part of a particular namespace, a special attribute xmlns is attached to the element. For example:

<brochure xmlns="http://www.companyname.com/brochure">
  <title ...>
  ...
</brochure>

An XML processor must then treat the two XML names in the fragment above as 'expanded names', i.e. the pairs (http://www.companyname.com/brochure, brochure) and (http://www.companyname.com/brochure, title).

To include some XML from a different namespace, we simply add another xmlns attribute in the appropriate place. For example:

<brochure xmlns="http://www.companyname.com/brochure">
  <title> ... </title>
  <product xmlns="http://www.companyname.com/product">
    <title> ... </title>
  </product>
</brochure>

The first title element now has the full expanded name (http://www.companyname.com/brochure, title), the second the full expanded name (http://www.companyname.com/product, title).

This style of declaring namespace names is fine when there are large 'blocks' of XML in one namespace or another. It's tedious if for some reason we need to mix up XML from different namespaces:

  ...
  <title xmlns="http://www.companyname.com/brochure"> ... </title>
  <source xmlns="http://www.companyname.com/product"> ... </source>
  <author xmlns="http://www.companyname.com/brochure"> ... </author>
  <price xmlns="http://www.companyname.com/product"> ... </price>
  ...

The solution is to associate a 'temporary' XML name with a namespace, and then use this name as a prefix to create a qualified name. A qualified name consists of a prefix, a colon, and the local name. The prefix is declared by using an attribute made up of xmlns, a colon, and the prefix name. For example:

<br:brochure xmlns:br="http://www.companyname.com/brochure"
             xmlns:pr="http://www.companyname.com/product">
  <br:title> ... </br:title>
  <pr:product>
    <pr:title> ... </pr:title>
  </pr:product>
</br:brochure>

Then the name of the element br:title expands to (http://www.companyname.com/brochure, title), whereas the name of the element pr:title expands to (http://www.companyname.com/product, title).

The two styles can be mixed. One namespace can be declared as the default for that element, i.e. without a prefix, and any others can be declared as requiring prefixes. For example:

<brochure xmlns="http://www.companyname.com/brochure"
          xmlns:pr="http://www.companyname.com/product">
  <title> ... </title>
  <pr:product>
    <pr:title> ... </pr:title>
  </pr:product>
</brochure>

In every case, the actual name used as a prefix is arbitrary. It's just a temporary reference within that XML element to the URI which is the name of the namespace.
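To illustrate that prefixes are arbitrary, the fragment below is a sketch that uses the prefixes b and p (chosen only for this example) in place of br and pr. Since the prefixes are bound to the same URIs, every element has exactly the same expanded name as in the br/pr version above.

```xml
<!-- b and p are arbitrary prefixes; only the URIs they map to matter -->
<b:brochure xmlns:b="http://www.companyname.com/brochure"
            xmlns:p="http://www.companyname.com/product">
  <b:title> ... </b:title>
  <p:product>
    <p:title> ... </p:title>
  </p:product>
</b:brochure>
```

A namespace-aware processor treats the two versions as equivalent, because it compares expanded names rather than prefixes.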

A significant problem with using XML namespaces at present is that they were introduced after DTDs, with the result that DTDs do not easily support them. For this reason, we will not use namespaces in any serious way in this module. However, you need to be aware of them and recognise them when used in XML.

8 Other Schema Languages

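DTDs are a relatively simple, yet powerful way of providing a 'schema' for XML documents. They have some drawbacks; in particular, they cannot constrain the form of character data or attribute values (e.g. to require a number; see §3.4 and §4.4), they cannot restrict which type of element an IDREF refers to (see §4.6), and they do not easily support namespaces (see §7).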

Many alternative XML schema languages have been proposed, but as of now, none has come to dominate the others and thus replace DTDs. Three seem to have achieved sufficient momentum to be widely used and available and are actively supported.

References/Bibliography

See the other handouts for this module, which are available online.

Elliotte Rusty Harold (2003). Effective XML: 50 Specific Ways to Improve Your XML. Addison Wesley. 0-321-15040-6.

W3C (2006a). The Extensible Markup Language (XML) 1.0 (Fourth Edition).
http://www.w3.org/TR/2006/REC-xml-20060816/.

W3C (2006b). Namespaces in XML 1.0 (Second Edition).
http://www.w3.org/TR/2006/REC-xml-names-20060816.

Footnotes

[1] This document is based on an original by Alan Sexton © 2007, modified by permission.
