Valid XML and DTDs
Peter Coxhead [1]
Contents
1 Validity
2 Document Type Declaration
3 Element Declarations
3.1 The Any Content Model
3.2 The Empty Content Model
3.3 Element Only Content Models
3.4 Mixed Content Models
4 Attribute Declarations
4.1 Enumerated List Attribute Type
4.2 Name Token Attribute Type
4.3 Multi-Name Token Attribute Type
4.4 String Attribute Type
4.5 Identifier Attribute Type
4.6 Identifier Reference Attribute Type
4.7 Multi-Identifier Reference Attribute Type
5 Entities
6 External DTDs
7 Namespaces
8 Other Schema Languages
References/Bibliography
1 Validity
A well-formed XML document is, essentially, one whose syntax is correct according to the XML specification (W3C 2006a). Only well-formed XML documents can be handled by XML processors. However, being well-formed doesn't guarantee that an XML document has the contents it should have. Consider an XML document which lists all employees. We are likely to want every employee in the document to have a first name and a family name. Thus although the following XML document is well-formed, it doesn't meet our requirements:
<?xml version="1.0"?>
<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname>
    <familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
We would probably also like to ensure that every employee has an ID, but Joe Bloggs doesn't.
We could rely on an XML processor to carry out the necessary checks. Java has a library which makes reading and processing XML quite easy, so we could write a Java program which read in the XML document (provided it was well-formed of course) and applied a series of checks. Experience shows that this isn't the right approach:
- Writing programs (algorithms) to perform checks is tedious and error-prone.
- It doesn't produce a specification of the expected structure of the XML document which people can read and use as a 'template' to create XML.
What we need is something like a 'type specification' for an XML document. For example, XHTML is a version of HTML which is written as XML; however, it must be more than well-formed, it must also be interpretable by a web browser, which determines which elements and attributes it contains, how they are ordered, etc. As another example, MathML is a way of writing mathematical formulae in XML; to be acceptable to a MathML processor, the XML document must conform to the MathML specification.
If we can specify the type of an XML document, then XML processors can check not only whether the document is well-formed, but also whether it is valid, i.e. whether it matches the specification for that type. An XML processor which performs such a check is called a validating processor. Note that an XML document can be well-formed but not valid, but can't be valid but not well-formed. An XML type specifies a whole class of XML documents that conform to that type. A document that does conform to the type is called an instance of the type.
The type can include a specification of which elements are allowed in the document, which elements can, should or must be nested inside which other elements, which attributes are allowed, not allowed or required in each element, and much more besides. Many possible errors in XML documents can be caught without the programmer having to write any code, and creators of XML documents have a clear specification of what precisely they must do in order to meet the requirements of the type.
There are a number of different systems for specifying the type of an XML document. They can all be called schema languages, because they describe, in a sense, the shape of an XML document. (This can cause some confusion, because there is one specific XML schema language called the W3C XML Schema Language.) The simplest system, and the one built into the XML specification (W3C 2006a), is called document type definition, or DTD. Some other schema languages will be briefly described in §8 below.
2 Document Type Declaration
To make an XML document valid it can be associated with a Document Type Definition or DTD. The document type declaration is where this happens. Every valid XML document must have one (and only one), which must come after the XML declaration (if there is one) but before the first (root) element in the document.
The simplest (but not usually the best) approach is to embed the DTD into the XML file. In this case the document type declaration actually contains the DTD. Thus we could expand the 'employees' XML document presented above to the following:
<?xml version="1.0"?>
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee (firstname, familyname, comment)>
  <!ATTLIST employee
    id      ID    #REQUIRED
    manager IDREF #IMPLIED>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT familyname (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>
<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname>
    <familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
Although this document is well-formed, it is now INVALID, since it does not conform to the DTD, which specifies that:
- The employee element has a list of attributes, of which the first, id, is required while the second, manager, is optional; however, the first employee element does not have an id attribute.
- The employee element must always contain the three sub-elements firstname, familyname and comment, in exactly that order; however, the second employee element does not contain a firstname element.
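For comparison, here is a minimal sketch of a version of the document that should conform to the same DTD. The id value e1184 and the first name Asha are invented purely for illustration; they do not come from the original example.

```xml
<?xml version="1.0"?>
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee (firstname, familyname, comment)>
  <!ATTLIST employee
    id      ID    #REQUIRED
    manager IDREF #IMPLIED>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT familyname (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>
<employees>
  <!-- id added (hypothetical value); manager still refers to e3257 -->
  <employee id="e1184" manager="e3257">
    <firstname>Joe</firstname>
    <familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <!-- firstname added (hypothetical value); an empty comment is allowed -->
  <employee id="e3257">
    <firstname>Asha</firstname>
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
```

Note that the empty comment element is acceptable, because a (#PCDATA) content model permits zero characters.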
We will now look at the components of DTDs in turn.
3 Element Declarations
Element type declarations are of the form:
<!ELEMENT element-name content-rule >
The element-name is, of course, just the
name of the element that the rule applies to, while the
content-rule specifies the constraints that
apply to it. There are a number of different forms for the
content-rule, called content models.
3.1 The Any Content Model
Here the content-rule has the value ANY, and the corresponding element in the XML document can have any mix of character data and elements in any order. If no DTD is provided for an XML document, this is the default assumed for every element. Clearly this content model should be avoided if at all possible, since it effectively avoids specifying the format of the element. The following declaration says that every a element may have any content:
<!ELEMENT a ANY>
3.2 The Empty Content Model
Here the content-rule has the value EMPTY, and the corresponding element in the XML document must contain no content, neither sub-elements nor character data. Thus the following declaration says that every a element must have no content:
<!ELEMENT a EMPTY>
3.3 Element Only Content Models
Here an expression involving element names and some special characters (representing operators) is used to describe patterns of child elements that can occur in the content of the target element. The expressions are built up recursively from a number of components. Expressions must be fully parenthesised in order to exclude any possibility of ambiguity or the need to specify the order of precedence of pattern operators.
- Element Name: an element name means that an element of that name is required at this position. The following declaration says that an element named a must have precisely one child element whose name is b:
<!ELEMENT a (b)>
Technically, an element name can only appear inside sequences or choices, both of which must be enclosed in parentheses. Hence the parentheses around the b in the example above are there because it is actually the only entry in a sequence of length 1. Thus the following is INVALID:
<!ELEMENT a b>
(The reason for this restriction is so that there can be elements named ANY and EMPTY without conflicting with the corresponding content models.)
- Sequence: a comma is used to separate expressions which must appear in the order given. The following declaration says that an element named a must have precisely three child elements whose names are b, c and d, in that order:
<!ELEMENT a (b, c, d)>
Recall that every sequence must be enclosed in parentheses.
- Choice: a vertical bar | is used to separate expressions of which any one must appear. The following says that an element named a must have precisely one child element whose name is either b, c or d:
<!ELEMENT a (b | c | d)>
- Zero or More Repeats: an asterisk * is used to follow a sequence, a choice or an element name which must be repeated 0 or more times. The following says that an element named a must have 0 or more b child elements (and nothing else):
<!ELEMENT a (b*)>
Note that this is exactly the same as writing:
<!ELEMENT a (b)*>
- One or More Repeats: a plus + is used to follow a sequence, a choice or an element name which must be repeated 1 or more times. The following says that an element named a must have at least one b child element, but possibly many (and nothing else):
<!ELEMENT a (b+)>
- Optionality: the question mark ? is used to follow a sequence, a choice or an element name which is optional, i.e. may or may not occur. The following says that an element named a may or may not have a single b child element, but must contain no other elements:
<!ELEMENT a (b?)>
By combining these descriptions, very complex constraints can be constructed:
<!ELEMENT cv (preface, (qualification | experience)+, hobbies?, referee*)>
Here the cv element must start with a preface
element. This must be followed by a repeating group that has to occur at least
once. Each cycle of this group can contain either a qualification
element or an experience element; i.e. there can be a sequence of
qualification and experience elements in any order and
of any length greater than 0. Next there might or might not be a single
hobbies element. Finally there can be a sequence of any number
(including zero) of referee elements.
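As an illustration, the following sketch shows one document that should satisfy the cv declaration. It assumes, purely for the sake of a self-contained example, that each child element simply holds character data; the text inside the elements is invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE cv [
  <!ELEMENT cv (preface, (qualification | experience)+, hobbies?, referee*)>
  <!-- Assumed for illustration: every child holds plain character data -->
  <!ELEMENT preface (#PCDATA)>
  <!ELEMENT qualification (#PCDATA)>
  <!ELEMENT experience (#PCDATA)>
  <!ELEMENT hobbies (#PCDATA)>
  <!ELEMENT referee (#PCDATA)>
]>
<cv>
  <preface>Curriculum vitae of Joe Bloggs</preface>
  <!-- the repeating group: three cycles, in mixed order -->
  <experience>Shop assistant</experience>
  <qualification>BSc Computer Science</qualification>
  <experience>Junior programmer</experience>
  <!-- hobbies is omitted, which the ? permits -->
  <referee>A. Patel</referee>
  <referee>B. Jones</referee>
</cv>
```

Moving a referee element in front of the preface, or leaving out both qualification and experience, would make the document invalid.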
3.4 Mixed Content Models
Here we can finally specify that an element contains character data, and how it can be mixed with child elements. There are two forms:
- If element a contains character data but no child elements at all, it is declared by:
<!ELEMENT a (#PCDATA)>
#PCDATA historically comes from 'Parsed Character Data', but is not a good name since it corresponds to what in XML is called just 'character data'.
- If element a contains any number of the elements b, c, d, ... in any order, interleaved with any amount of character data, it is declared by:
<!ELEMENT a (#PCDATA | b | c | d | ...)*>
These are the ONLY kinds of mixed content models allowed. There's no way, for example, of specifying that an element starts with character data but then ends with a given element. For this reason, the second form of mixed content model should be avoided unless it really is the case that the child elements can occur anywhere within character data.
Neither form provides any way of specifying what kind of character data can occur, e.g. that it must be an integer or consist of exactly three words.
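A short sketch of the second form in use is given below. The element names para, emph and code are hypothetical, chosen only to suggest running text with inline markup.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!-- para: character data with emph and code children anywhere inside it -->
  <!ELEMENT para (#PCDATA | emph | code)*>
  <!-- emph and code use the first form: character data only -->
  <!ELEMENT emph (#PCDATA)>
  <!ELEMENT code (#PCDATA)>
]>
<para>Use <code>ANY</code> only as a <emph>last</emph> resort.</para>
```

The DTD cannot require, say, that code comes before emph, or that either appears at all; any interleaving, including none, is valid.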
4 Attribute Declarations
Attribute Type Declarations are of the form:
<!ATTLIST elem-name
  att-name1 att-type1 att-status1
  att-name2 att-type2 att-status2
  ...       ...       ...         >
The elem-name identifies the element to
which this list of attribute declarations applies. The
att-name identifies which particular
attribute of the element is being declared. The values of
att-type and
att-status then specify the form of the
attribute.
The att-status part of the declaration is
the easiest to understand. It specifies whether the attribute is required or
optional and, if optional, whether a default value should be assumed if not
given in the XML document.
- #REQUIRED means that the attribute must be present in every target element in the XML document; hence there is no need for a default value.
- #IMPLIED means that the attribute is optional in every target element in the XML document (i.e. may or may not be present), but that no default value is provided if the attribute is omitted. (I find IMPLIED to be an odd name for this case; OPTIONAL would surely have been better.)
- "att-value" means that the attribute is optional in every target element in the XML document, but that if it is not present then it must be added by a validating XML processor and given the value of att-value (without the quotes).
- #FIXED "att-value" means that if the attribute is present in a target element in the XML document then it must have precisely the given value; if it is not present then it must be added by a validating XML processor and given the value of att-value (without the quotes).
The last two forms create a problem. A non-validating XML processor is only required to check that an XML document is well-formed; it is not required to process the DTD fully in order to check whether the document is valid. However, if it doesn't fully process the DTD then it may not be able to add default or fixed attributes. Consequently, if the processor constructs a tree to represent the XML document (see e.g. "Introduction to XML", §9), a validating XML processor may create a different tree to that created by a non-validating XML processor.
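The effect of the last two forms can be sketched with a small example. The element name order and its attributes are hypothetical; the second comment shows what an application would be expected to see once the processor has applied the declarations, not something written in the file.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!ELEMENT order EMPTY>
  <!ATTLIST order
    number   NMTOKEN #REQUIRED
    priority NMTOKEN "normal"
    seller   CDATA   #FIXED "ACME INC.">
]>
<!-- As written in the document: only the required attribute is given. -->
<order number="A17"/>
<!-- As reported to an application by a processor that applies the DTD:
     <order number="A17" priority="normal" seller="ACME INC."/> -->
```

A processor that skips the attribute declarations would report only the number attribute, which is exactly the discrepancy described above.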
Returning to the att-type part of the
attribute declaration, there are ten different forms, of which seven fall within
the scope of this module. Each will be described in turn.
4.1 Enumerated List Attribute Type
This form is used to specify that the value of the attribute can only be one of a fixed set of values given in a vertical bar separated list. The list must be enclosed in parentheses. Note that each value must be a name token (see §4.2), so it cannot, for instance, contain white space. Some examples:
<!ATTLIST employee status (fulltime | parttime) "fulltime">
This specifies that employee elements have an attribute
status, which can only have one of the two values
fulltime or parttime. If the attribute is omitted,
then a validating XML processor must add it with the value
fulltime. If the overwhelming majority of employees are full time,
then an attribute declaration like this avoids the need to include
status="fulltime" in most of the employee elements.
However, it usually makes the eventual code written to process the XML document
more complex, since a non-validating processor may not add the default; it is
safer to treat an employee as full time if the status attribute is either
absent or has the value fulltime.
<!ATTLIST menu cuisine (French | Italian | Indian | Chinese) #REQUIRED>
This specifies that menu elements must have the attribute
cuisine, which can only have one of the values French,
Italian, Indian or Chinese. If the
attribute is omitted, then a validating XML processor must report an error.
4.2 Name Token Attribute Type
The keyword NMTOKEN is used to indicate this type, which specifies that the value of the attribute can be any arbitrary 'name token'. A name token is like a name, but may begin with any name character; in particular it can, in addition, start with a digit, a hyphen or a full stop. In the example above, if it had been decided to use a single word to refer to a type of cuisine, but it was required to be able to extend the types of cuisine to more than a fixed list, the appropriate declaration would be:
<!ATTLIST menu cuisine NMTOKEN #REQUIRED>
This would allow the following XML to be valid:
<menu cuisine="Greek"> ... </menu>
but not:
<menu cuisine="Greek and Italian"> ... </menu>
4.3 Multi-Name Token Attribute Type
The keyword NMTOKENS is used to indicate this type, which is used to allow an attribute to have a white space separated list of values, each of type NMTOKEN. Thus if we declare the cuisine attribute of menu via:
<!ATTLIST menu cuisine NMTOKENS #REQUIRED>
then the following XML is valid:
<menu cuisine="Greek Italian"> ... </menu>
(Actually the value "Greek and Italian" would also be valid, but
the spirit of this attribute type is to have a list of name tokens each with its
own meaning.)
4.4 String Attribute Type
The keyword CDATA is used to indicate this type. As with the keyword #PCDATA for elements, the name is historic, but misleading. Attributes cannot have general character data values. They cannot contain CDATA sections, so it's particularly odd to use CDATA as the keyword. CDATA means that the attribute can have any string value. For example:
<!ATTLIST purchase
  ordernumber CDATA #REQUIRED
  customerid  CDATA #IMPLIED
  seller      CDATA #FIXED "ACME INC."
  priority    CDATA "normal">
All of the four attributes of a purchase element can have string
values (and so, for example, can contain white space). The attribute
ordernumber must be present, whereas customerid can be
omitted, but has no default value if it is. The value of seller is
always the string 'ACME INC.'. The attribute priority
may be omitted but if so defaults to normal.
It should be clear that CDATA offers only a very weak
specification for an attribute value, and so should be avoided if possible. The
order number and customer ID are likely to be numbers in a real application.
DTDs don't provide any way of specifying this. However, NMTOKEN is
likely to be a better choice than CDATA, since it rules out
embedded white space. If the attribute priority is to be allowed to
have values like rush order then an enumerated list can't be used
(because its values cannot contain white space), nor
can NMTOKEN, and CDATA is the only choice. However, it
would probably be better instead to use values like rushOrder or
rush_order, which allow the more restricted NMTOKEN as
the attribute type. Only the attribute seller above really needs to
be CDATA.
4.5 Identifier Attribute Type
The keyword ID is used to indicate this type. It means that the attribute is an identifier for the target element and so can be used in references (see e.g. "Introduction to XML", §5). Note that:
- No default values can be specified for identifier attributes. Thus the only att-status declarations that can be used for this type are #REQUIRED and #IMPLIED.
- Identifier attribute values must be names, not arbitrary strings; in particular, they cannot be numbers, so we have to use something like C245892 rather than 245892 to make a customer number into an identifier.
- No element can have more than one identifier attribute declared for it.
- No two different elements in an XML document can have the same identifier attribute value assigned to them, even if the elements and attributes have different names.
Example:
<!ATTLIST book
  bookid          ID    #REQUIRED
  publicationdate CDATA #IMPLIED >
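The rules above can be seen in a small sketch built around this declaration. The library root element, the book content model and the identifier values are assumptions made only to give a complete document.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- library and the (#PCDATA) content of book are assumed for illustration -->
  <!ELEMENT library (book*)>
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
    bookid          ID    #REQUIRED
    publicationdate CDATA #IMPLIED>
]>
<library>
  <book bookid="b1" publicationdate="2003">First title</book>
  <!-- publicationdate is #IMPLIED, so it may be left out -->
  <book bookid="b2">Second title</book>
  <!-- Each of the following would make the document invalid:
       a book with bookid="2"  (the value is not a name)
       a third book with bookid="b1"  (the value is already in use)
       a book with no bookid attribute  (it is #REQUIRED) -->
</library>
```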
4.6 Identifier Reference Attribute Type
The keyword IDREF is used to indicate this type. It means that the attribute is used to refer to an element in the current XML document which has an attribute of type ID, i.e. that the attribute is used to make a reference. Note that:
- Any of the four att-status declarations can be used with this attribute type.
- Just as with ID attribute types, the values for IDREF attributes must be names, not arbitrary strings, since the values must be those used for an attribute of type ID.
- An element can have multiple different attributes of type IDREF (which will usually be used to refer to different kinds of element, but don't have to).
- For every attribute of type IDREF in a valid XML document, there must exist an element in the document with an attribute of type ID whose attribute value matches that of the IDREF attribute. This makes sure that every attribute of type IDREF which has a value actually refers to an element.
For example:
<!ATTLIST division
  divisionID ID #REQUIRED>
<!ATTLIST employee
  employeeID ID    #REQUIRED
  division   IDREF #REQUIRED
  manager    IDREF #IMPLIED>
These two declarations ensure that division and
employee elements have unique identifiers (IDs),
divisionID and employeeID respectively. An employee
has to have an ID reference value in a division attribute (e.g. to
say that he or she is employed in that division), and may have an ID reference
value in a manager attribute (e.g. to say that the employee with
that employeeID is his or her manager). Notice that the only
requirement enforced by validity is that the value of an attribute of type
IDREF is the value of an attribute of type ID in the
same XML document. Validity checks don't prevent us from erroneously putting an
employee ID value in the division attribute, or a division ID value
in the manager attribute.
Attributes of type IDREF cannot be used to make references to
elements in other XML documents, since the validity check will fail. Generally,
the NMTOKEN type must be used for this purpose.
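A sketch of a document using the division and employee declarations is given below. The company root element, the EMPTY content models and all identifier values are assumptions for illustration. The last employee shows the weakness just described: the document should still be valid even though its manager attribute points at a division.

```xml
<?xml version="1.0"?>
<!DOCTYPE company [
  <!-- company and the EMPTY content models are assumed for illustration -->
  <!ELEMENT company (division*, employee*)>
  <!ELEMENT division EMPTY>
  <!ELEMENT employee EMPTY>
  <!ATTLIST division
    divisionID ID #REQUIRED>
  <!ATTLIST employee
    employeeID ID    #REQUIRED
    division   IDREF #REQUIRED
    manager    IDREF #IMPLIED>
]>
<company>
  <division divisionID="D01"/>
  <division divisionID="D02"/>
  <!-- no manager attribute: allowed, since it is #IMPLIED -->
  <employee employeeID="E8012" division="D01"/>
  <employee employeeID="E3247" division="D01" manager="E8012"/>
  <!-- valid, but semantically wrong: the manager is a division -->
  <employee employeeID="E3248" division="D02" manager="D01"/>
</company>
```

By contrast, manager="E9999" would be invalid, because no element in the document has an ID attribute with that value.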
4.7 Multi-Identifier Reference Attribute Type
The keyword IDREFS is used to indicate this type. The value of the attribute is a list of white space separated IDREF values, each with the same constraints as a single identifier reference. For example:
<!ATTLIST employee
  employeeID   ID     #REQUIRED
  manager      IDREF  #IMPLIED
  subordinates IDREFS #IMPLIED>
Every employee may have a single manager and may have multiple subordinates. A corresponding valid XML document might be:
<?xml version="1.0"?>
<!DOCTYPE employees [
  <!ELEMENT employees (employee*)>
  <!ELEMENT employee EMPTY>
  <!ATTLIST employee
    employeeID   ID     #REQUIRED
    manager      IDREF  #IMPLIED
    subordinates IDREFS #IMPLIED>
]>
<employees>
  <employee employeeID="E3247" manager="E8012" />
  <employee employeeID="E3248" manager="E8012" />
  <employee employeeID="E3249" manager="E8012" />
  <employee employeeID="E8012" subordinates="E3247 E3248 E3249" />
</employees>
5 Entities
Entities are a somewhat complicated part of the XML standard, but we will only cover a very small part of them in this module. Essentially, entities allow pieces of text, and even fragments of XML or DTDs, to be associated with entity names. Then whenever an entity reference appears in the XML document, the XML processor substitutes the associated text.
The simplest kind of entity ('an internal general entity') is declared using the form:
<!ENTITY entity-name "entity-value">
The entity is then referred to by writing
&entity-name; in character data or in the
value of an attribute.
Two common uses are:
- To avoid numeric character references. For example, the Unicode character Å can be inserted into an XML document by using the character reference &#197;. More memorable is to define an entity:
<!ENTITY Aring "&#197;">
Then &Aring; will be expanded by a validating XML processor into the value of &#197;, namely the character Å.
Recall that five entities are predefined in XML (see e.g. "Introduction to XML", §8), with the names lt, gt, amp, quot and apos.
- To simplify and standardize entering 'boilerplate' text. For example, an XML document might contain frequent uses of the address of a company. Rather than entering it in full each time, an entity could be defined, e.g.:
<!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
The reference &companyAddress; will then be expanded to the specified string (provided that the XML processor reads the DTD in which this entity is defined).
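Both uses can be combined in one small document, sketched below. The letter element and the sentence inside it are invented for illustration; the entity named Aring is declared here with the numeric character reference for Å.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- letter is a hypothetical element holding character data only -->
  <!ELEMENT letter (#PCDATA)>
  <!ENTITY Aring "&#197;">
  <!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
]>
<letter>Please reply to &companyAddress; and mark it for Mr &Aring;berg.</letter>
```

After the references have been expanded, the character data of the letter element should read: Please reply to Acme Inc., Lookout Hill, Never Never Land and mark it for Mr Åberg.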
Advanced use of entities, together with another similar advanced feature (conditionals) and the overriding of entity and attribute declarations, allows sophisticated parameterisation of XML documents and DTDs. This can help to make them more robust, flexible and easier to maintain, although at some significant cost in complexity to the reader of the raw XML files.
6 External DTDs
So far we have assumed that the document type declaration embeds the DTD in the XML document. However, this limits its use to one specific document, whereas one of the main uses of a DTD is to define an XML document type, of which there can be many instances. Clearly we don't want to repeat the DTD inside each instance.
The solution is to provide an external document type declaration. The simplest such declaration has the form:
<!DOCTYPE root-element-name SYSTEM "URL-of-DTD-document">
As with an internal document type declaration, the
root-element-name must correspond to the name
of the root element in the document. The quoted
URL-of-DTD-document gives the URL of the
external DTD. The URL can use the http: protocol, the
file: protocol, or may be a relative reference (in which case it
should be resolved relative to the XML document in which it appears). A
validating XML processor must retrieve the DTD document and use it to validate
the XML document.
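As a sketch, an instance of the 'employees' type from §2 might refer to an external DTD as follows. The file name employees.dtd is an assumption: it stands for a file, stored alongside the XML document, that contains just the ELEMENT and ATTLIST declarations from §2 without the surrounding DOCTYPE wrapper. The id value and the first name are invented.

```xml
<?xml version="1.0"?>
<!-- Assumes a file employees.dtd (hypothetical name) in the same directory,
     holding only the declarations that appeared between [ and ] in section 2. -->
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees>
  <employee id="e3257">
    <firstname>Asha</firstname>
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
```

Any number of documents can carry the same declaration, so the DTD is written and maintained in one place only.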
In many cases, the external DTD may not need to be retrieved. For example, a browser processing XHTML (HTML written as valid XML) normally won't actually read the DTD for XHTML, but will have the logic of this DTD embedded in its code. However, since there are different versions of XHTML, it does need to know which DTD the XHTML is based on. The solution is to provide an external public document type declaration:
<!DOCTYPE root-element-name PUBLIC "public-id" "URL-of-DTD-document">
The public-id has no required special
syntax, although a rather complicated convention has developed. In practice, all
that is really important is that the name chosen should be unlikely to clash
with any name for any other public DTD identifier. The URL is provided as a
backup, to be used only if the XML processor does not know the public ID or
cannot resolve it through some means (how it does this is not part of the XML
standard).
As an example, an XHTML document may contain the declaration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
The public ID is sufficient to tell any processing program that the content
of the document corresponds to 'XHTML 1.0 Strict' (the EN refers to
English). It's extremely unlikely that such a program will want to read and use
the document at
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd,
although all 979 lines of it are available there if needed.
It's also possible to have a mixed external/internal document type declaration. Either a System or a Public external declaration can have an internal DTD added to it, e.g.:
<!DOCTYPE root-element-name PUBLIC "public-id" "URL-of-DTD-document" [
  internal-DTD-subset-content
]>
The ordering rules for processing multiple DTDs ensure that the internal entity and attribute declarations override those in external DTDs (element declarations cannot be overridden). This allows an external DTD to be modified for a specific XML document.
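A sketch of such a modification is shown below. It assumes a hypothetical external file purchase.dtd which declares purchase as an EMPTY element with the attribute list from §4.4 (where priority defaults to normal) and which also declares the companyAddress entity from §5. The replacement values are invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE purchase SYSTEM "purchase.dtd" [
  <!-- These declarations are read before those in purchase.dtd, and the
       first declaration of an attribute or entity is the one that counts. -->
  <!ATTLIST purchase priority CDATA "urgent">
  <!ENTITY companyAddress "Acme Inc., New Street, Never Never Land">
]>
<!-- priority is omitted here, so in this document it defaults to urgent -->
<purchase ordernumber="10045"/>
```

Other documents that refer to purchase.dtd without an internal subset are unaffected and keep the default of normal.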
7 Namespaces
Suppose two or more types of XML document are in use, based on different schemas. For example, the products sold by a company might be described by a 'product XML' schema. The advertising department might then decide to develop a 'brochure XML' to use in representing brochures independently of the eventual medium of distribution. Clearly it would be sensible to be able to use 'product XML' inside 'brochure XML'.
Namespaces in XML (W3C 2006b) were designed to allow this kind of combination. They have two main advantages:
- They give an XML processor an easy way to decide which type of XML an element belongs to, and thus make it easy to process elements of different types differently.
- They deal with the problem of 'name clashes' which occur when elements in one type of XML have the same name as elements in another.
The basic idea is to associate every name with a
namespace. A namespace is an abstract concept,
which identifies the type of the XML concerned. Since XML documents may be
publicly available (e.g. via a web server), a type identifier should be unique
in the world. The solution is to use a URI reference as the identifier for a
namespace. So for example, a company might use
http://www.companyname.com/products as the identifier for the
namespace of its 'product XML' type. Note that there doesn't have to be anything
at this location. Using a URI is just a practical way of being sure that a name
is unique. When used as namespaces, URIs are simply treated as strings and
compared for equality character by character.
Every XML name is then 'really' made up of the pair (namespace name, local name). In the XML documents we have used so far, where we didn't declare a namespace, the namespace name was effectively just the empty string.
To declare that all unprefixed element names occurring within an element,
including the name of the element itself, are part of a
particular namespace, a special attribute xmlns is attached to the
element. (A default declaration of this kind does not apply to attribute
names.) For example:
<brochure xmlns="http://www.companyname.com/brochure">
  <title ...>
  ...
</brochure>
An XML processor must then treat the two XML names in the fragment above as
'expanded names', i.e. the pairs
(http://www.companyname.com/brochure,
brochure),
(http://www.companyname.com/brochure, title).
To include some XML from a different namespace, we simply add another
xmlns attribute in the appropriate place. For example:
<brochure xmlns="http://www.companyname.com/brochure">
  <title> ... </title>
  <product xmlns="http://www.companyname.com/product">
    <title> ... </title>
  </product>
</brochure>
The first title element now has the full expanded name
(http://www.companyname.com/brochure, title), the
second the full expanded name (http://www.companyname.com/product,
title).
This style of declaring namespace names is fine when there are large 'blocks' of XML in one namespace or another. It's tedious if for some reason we need to mix up XML from different namespaces:
...
<title xmlns="http://www.companyname.com/brochure"> ... </title>
<source xmlns="http://www.companyname.com/product"> ... </source>
<author xmlns="http://www.companyname.com/brochure"> ... </author>
<price xmlns="http://www.companyname.com/product"> ... </price>
...
The solution is to associate a 'temporary' XML name with a namespace, and
then use this name as a prefix to create a
qualified name. A qualified name consists of a
prefix, a colon, and the local name. The prefix is declared by using an
attribute made up of xmlns, a colon, and the prefix name. For
example:
<br:brochure xmlns:br="http://www.companyname.com/brochure"
             xmlns:pr="http://www.companyname.com/product">
  <br:title> ... </br:title>
  <pr:product>
    <pr:title> ... </pr:title>
  </pr:product>
</br:brochure>
Then the name of the element br:title expands to
(http://www.companyname.com/brochure, title), whereas
the name of the element pr:title expands to
(http://www.companyname.com/product, title).
The two styles can be mixed. One namespace can be declared as the default for that element, i.e. without a prefix, and any others can be declared as requiring prefixes. For example:
<brochure xmlns="http://www.companyname.com/brochure"
          xmlns:pr="http://www.companyname.com/product">
  <title> ... </title>
  <pr:product>
    <pr:title> ... </pr:title>
  </pr:product>
</brochure>
In every case, the actual name used as a prefix is arbitrary. It's just a temporary reference within that XML element to the URI which is the name of the namespace.
A significant problem with using XML namespaces at present is that they were introduced after DTDs, with the result that DTDs do not easily support them. For this reason, we will not use namespaces in any serious way in this module. However, you need to be aware of them and recognize them when used in XML.
8 Other Schema Languages
DTDs are a relatively simple, yet powerful way of providing a 'schema' for XML documents. They have some drawbacks; in particular:
- DTDs are very limited in the data types they can specify. Character data and attribute values cannot be constrained to be integers, reals, booleans, valid dates, etc., yet such constraints are often required.
- DTDs do not allow conditional constraints. It's not possible to specify relationships among attributes, for example, such as saying that if an employee element has the value manager for the status attribute then there must be a subordinates attribute.
- DTDs are not expressed in XML, but in their own unique notation. This makes it awkward for an XML processor which generates new XML to generate an appropriate DTD. DTDs require their own separate parsers and are not easily manipulated.
Many alternative XML schema languages have been proposed, but as of now, none has come to dominate the others and thus replace DTDs. Three seem to have achieved sufficient momentum to be widely used and available and are actively supported.
- W3C XML Schema Language is very powerful but also very complex. Its particular strength is its support for the typing of attribute values and character data; it contains a large number of primitive types that can be used (including the usual integer, date, etc.) as well as allowing the user to define new, bespoke data types. It also has excellent support for the validation of parent-child relationships. However, its complexity has constrained the development of supporting tools so that defining a schema in this language requires considerable expertise. Thus it can be considered to be a heavyweight solution that is appropriate for highly engineered XML application design, but is somewhat too complex and labour intensive for smaller programming situations.
- RELAX NG is a relatively recent entry into the schema language wars. It is far simpler and cleaner than the W3C XML Schema language. Its special strength is that it is much easier to read and write. It can specify anything that the W3C Schema language can, with the exception that it cannot define new, bespoke, complex data types for attribute values and character data.
- Schematron uses the XPath language to refer to components of an XML document. XPath allows navigation through an XML document, using 'trails' of parents, children, siblings, etc. Schematron can be used to specify complex constraints that are not specifiable with the other schema languages. However, since it treats as valid anything that doesn't conflict with the constraints it specifies (in contrast with the other languages which only allow what they specify), many constraints usually have to be written to match a few lines of a DTD. Furthermore, like DTDs, Schematron provides no data type support for character data or attribute values.
References/Bibliography
See the other handouts for this module, which are available online.
Elliotte Rusty Harold (2003). Effective XML: 50 Specific Ways to Improve Your XML. Addison Wesley. 0-321-15040-6.
W3C (2006a). The Extensible Markup Language (XML) 1.0 (Fourth Edition). http://www.w3.org/TR/2006/REC-xml-20060816/.
W3C (2006b). Namespaces in XML 1.0 (Second Edition). http://www.w3.org/TR/2006/REC-xml-names-20060816.
Footnotes
[1] This document is based on an original by Alan Sexton © 2007, modified by permission.