Introduction to XML

Peter Coxhead [1]

Contents

 1 Basic XML Syntax
 2 XML Processors
 3 Elements
 4 Attributes
 5 'Reference' Attributes
 6 Elements versus Attributes
 7 Well-formedness
 8 Escaping Characters
 9 White Space Handling
 10 Comments
 11 Processing Instructions
 12 XML Declaration
  References/Bibliography

1 Basic XML Syntax

XML is a precisely specified standard for the representation and storage of data. The standards document (http://www.w3.org/TR/REC-xml/) is not too difficult to read, but it is easy to get lost in its mass of technical detail and lose sight of the big picture. This document attempts to give a simplified, but still accurate, overview of XML.

XML is a text markup language, i.e. first and foremost an XML document consists of text. However, this text can be divided into two parts that are mixed together:

Markup consists of the various tags (including attributes), declarations, processing instructions, comments, character and entity references in the document (we will look at each of these later). Markup can be identified through its use of special characters, e.g. tags begin with < and end with >.

The rest of an XML document consists of character data. Normally the character data makes up the content of the document, and the markup gives structure and interpretation to the content. For example, the following is a very simple but complete XML document:[2]

<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
</employee>

Here the markup consists of <employee>, <firstname>, </firstname>, <familyname>, </familyname> and </employee>. Ignoring the white space, the character data consists of Joe and Bloggs. The purpose of the markup is presumably to show that Joe Bloggs is an employee, with the first name Joe and the family name Bloggs.

Character data can include any Unicode character. If Unicode is not supported by the system used to create the XML document, Unicode characters can still be included by making use of 'character references'. These have the form &#N;, where N is a decimal number, or &#xH;, where H is a hexadecimal number. So if an employee had the family name Hüber, the character ü could be represented as &#252; or &#xFC; (or &#xfc;, since capitalization is not significant in hexadecimal numbers). (A more readable alternative is discussed in "Valid XML and DTDs", §5.)
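As a quick check of the claim about character references, the sketch below parses the same name written three ways. It uses Python's standard xml.etree.ElementTree module as one example of an XML processor; that choice is an assumption of this sketch, not something the handout prescribes.

```python
import xml.etree.ElementTree as ET

# The same family name written three ways: with the literal character,
# with a decimal character reference, and with a hexadecimal one.
literal = ET.fromstring("<familyname>Hüber</familyname>")
decimal = ET.fromstring("<familyname>H&#252;ber</familyname>")
hexadecimal = ET.fromstring("<familyname>H&#xFC;ber</familyname>")

# The parser resolves the references, so all three hold identical text.
print(literal.text == decimal.text == hexadecimal.text)  # prints True
```

Once the document has been parsed, a program cannot tell which of the three spellings the author used.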

In principle an XML document could have no character data; its sole purpose could be to describe a structure using only markup. Although any Unicode character can be used in character data, it is best to use only US-ASCII characters in markup.

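The commonest component of markup is a tag. Tags consist of a name enclosed in angle brackets. The rules for valid names are similar to those in Java. In particular, names begin with a letter or an underscore, may continue with letters, digits, underscores, hyphens and full stops, cannot contain white space, and are case-sensitive. (Colons are also permitted, but are best avoided because of their special use in namespaces.)

A special rule is that names starting with the letter sequence XML, in any variation of upper and lower case letters, are reserved and should not be used.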

Tags are of two kinds: start tags and end tags. In an end tag, the name is preceded by /; in a start tag it isn't. Thus <firstname> is a start tag and </firstname> is an end tag. White space is allowed between the name and the closing >, but there must NOT be white space between the opening < and the name. Thus <firstname > and </firstname > are legal; < firstname> isn't.
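The white space rule for tags can be tried out with any XML parser. The following is a minimal sketch using Python's xml.etree.ElementTree; the helper name parses is invented for this illustration.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# White space before the closing > is allowed in both kinds of tag.
space_before_close = parses("<firstname >Joe</firstname >")
# White space between the opening < and the name is not.
space_after_open = parses("< firstname>Joe</firstname>")

print(space_before_close, space_after_open)  # prints True False
```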

2 XML Processors

Since it is fundamentally text-based, people can read and write XML (although writing it is easier if special editors are used). However, XML is primarily designed to be processed by appropriate computer programs, which will be called XML processors in what follows.

The XML standards describe how an XML processor must treat an XML document. All XML processors must check the syntax of the XML they process. A syntactically correct XML document is said to be 'well-formed' (see §7). Given a well-formed XML document, an XML processor is required to treat it in specific ways, most of which will be described in this and following handouts.

Opening an XML document in Firefox is a useful way to check whether it is well-formed.

XML processors may also be required to validate the XML document. This is discussed in "Valid XML and DTDs", §1.

3 Elements

The principal components of XML documents are elements. An element starts with a start tag, ends with an end tag and may contain character data and other elements between the start and end tags. The first element in the document is the root element.

Consider the example we looked at above:

<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
</employee>

This XML document contains one root element, which in turn contains two child elements. The root element has the start tag <employee> and the end tag </employee>. The names in the start tag and end tag of an element must be the same. The content of the root element consists of two further elements whose names are firstname and familyname. These two elements each have content which consists of character data.

It's possible to mix elements and character data in the content of an element (although this is generally best avoided, as will be discussed in "Valid XML and DTDs", §3.4). Thus we could add a comment about Joe Bloggs by introducing a comment element:

<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
  <comment>Due to retire soon.</comment>
</employee>

or by just including some character data in the employee element:

<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
  Due to retire soon.
</employee>

Elements can have no content. For example, if we have nothing to say about Ravi Patel we might write the XML:

<employee>
  <firstname>Ravi</firstname>
  <familyname>Patel</familyname>
  <comment></comment>
</employee>

There's a shorthand notation for the special (and quite common) case of an element having empty content. The start and end tags can be merged into a single empty tag. This is like an end tag but with the / after the name instead of before it, e.g. <comment/> (white space is only allowed between the name and the /). There's no difference in meaning between <comment></comment> and <comment/>; the two should be treated in exactly the same way by any XML processor.
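The equivalence of the two notations can be observed in a parser. This sketch assumes Python's xml.etree.ElementTree, in which an element with no character data has its text attribute set to None.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<comment></comment>")
short_form = ET.fromstring("<comment/>")
spaced_form = ET.fromstring("<comment />")  # white space before the /

for element in (long_form, short_form, spaced_form):
    # Each has the same name, no character data and no child elements.
    print(element.tag, element.text, len(element))
```

All three lines of output are the same, so a program working with the parsed document cannot distinguish the forms.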

The content of an element is an ORDERED list of sub-components, whether those components are elements or character data. Furthermore, this list can have duplicate elements. Consider the following example where multiple phone numbers are recorded for an employee:

<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
  <phone>123456</phone>
  <phone>222222</phone>
  <phone>198765</phone>
</employee>

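Note that the three phone elements share the same name yet remain separate items, and that the order in which they appear is part of the document.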

In programming terms, the content of an element is an ordered list. In Java, a LinkedList or an ArrayList would be the closest matching classes.
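The same point can be shown in Python, where a parsed element behaves much like a list of its child elements. This is a sketch using the standard xml.etree.ElementTree module, chosen here purely for illustration.

```python
import xml.etree.ElementTree as ET

employee = ET.fromstring("""<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
  <phone>123456</phone>
  <phone>222222</phone>
  <phone>198765</phone>
</employee>""")

# Iterating over an element yields its child elements in document order.
child_names = [child.tag for child in employee]

# Elements with the same name are all kept, again in document order.
phones = [phone.text for phone in employee.findall("phone")]

print(child_names)
print(phones)  # prints ['123456', '222222', '198765']
```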

4 Attributes

Elements can have attributes associated with them.

Attributes are associated with elements by listing the attributes, separated by white space, between the name and the closing > of the start tag, or between the name and the closing /> of the empty tag.

An attribute is a name/value pair. It associates a name with a value. Ignoring some issues about white space, the value is a string contained within single or double quotes. My convention is to use double quotes unless the string itself contains them.

The information about Joe Bloggs given in the XML above could alternatively be represented as:

<employee firstname="Joe" familyname="Bloggs">
  <phone number="123456" type="home"/>
  <phone number="222222" type="work"/>
  <phone number="198765" type="mobile"/>
</employee>

Whether or not this is a good representation is discussed in §6 below.

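The rules for the attributes associated with an element are very different from those for the content of an element. First, an element cannot have two attributes with the same name. Second, the order in which the attributes are written is not significant.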

The first rule means that in practice it probably isn't a good idea to represent an employee's first name as an attribute, since employees may have more than one first name. In principle it would be possible to use multiple attributes with names such as firstname1, firstname2, etc. This is very bad practice: the intrinsic ordering of an employee's first names is being encoded in the attribute names, rather than in the structure of the markup. Experience has shown that this leads to larger amounts of more complicated code when such XML has to be processed.

From a programming point of view, an attribute list corresponds best to a map or an associative array, which in Java might be implemented with a HashMap class.
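Python's xml.etree.ElementTree makes the map-like nature of attributes explicit by exposing them as a dictionary. The sketch below, which assumes that module, also shows that a repeated attribute name is rejected by the parser.

```python
import xml.etree.ElementTree as ET

phone = ET.fromstring('<phone number="123456" type="home"/>')
reordered = ET.fromstring('<phone type="home" number="123456"/>')

# Attributes are looked up by name, like entries in a map.
phone_type = phone.get("type")

# Writing the attributes in a different order makes no difference.
same_attributes = phone.attrib == reordered.attrib

# Two attributes with the same name are not well-formed XML.
try:
    ET.fromstring('<employee firstname="Joe" firstname="William"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(phone_type, same_attributes, duplicate_rejected)  # prints home True True
```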

5 'Reference' Attributes

One use of attributes is to enable one element to refer to another. This kind of structuring is the equivalent of HTML hypertext links within a document.

An element which is to be referenced needs to be given a special attribute which acts as an identifier. To make the reference unique, each identifier attribute must have a unique value. The value of an identifier attribute must be a valid name (see "Valid XML and DTDs", §4.5). It's often slightly annoying that an identifier value cannot be a number, since 'id numbers' are in common use. In an XML document, an initial letter must be added.

Once an element has a unique identifier, a reference to it can be added to another element as an identifier reference attribute. Thus we might represent a number of employees in the following way:

<employees>
  <employee id="e3256" manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <firstname>Ravi</firstname><familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>

Here the attribute id is used as an identifier, the attribute manager as an identifier reference.
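A program can follow such a reference by first building an index from identifier values to elements. The following Python sketch does this by hand; note that without a DTD the parser has no way of knowing that id is an identifier, so treating id and manager in this way is an assumption of the sketch, and the names by_id and manager_of are invented for it.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<employees>
  <employee id="e3256" manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <firstname>Ravi</firstname><familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>""")

# Index every employee element by the value of its id attribute.
by_id = {e.get("id"): e for e in doc.findall("employee")}

def manager_of(employee):
    """Return the element the manager attribute refers to, or None."""
    reference = employee.get("manager")
    return by_id.get(reference) if reference is not None else None

joe = by_id["e3256"]
print(manager_of(joe).findtext("familyname"))  # prints Patel
print(manager_of(by_id["e3257"]))              # prints None
```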

6 Elements versus Attributes

The same content can often be represented either as character data in elements or as values in attributes. As an extreme example, we could present the employees data from above as:

<employees>
  <employee id="e3256" manager="e3257"
            firstname="Joe"
            familyname="Bloggs"
            comment="Due to retire soon." />
  ...
</employees>

There is no algorithm for choosing between elements and attributes to represent information in XML, but there are some clear 'rules of thumb', which suggest that the representation above is a poor choice.

Don't be afraid of 'verbosity'. Whenever an element's character data has some inherent structure, the XML should clearly show it. For example, an employee's name could be represented as a single element, e.g.:

  <employee id="e3256" manager="e3257">
    <name>Joe Bloggs</name>
    <comment>Due to retire soon.</comment>
  </employee>

However, this means that an XML processor will have to parse the content of the name element in order to produce alphabetically sorted output. Since family names (surnames) can consist of multiple words in many cultures, such parsing is difficult, if not impossible.
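By contrast, when the first name and family name are separate elements, as in the earlier examples, sorting needs no parsing of name strings at all. The Python sketch below illustrates this; the third employee is hypothetical sample data, invented to show a multi-word family name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; "Ana de la Cruz" is invented for this sketch.
staff = ET.fromstring("""<employees>
  <employee><firstname>Ravi</firstname><familyname>Patel</familyname></employee>
  <employee><firstname>Ana</firstname><familyname>de la Cruz</familyname></employee>
  <employee><firstname>Joe</firstname><familyname>Bloggs</familyname></employee>
</employees>""")

def sort_key(employee):
    """Order by family name, then first name, ignoring case."""
    return (employee.findtext("familyname").casefold(),
            employee.findtext("firstname").casefold())

ordered = [e.findtext("familyname") for e in sorted(staff, key=sort_key)]
print(ordered)  # prints ['Bloggs', 'de la Cruz', 'Patel']
```

The multi-word family name sorts correctly because the markup, not the program, says where the family name begins.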

Harold (2003) suggests representing names in a format such as:

<name>
  <given>Joe</given>
  <middle>William</middle>
  <family>Bloggs</family>
</name>

What advantages or disadvantages might this have compared to the representation used in §5 above?

7 Well-formedness

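A document is only an XML document if it is well-formed (in effect syntactically correct). This means that it meets a basic set of restrictions. The most important are that there is exactly one root element, that every start tag has a matching end tag with the same name, that elements are properly nested rather than overlapping, that attribute values are enclosed in quotes, and that no attribute name is repeated within a tag.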

There is a strict rule about processing XML documents. When a document is found not to be well-formed, normal processing of the document should immediately stop. No attempt should be made to process any part of the document after the point at which the error is found. In particular, no attempt should be made to correct the error or guess what the author intended.
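This strictness is visible in practice. In the sketch below, which assumes Python's xml.etree.ElementTree, a mismatched end tag causes the parser to raise an error that records where it gave up, and no partial tree is handed back.

```python
import xml.etree.ElementTree as ET

# The end tag does not match the start tag, so this is not well-formed.
broken = "<employee>\n  <firstname>Joe</familyname>\n</employee>"

tree = None
try:
    tree = ET.fromstring(broken)
except ET.ParseError as err:
    line, column = err.position  # where the parser gave up
    print(f"not well-formed at line {line}: {err}")

# The parser stopped rather than guessing, so there is no tree to use.
print(tree)  # prints None
```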

This rule was decided upon in the light of the problems created by well-intentioned web browser developers when they tried to recover from errors in HTML code. They succeeded so well in their objective that browsers are very forgiving about all kinds of errors in HTML. The result has been that very little HTML on the web is free from errors because there has been no great incentive to get HTML right; if it looks right in the browser, then it's OK.

Although the developers' intentions may appear laudable, they have led to serious problems. Browsers diverged in how they handled what should have been incorrect HTML, so that standardization suffered. Developing a new browser is made immensely harder by the need to deal with legacy HTML, which has held up the development of better and more intelligent tools. The designers of XML wanted to ensure that XML tools would always be (relatively) easy to develop and therefore outlawed the practice of overlooking even minor errors in an XML document.

8 Escaping Characters

Five characters have special meanings in XML, and so cannot be included just anywhere in character data or attribute values. For example, < signals the start of markup, such as a tag or a comment. The solution is to use references to predefined entities. A reference to an entity consists of the entity's name enclosed in & and ;. The table below shows the five predefined entities in XML, in the form used to refer to them:

Character   Entity
<           &lt;
>           &gt;
&           &amp;
"           &quot;
'           &apos;

Some examples:

<module>
  <title>Information &amp; the Web</title>
</module>
 
<mailto>John Smith &lt;j.smith@somewhere.org&gt;</mailto>
 
<play title='Love&apos;s Labours Lost'/>

The entities &lt;, &gt; and &amp; must always be used.[3] The other two are only needed inside attribute values when the character they represent would otherwise terminate the value. The following is perfectly legal XML: [4]

<play title="Love's Labours Lost"/>
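When XML is generated by a program, escaping is best left to a library function rather than done by hand. As one illustration, Python's standard xml.sax.saxutils module provides escape for character data and quoteattr for attribute values; the sketch below applies them to the examples above.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, < and > in character data.
title = escape("Information & the Web")
mailto = escape("John Smith <j.smith@somewhere.org>")

# quoteattr() adds the surrounding quotes and escapes only what it must:
# an apostrophe needs no escaping inside double quotes.
play_title = quoteattr("Love's Labours Lost")

print(title)
print(mailto)
print("<play title=" + play_title + "/>")
```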

If there is a large chunk of character data (e.g. a Java language excerpt), it can be very tedious to escape each offending character, so XML provides a block-based escaping mechanism. A CDATA section can appear anywhere that character data can appear, but NOT in the value of an attribute. It is used to escape blocks of text that contain characters which would otherwise appear to be markup. A CDATA section starts with <![CDATA[ and ends with ]]>. Obviously, these character sequences were chosen to be very unlikely to appear in normal text. Within a CDATA section, any characters, other than the CDATA end string, can occur, including <, > and &. Thus the following are equivalent (ignoring white space issues -- see §9):

<java>
  boolean inRange(int n)
  { return (n &gt;= MIN &amp;&amp; n &lt;= MAX);
  }
</java>
 
<java>
  <![CDATA[
  boolean inRange(int n)
  { return (n >= MIN && n <= MAX);
  }
  ]]>
</java>
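The equivalence can be confirmed by parsing both versions and comparing the resulting character data. This sketch assumes Python's xml.etree.ElementTree and strips leading and trailing white space before comparing, since the two versions differ in that respect.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring("""<java>
  boolean inRange(int n)
  { return (n &gt;= MIN &amp;&amp; n &lt;= MAX);
  }
</java>""")

cdata = ET.fromstring("""<java>
  <![CDATA[
  boolean inRange(int n)
  { return (n >= MIN && n <= MAX);
  }
  ]]>
</java>""")

# After parsing, neither the entity references nor the CDATA markers
# remain: both elements simply contain the Java source text.
same_code = escaped.text.strip() == cdata.text.strip()
print(same_code)  # prints True
```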

9 White Space Handling

In XML, the only characters considered to be white space characters are the space, carriage return, linefeed and tab characters (Unicode #x20, #xD, #xA and #x9 respectively). Other characters, such as formfeeds, non-breaking spaces, etc., are treated just like any normal printing character.

White space inside tags has been discussed in §1.

In general, white space is considered to be significant in XML. For example, XML processors are not allowed to trim white space from the beginning or end of character data, nor to modify any white space within the data, but have to keep the white space exactly as it appears.

However, we often add white space to an XML document just to lay it out neatly and we don't intend it to be significant. For example, we probably intend the following two pieces of XML to have exactly the same meaning:

<module>
  <title>Information &amp; the Web</title>
</module>
 
<module><title>Information &amp; the Web</title></module>

However, inside an XML processor, they will be represented differently. The first will correspond to a tree something like that shown below. The elements (part of the markup) are represented by nodes in the tree. A special node, labelled #text, has been used here to represent character data. Centred dots represent spaces and \n represents the linefeed (newline) character.

module
  #text "\n··"
  title
    #text "Information & the Web"
  #text "\n"

XML documents created on different platforms are likely to have different line ending markers: Unix (including modern macOS) uses just #xA (linefeed); Windows uses the two-character sequence #xD #xA (return + linefeed); classic Mac OS, before OS X, used just #xD (return). XML processors are required to normalize line endings to the Unix standard, i.e. just linefeed (newline), represented by \n above.
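Line ending normalization can be observed directly. The sketch below, assuming Python's xml.etree.ElementTree, parses character data containing all three kinds of line ending; the element name text is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Windows (\r\n), return-only (\r) and Unix (\n) line endings in one element.
mixed = ET.fromstring("<text>one\r\ntwo\rthree\nfour</text>")

# The parser reports every line ending as a single linefeed.
print(repr(mixed.text))  # prints 'one\ntwo\nthree\nfour'
```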

The internal representation of the second piece of XML will lack the text nodes on either side of the title element:

module
  title
    #text "Information & the Web"

It's up to the XML processor to deal with any extra text nodes caused by white space (although we can add processing instructions to the XML document to give some information on what we expect it to do).
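The difference between the two internal representations can be inspected with a DOM-style parser. The following sketch uses Python's standard xml.dom.minidom, chosen only as an example; the helper describe is invented for the illustration.

```python
from xml.dom import minidom

laid_out = minidom.parseString(
    "<module>\n  <title>Information &amp; the Web</title>\n</module>")
compact = minidom.parseString(
    "<module><title>Information &amp; the Web</title></module>")

def describe(node):
    """Label a DOM node: the element name, or #text plus the quoted data."""
    if node.nodeType == node.TEXT_NODE:
        return "#text " + repr(node.data)
    return node.nodeName

laid_out_children = [describe(n) for n in laid_out.documentElement.childNodes]
compact_children = [describe(n) for n in compact.documentElement.childNodes]

print(laid_out_children)  # white space text nodes surround the title element
print(compact_children)   # prints ['title']
```

A program that only cares about the elements typically skips text nodes whose data consists entirely of white space.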

White space is handled differently in attribute values. First, every return, linefeed and tab character is replaced by a space character, so there will be the same number of white space characters, but all will be spaces. What happens next depends on something called the 'DTD', discussed in "Valid XML and DTDs". Multiple spaces may or may not be converted to a single space and spaces may or may not be trimmed from the start and end of the attribute value.

Thus these two pieces of XML are exactly equivalent regardless of the DTD:

<play title="Love's Labours Lost"/>
 
<play title="Love's
Labours
Lost"
/>

because the returns after Love's and Labours will be converted to spaces. On the other hand, this may or may not be equivalent, depending on the DTD, so is best avoided if at all possible: [5]

<play title=" Love's  Labours Lost "/>
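The treatment of white space in attribute values can be checked in the same way. In this sketch, which assumes Python's xml.etree.ElementTree, no DTD is involved, so the parser performs only the first step: line breaks become spaces, but multiple and surrounding spaces are left alone.

```python
import xml.etree.ElementTree as ET

one_line = ET.fromstring('<play title="Love\'s Labours Lost"/>')
multi_line = ET.fromstring('<play title="Love\'s\nLabours\nLost"\n/>')
padded = ET.fromstring('<play title=" Love\'s  Labours Lost "/>')

# Line breaks inside the value are replaced by spaces.
print(one_line.get("title") == multi_line.get("title"))  # prints True

# Without a DTD, extra spaces are kept exactly as written.
print(repr(padded.get("title")))
```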

10 Comments

Comments are written in XML by starting them with <!-- and ending them with -->. The string -- may not appear within a comment, i.e. the first occurrence of -- after the opening <!-- must be the beginning of the closing -->. In particular, this means that comments cannot be nested.

An unusual aspect of comments in XML, as compared to other languages, is that they are not automatically discarded or treated as white space, but rather are a proper part of the document. An XML processor is allowed (although not required) to process comments as well. However, designing an XML document to place significant information in comments in order to make use of this 'feature' would be extremely bad practice, particularly since the meaning of the document would depend on optional and hence non-predictable behaviour in the processing program.
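That the handling of comments is optional shows up when two processors are given the same document. In the sketch below, two modules from Python's standard library are used as examples: xml.dom.minidom keeps the comment as a node, while the default parser of xml.etree.ElementTree discards it.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

source = "<comment><!-- draft -->Due to retire soon.</comment>"

# minidom keeps the comment as a node in its own right.
first = minidom.parseString(source).documentElement.firstChild
kept = first.nodeType == first.COMMENT_NODE
print(kept, repr(first.data))  # prints True ' draft '

# ElementTree's default parser drops it and keeps only the character data.
text_only = ET.fromstring(source).text
print(text_only)  # prints Due to retire soon.
```

Information hidden in the comment would therefore reach a program built on one library but not the other, which is exactly why relying on it is bad practice.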

11 Processing Instructions

Sometimes, some information about a document is not really part of the document's content or structure, but rather is information intended to tell other tools how to process the document. An example of this might be specifying a line width parameter to a tool which printed XML documents. Processing instructions are provided for this purpose.

Processing instructions begin with <? and end with ?>. One use is to specify a 'style sheet' to be used in displaying the XML document:

<?xml-stylesheet type="text/css" href="mystyle.css"?>

Although processing instructions look as if they have attributes, after the name of the processing instruction, the rest of the content up to the closing ?> is quite arbitrary. (The name of a processing instruction cannot be xml in any combination of lower and upper case letters.) An XML processor can either ignore any processing instructions or have an independent method of interpreting and obeying them. In other words, an XML document will be well-formed regardless of the content of a processing instruction.

Processing instructions can appear throughout the document anywhere that a comment can appear, including before the root element begins or after it ends.
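A parser hands a processing instruction to the program as a name plus an uninterpreted string. The sketch below uses Python's xml.dom.minidom to show this; print-width is a hypothetical instruction name invented here, along the lines of the line width example above.

```python
from xml.dom import minidom

# A style sheet instruction, as in the text, followed by a free-form one.
# "print-width" is a hypothetical target invented for this sketch.
source = ('<?xml-stylesheet type="text/css" href="mystyle.css"?>'
          '<?print-width 72 columns?>'
          '<employee/>')

dom = minidom.parseString(source)
instructions = [(node.target, node.data) for node in dom.childNodes
                if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

for target, data in instructions:
    # data is a plain string; what look like attributes are not parsed.
    print(target, "->", data)
```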

12 XML Declaration

All XML documents should have an XML declaration, although it's not an error if they don't. An XML declaration must begin at the first character in the document, i.e. no white space, comments or anything else can occur before it. A sample XML declaration is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

An XML declaration looks like a processing instruction, and has the same syntax and purpose, namely to give extra information about the XML document and how it should be processed. Strictly it's not a processing instruction, because it begins with the special reserved name xml. As with processing instructions, what look like attributes are technically not.

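The encoding and standalone parts are optional. If encoding is omitted, an XML processor assumes a Unicode encoding (UTF-8 unless the start of the file indicates UTF-16), and standalone="no" is the value assumed where it matters. Thus the usual XML declaration that should appear at the start of every XML document is just: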

<?xml version="1.0"?>

This makes it clear both to a human reader and to an XML processor that the document is indeed meant to be XML.
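The rules about where the declaration may appear can be tested with a parser. The following is a small sketch assuming Python's xml.etree.ElementTree; the helper name accepted is invented for the illustration.

```python
import xml.etree.ElementTree as ET

def accepted(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A declaration at the very start of the document is fine.
with_declaration = accepted('<?xml version="1.0"?>\n<employee/>')
# Leaving the declaration out is not an error.
without_declaration = accepted('<employee/>')
# Even a single space before the declaration is an error.
late_declaration = accepted(' <?xml version="1.0"?>\n<employee/>')

print(with_declaration, without_declaration, late_declaration)  # prints True True False
```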

References/Bibliography

See the other handouts for this module, which are available online.

Harold, Elliotte Rusty (2003). Effective XML: 50 Specific Ways to Improve Your XML. Addison Wesley. ISBN 0-321-15040-6.

W3C (2006). The Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C. http://www.w3.org/TR/2006/REC-xml-20060816/.

Footnotes

[1] This document is based on an original by Alan Sexton © 2007, modified by permission.

[2] Although it's good practice to have an XML declaration; see §12.

[3] This is a slight oversimplification with respect to the 'angle bracket' characters, but is a good rule to follow.

[4] However, since the name of a play is likely to be data rather than meta-data, it's almost certainly poor style.

[5] If attributes are only used for meta-data, it's unlikely that white space will be needed in the first place.
