Introduction to XML
Peter Coxhead [1]
Contents
1 Basic XML Syntax
2 XML Processors
3 Elements
4 Attributes
5 'Reference' Attributes
6 Elements versus Attributes
7 Well-formedness
8 Escaping Characters
9 White Space Handling
10 Comments
11 Processing Instructions
12 XML Declaration
References/Bibliography
1 Basic XML Syntax
XML is a precisely specified standard for the representation and storage of data. The standards document (http://www.w3.org/TR/REC-xml/) is not too difficult to read and understand, but it is very easy to get lost in its mass of technical detail and lose sight of the big picture of what is going on. This document attempts to give a simplified, but still accurate, overview of XML.
XML is a text markup language, i.e. first and foremost an XML document consists of text. However, this text can be divided into two parts that are mixed together:
- markup
- character data.
Markup consists of the various tags (including
attributes), declarations, processing instructions, comments, character and
entity references in the document (we will look at each of these later). Markup
can be identified through its use of special characters, e.g. tags begin with
< and end with >.
The rest of an XML document consists of character data. Normally the character data makes up the content of the document, and the markup gives structure and interpretation to the content. For example, the following is a very simple but complete XML document:[2]
<employee>
  <firstname>Joe</firstname>
  <familyname>Bloggs</familyname>
</employee>
Here the markup consists of <employee>,
<firstname>, </firstname>,
<familyname>, </familyname> and
</employee>. Ignoring the white
space, the character data consists of Joe and
Bloggs. The purpose of the markup is presumably to show that Joe
Bloggs is an employee, with the first name Joe and the family name Bloggs.
Character data can include any Unicode character. If Unicode is not supported by the system used to create the XML document, Unicode characters can still be included by making use of 'character references'. These have the form &#N;, where N is a decimal number, or &#xH;, where H is a hexadecimal number. So if an employee had the family name Hüber, the character ü could be represented as &#252; or &#xfc; (or &#xFC;, since capitalization is not significant in hexadecimal numbers). (A more readable alternative is discussed in "Valid
XML and DTDs", §5.)
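The character reference forms can be checked with a short sketch. It uses Python's standard `xml.etree.ElementTree` module as the XML processor; that choice is an assumption made for illustration, not something this handout prescribes.

```python
import xml.etree.ElementTree as ET

# The same family name written with a decimal reference and with
# hexadecimal references in lower and upper case.
decimal = ET.fromstring("<familyname>H&#252;ber</familyname>")
hex_lower = ET.fromstring("<familyname>H&#xfc;ber</familyname>")
hex_upper = ET.fromstring("<familyname>H&#xFC;ber</familyname>")

# The parser resolves each reference to the single character U+00FC.
names = [decimal.text, hex_lower.text, hex_upper.text]
print(all(name == "H\u00fcber" for name in names))
```

All three spellings produce the same five-character string, so the choice between them is purely a matter of convenience for the author.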
In principle an XML document could have no character data; its sole purpose could be to describe a structure using only markup. Although any Unicode character can be used in character data, it is best to use only US-ASCII characters in markup.
The commonest component of markup is a tag. Tags consist of a name enclosed in angle brackets. The rules for valid names are similar to those in Java. In particular, names:
- must begin with a letter or an underscore (_)
- may then contain any number of letters or numbers (plus some other characters such as - and _)
- are case-sensitive.
A special rule is that names beginning with the letter sequence
XML, in any variation of upper and lower case letters, are reserved
and should not be used.
Tags are of two kinds: start tags and
end tags. In an end tag, the name is preceded by
/; in a start tag it isn't. Thus <firstname> is
a start tag and </firstname> is an end tag.
It's allowed to have white
space between the name and the closing >, but there must
NOT be white space between the opening
< and the name. Thus <firstname > and
</firstname > are legal; < firstname>
isn't.
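The rules for names and for white space inside tags can be tried out with a small sketch. The helper `is_well_formed` is a hypothetical name introduced here, and Python's `xml.etree.ElementTree` stands in for an XML processor.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# White space before the closing > is allowed.
print(is_well_formed("<firstname >Joe</firstname >"))
# White space between < and the name is not.
print(is_well_formed("< firstname>Joe</firstname>"))
# Names are case-sensitive, so these two tags do not match.
print(is_well_formed("<firstname>Joe</Firstname>"))
# A name may start with an underscore but not with a digit.
print(is_well_formed("<_name/>"), is_well_formed("<1name/>"))
```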
2 XML Processors
Since it is fundamentally text-based, people can read and write XML (although writing it is easier if special editors are used). However, XML is primarily designed to be processed by appropriate computer programs, which will be called XML processors in what follows.
The XML standards describe how an XML processor must treat an XML document. All XML processors must check the syntax of the XML they process. A syntactically correct XML document is said to be 'well-formed' (see §7). Given a well-formed XML document, an XML processor is required to treat it in specific ways, most of which will be described in this and following handouts.
Opening an XML document in Firefox is a useful way to check whether it is well-formed.
XML Processors may also be required to validate the XML document. This is discussed in "Valid XML and DTDs", §1.
3 Elements
The principal components of XML documents are elements. An element starts with a start tag, ends with an end tag and may contain character data and other elements between the start and end tags. The first element in the document is the root element.
Consider the example we looked at above:
<employee><firstname>Joe</firstname><familyname>Bloggs</familyname></employee>
This XML document contains one root element, which in turn contains two child
elements. The root element has the start tag <employee> and
the end tag </employee>. The names in the start tag and end
tag of an element must be the same. The content of the root element consists of
two further elements whose names are firstname and
familyname. These two elements each have content which consists of
character data.
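As a rough illustration of how a processor sees this structure, the sketch below parses the same document with Python's `xml.etree.ElementTree` (an assumed choice of processor) and walks from the root element to its children.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<employee><firstname>Joe</firstname>"
    "<familyname>Bloggs</familyname></employee>"
)

# The root element is the one that encloses everything else.
print(root.tag)

# Its content is two child elements, each holding character data.
children = [(child.tag, child.text) for child in root]
print(children)
```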
It's possible to mix elements and character data in the content of an element
(although this is generally best avoided, as will be discussed in
"Valid XML and DTDs", §3.4). Thus
we could add a comment about Joe Bloggs by introducing a comment
element:
<employee><firstname>Joe</firstname><familyname>Bloggs</familyname><comment>Due to retire soon.</comment></employee>
or by just including some character data in the employee
element:
<employee><firstname>Joe</firstname><familyname>Bloggs</familyname>Due to retire soon.</employee>
Elements can have no content. For example, if we have nothing to say about Ravi Patel we might write the XML:
<employee><firstname>Ravi</firstname><familyname>Patel</familyname><comment></comment></employee>
There's a shorthand notation for the special (and quite common) case of an
element having empty content. The start and end tags can be merged into a single
empty tag. This is like an end tag but with the
/ after the name instead of before it, e.g. <comment/>
(white space is only allowed between the name
and the /). There's no difference in meaning between
<comment></comment> and <comment/>;
the two should be treated in exactly the same way by any
XML processor.
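The equivalence of the two notations can be seen in a short sketch, again assuming Python's `xml.etree.ElementTree` as the processor.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<comment></comment>")
short_form = ET.fromstring("<comment/>")

# Both forms parse to an element with no text and no child elements.
print(long_form.text, len(long_form))
print(short_form.text, len(short_form))

# Serializing the two elements gives identical output.
print(ET.tostring(long_form) == ET.tostring(short_form))
```

Once parsed, nothing records which notation the author used.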
The content of an element is an ORDERED list of sub-components, whether those components are elements or character data. Furthermore, this list can have duplicate elements. Consider the following example where multiple phone numbers are recorded for an employee:
<employee><firstname>Joe</firstname><familyname>Bloggs</familyname><phone>123456</phone><phone>222222</phone><phone>198765</phone></employee>
Note that:
- There may be any number of sub-elements of the same name in the content of an element. This is true even if the content of those sub-elements is the same (or even if they are all empty).
- The order of sub-elements may be significant; e.g. perhaps the intention behind the XML above is that when trying to contact someone, the numbers should always be tried in the given order. All XML processors must preserve the order of elements in the document.
In programming terms, the content of an element is an ordered list. In Java,
a LinkedList or an ArrayList would be the closest
matching classes.
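The same idea can be sketched in Python, where an element parsed with `xml.etree.ElementTree` (an assumed choice of processor) behaves like an ordered list of its child elements.

```python
import xml.etree.ElementTree as ET

employee = ET.fromstring(
    "<employee><firstname>Joe</firstname><familyname>Bloggs</familyname>"
    "<phone>123456</phone><phone>222222</phone><phone>198765</phone></employee>"
)

# Iterating over an element yields its child elements in document order.
tags = [child.tag for child in employee]
print(tags)

# Repeated sub-elements are all kept, in the order they were written.
phones = [phone.text for phone in employee.findall("phone")]
print(phones)
```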
4 Attributes
Elements can have attributes associated with them.
Attributes are associated with elements by listing the attributes, separated
by white space, between the name and the closing
> of the start tag, or between the name and the closing
/> of the empty tag.
An attribute is a name/value pair. It associates a name with a value. Ignoring some issues about white space, the value is a string contained within single or double quotes. My convention is to use double quotes unless the string itself contains them.
The information about Joe Bloggs given in the XML above could alternatively be represented as:
<employee firstname="Joe" familyname="Bloggs"><phone number="123456" type="home"/><phone number="222222" type="work"/><phone number="198765" type="mobile"/></employee>
Whether or not this is a good representation is discussed in §6 below.
The rules for the attributes associated with an element are very different from those for the content of an element:
- There can only be ONE attribute with a given name associated with an element. Thus I had to use elements for the phone numbers above. It's not possible to give an employee element multiple phone numbers by giving it multiple phone attributes.
- The order of attributes is NOT significant. No meaning can be attached to the order in which attributes occur. Any XML processor is free to rearrange the order of attributes in any way it chooses.
The first rule means that in practice it probably isn't a good idea to
represent an employee's first name as an attribute, since employees may have
more than one first name. In principle it would be possible to use multiple
attributes with names such as firstname1, firstname2,
etc. This is very bad practice: the intrinsic ordering of an employee's first
names is being encoded in the attribute names, rather than in the structure of
the markup. Experience has shown that this leads to larger amounts of more
complicated code when such XML has to be processed.
From a programming point of view, an attribute list corresponds best to a map
or an associative array, which in Java might be implemented with a
HashMap class.
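In Python the corresponding structure is a dictionary. The sketch below assumes `xml.etree.ElementTree` as the processor and shows both rules: order does not matter, and a repeated attribute name is rejected.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<employee firstname="Joe" familyname="Bloggs"/>')
second = ET.fromstring('<employee familyname="Bloggs" firstname="Joe"/>')

# The attribute list is exposed as a dictionary of name/value pairs.
print(first.attrib)

# Writing the attributes in a different order makes no difference.
print(first.attrib == second.attrib)

# Two attributes with the same name make the document ill-formed.
try:
    ET.fromstring('<employee phone="123456" phone="222222"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```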
5 'Reference' Attributes
One use of attributes is to enable one element to refer to another. This kind of structuring is the equivalent of HTML hypertext links within a document.
An element which is to be referenced needs to be given a special attribute which acts as an identifier. To make the reference unique, each identifier attribute must have a unique value. The value of an identifier attribute must be a valid name (see "Valid XML and DTDs", §4.5). It's often slightly annoying that an identifier value cannot be a number, since 'id numbers' are in common use. In an XML document, an initial letter must be added.
Once an element has a unique identifier, a reference to it can be added to another element as an identifier reference attribute. Thus we might represent a number of employees in the following way:
<employees><employee id="e3256" manager="e3257"><firstname>Joe</firstname><familyname>Bloggs</familyname><comment>Due to retire soon.</comment></employee><employee id="e3257"><firstname>Ravi</firstname><familyname>Patel</familyname><comment/></employee></employees>
Here the attribute id is used as an identifier, the attribute
manager as an identifier reference.
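A processor can follow such a reference by building an index from identifier values to elements. The sketch below is one possible way to do this in Python with `xml.etree.ElementTree`; note that without a DTD the parser treats id and manager as ordinary attributes, so the lookup table `by_id` (a name chosen here) has to be built by hand.

```python
import xml.etree.ElementTree as ET

employees = ET.fromstring(
    '<employees>'
    '<employee id="e3256" manager="e3257">'
    '<firstname>Joe</firstname><familyname>Bloggs</familyname>'
    '<comment>Due to retire soon.</comment></employee>'
    '<employee id="e3257">'
    '<firstname>Ravi</firstname><familyname>Patel</familyname>'
    '<comment/></employee>'
    '</employees>'
)

# Index every employee element by the value of its id attribute.
by_id = {e.get("id"): e for e in employees.iter("employee")}

# Follow the manager reference from Joe Bloggs to the referenced element.
joe = by_id["e3256"]
manager = by_id[joe.get("manager")]
print(manager.findtext("firstname"), manager.findtext("familyname"))
```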
6 Elements versus Attributes
The same content can often be represented either as character data in elements or as values in attributes. As an extreme example, we could present the employees data from above as:
<employees><employee id="e3256" manager="e3257" firstname="Joe" familyname="Bloggs" comment="Due to retire soon." />...</employees>
There is no algorithm for choosing between elements and attributes to represent information in XML, but there are some clear 'rules of thumb', which suggest that the representation above is a poor choice.
- Elements should be the default for representing data, i.e. the content itself. Unlike attributes, they represent structure straightforwardly (as trees of elements and sub-elements), preserving order and allowing multiple occurrences. Because of these properties, it is easier to extend an element if requirements change.
- Attributes should be the default for representing 'meta-data', i.e. data about the content. For example, we might want to record the language in which some text is written. An attribute is the natural choice. Typically, attributes are not intended to be presented to someone interested in the content represented by the XML document; rather they record information about that content to be used by the XML processor.
- Attributes should be used to create cross-references as in §5 above, even if the reference values are real data (such as employee IDs). This enables an XML processor to check the validity of reference values and cross-references (see "Valid XML and DTDs", §4.5-§4.7).
Don't be afraid of 'verbosity'. Whenever an element's character data has some inherent structure, the XML should clearly show it. For example, an employee's name could be represented as a single element, e.g.:
<employee id="e3256" manager="e3257"><name>Joe Bloggs</name><comment>Due to retire soon.</comment></employee>
However, this means that an XML processor will have to parse the content of
the name element in order to produce alphabetically sorted output.
Since family names (surnames) can consist of multiple words in many cultures,
such parsing is difficult, if not impossible.
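By contrast, when the family name has its own element, sorting needs no parsing of names at all. The sketch below assumes Python's `xml.etree.ElementTree`; the third employee is hypothetical, added only to show a multi-word family name.

```python
import xml.etree.ElementTree as ET

# Maria de la Cruz is a made-up employee used for illustration.
employees = ET.fromstring(
    "<employees>"
    "<employee><firstname>Ravi</firstname><familyname>Patel</familyname></employee>"
    "<employee><firstname>Maria</firstname><familyname>de la Cruz</familyname></employee>"
    "<employee><firstname>Joe</firstname><familyname>Bloggs</familyname></employee>"
    "</employees>"
)


def family_name(employee):
    """Sort key: the family name, compared without regard to case."""
    return employee.findtext("familyname").casefold()


ordered = [e.findtext("familyname") for e in sorted(employees, key=family_name)]
print(ordered)
```

The multi-word name is handled correctly because the markup, not the code, says where the family name starts.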
Harold (2003) suggests representing names in a format such as:
<name><given>Joe</given><middle>William</middle><family>Bloggs</family></name>
What advantages or disadvantages might this have compared to the representation used in §5 above?
7 Well-formedness
A document is only an XML document if it is well-formed (in effect syntactically correct). This means that it meets a basic set of restrictions. The most important are that:
- There is only ONE root element.
- The elements are properly nested (i.e. the start and end tags of an element do not cross the boundary of another element).
- There are no duplicate attribute names in the list of an element's attributes.
- Elements and attributes have properly formed names.
There is a strict rule about processing XML documents. When processing any document which is found not to be well-formed, the processing of the document should immediately stop. No attempt should be made to process any part of the document after the point at which the error is found. Especially, no attempt should be made to try to correct the error or guess what the author intended.
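The sketch below feeds one small document per restriction to Python's `xml.etree.ElementTree` parser (an assumed choice of processor). In each case the parser raises an error at the point of failure instead of returning a partial result.

```python
import xml.etree.ElementTree as ET

# One deliberately broken document for each restriction listed above.
violations = {
    "two root elements": "<a/><b/>",
    "badly nested elements": "<a><b></a></b>",
    "duplicate attribute name": '<a x="1" x="2"/>',
    "malformed name": "<1a/>",
}

results = {}
for label, text in violations.items():
    try:
        ET.fromstring(text)
        results[label] = "accepted"
    except ET.ParseError:
        # Parsing stops here; no tree is produced for this document.
        results[label] = "rejected"
    print(label, "->", results[label])
```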
This rule was decided upon in the light of the problems created by well-intentioned web browser developers when they tried to recover from errors in HTML code. They succeeded so well in their objective that browsers are very forgiving about all kinds of errors in HTML. The result has been that very little HTML on the web is free from errors because there has been no great incentive to get HTML right; if it looks right in the browser, then it's OK.
Although the developers' intentions may appear laudable, they have led to serious problems. Browsers diverged in how they handled what should have been incorrect HTML, so that standardization suffered. Developing a new browser is made immensely harder by the need to deal with legacy HTML, which has held up the development of better and more intelligent tools. The designers of XML wanted to ensure that XML tools would always be (relatively) easy to develop and therefore outlawed the practice of overlooking even minor errors in an XML document.
8 Escaping Characters
Five characters have special meanings in XML, and so cannot be included just
anywhere in character data or attribute values. For example, <
signals the start of markup, such as a tag or a comment. The solution is to use
predefined entities. An
entity consists of a name enclosed in
& and ;. The table below shows the five predefined
entities in XML:
| Character | Entity |
|---|---|
| < | &lt; |
| > | &gt; |
| & | &amp; |
| " | &quot; |
| ' | &apos; |
Some examples:
<module><title>Information &amp; the Web</title></module>
<mailto>John Smith &lt;j.smith@somewhere.org&gt;</mailto>
<play title='Love&apos;s Labours Lost'/>
The entities &lt;, &gt; and
&amp; must always be used.[3]
The other two are only needed inside attribute values when the character they
represent would otherwise terminate the value. The following is perfectly legal
XML: [4]
<play title="Love's Labours Lost"/>
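When XML is generated by a program, escaping is usually left to a library routine. As a sketch, Python's standard `xml.sax.saxutils` module provides `escape` for character data and `quoteattr` for attribute values; using Python here is an assumption made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

title = "Information & the Web"
sender = "John Smith <j.smith@somewhere.org>"

# escape() replaces &, < and > with the predefined entities.
print(escape(title))
print(escape(sender))

# quoteattr() adds the surrounding quotes, choosing double quotes here
# so that the apostrophe needs no escaping.
print(quoteattr("Love's Labours Lost"))

# Parsing the escaped text gives back the original characters.
parsed = ET.fromstring("<title>" + escape(title) + "</title>")
print(parsed.text == title)
```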
If there is a large chunk of character data (e.g. a Java language excerpt),
it can be very tedious to escape each offending character, so a
block-based escaping mechanism is provided in XML. A CDATA section
can appear anywhere that character data can appear, but
NOT in the value of an attribute. It is used to
escape blocks of text that contain characters which would otherwise appear to be
markup. A CDATA section starts with <![CDATA[ and
ends with ]]>. Obviously, these character sequences were chosen
to be very unlikely to appear in normal text. Within a CDATA
section, any characters, other than the CDATA end string, can
occur, including <, > and &.
Thus the following are equivalent (ignoring white space issues -- see
§9):
<java>boolean inRange(int n){ return (n &gt;= MIN &amp;&amp; n &lt;= MAX);}</java>

<java><![CDATA[boolean inRange(int n){ return (n >= MIN && n <= MAX);}]]></java>
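The equivalence can be confirmed by parsing an entity-escaped version and a CDATA version of the same excerpt and comparing the character data. This is a sketch that assumes Python's `xml.etree.ElementTree` as the processor.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring(
    "<java>boolean inRange(int n){ return "
    "(n &gt;= MIN &amp;&amp; n &lt;= MAX);}</java>"
)
cdata = ET.fromstring(
    "<java><![CDATA[boolean inRange(int n){ return "
    "(n >= MIN && n <= MAX);}]]></java>"
)

# Both documents carry exactly the same character data.
print(escaped.text)
print(escaped.text == cdata.text)
```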
9 White Space Handling
In XML, the only characters considered to be white space characters are the space, carriage return, linefeed and tab characters (Unicode #x20, #xD, #xA and #x9 respectively). Other characters, such as non-breaking spaces, are treated just like any normal printing character. (Formfeeds and most other control characters are not legal characters in XML 1.0 at all.)
White space inside tags has been discussed in §1.
In general, white space is considered to be significant in XML. For example, XML processors are not allowed to trim white space from the beginning or end of character data, nor to modify any white space within the data, but have to keep the white space exactly as it appears.
However, we often add white space to an XML document just to lay it out neatly and we don't intend it to be significant. For example, we probably intend the following two pieces of XML to have exactly the same meaning:
<module>
  <title>Information &amp; the Web</title>
</module>

<module><title>Information &amp; the Web</title></module>
However, inside an XML processor, they will be represented differently. The
first will correspond to a tree something like that shown below. The elements
(part of the markup) are represented by nodes in the tree. A special node,
labelled #text, has been used here to represent character data.
Centred dots represent spaces and \n represents the linefeed
(newline) character.

XML documents created on different platforms are likely to have different
line ending markers: unix uses just #xA (linefeed); Windows uses the
two-character sequence #xD #xA (return + linefeed); classic Mac OS (before
Mac OS X) used just #xD (return). XML processors are required to normalize
line endings to the unix standard, i.e. just linefeed (newline), represented by
\n above.
The internal representation of the second piece of XML will lack the text
nodes on either side of the title element:

It's up to the XML processor to deal with any extra text nodes caused by white space (although we can add processing instructions to the XML document to give some information on what we expect it to do).
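The two internal representations can be inspected directly. The sketch below assumes Python's `xml.dom.minidom`, which builds a tree of nodes and names text nodes #text; the helper `child_summary` is a hypothetical name introduced here.

```python
from xml.dom import minidom

laid_out = minidom.parseString(
    "<module>\n  <title>Information &amp; the Web</title>\n</module>"
)
compact = minidom.parseString(
    "<module><title>Information &amp; the Web</title></module>"
)


def child_summary(element):
    """List each child node as (node name, text value or None)."""
    return [(node.nodeName, node.nodeValue) for node in element.childNodes]


# The laid-out version has white-space text nodes around the title element.
print(child_summary(laid_out.documentElement))
# The compact version has only the title element.
print(child_summary(compact.documentElement))

# Windows and old Mac line endings are both normalized to a linefeed.
mixed = minidom.parseString("<line>one\r\ntwo\rthree</line>")
print(repr(mixed.documentElement.firstChild.data))
```

A program that only wants the elements has to skip the #text nodes itself, for example by testing whether a node's text is empty once stripped.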
White space is handled differently in attribute values. First, every return, linefeed and tab character is replaced by a space character, so there will be the same number of white space characters, but all will be spaces. What happens next depends on something called the 'DTD', discussed in "Valid XML and DTDs". Multiple spaces may or may not be converted to a single space and spaces may or may not be trimmed from the start and end of the attribute value.
Thus these two pieces of XML are exactly equivalent regardless of the DTD:
<play title="Love's Labours Lost"/>

<play title="Love's
Labours
Lost"/>
because the returns after Love's and Labours will
be converted to spaces. On the other hand, this may or may not be equivalent,
depending on the DTD, so is best avoided if at all possible:
[5]
<play title=" Love's Labours Lost "/>
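Attribute value normalization can be observed in a short sketch, assuming Python's `xml.etree.ElementTree` as the processor. No DTD is involved here, so only the first step (replacing each line break with a space) takes place.

```python
import xml.etree.ElementTree as ET

one_line = ET.fromstring('<play title="Love\'s Labours Lost"/>')
three_lines = ET.fromstring('<play title="Love\'s\nLabours\nLost"/>')

# Each linefeed inside the attribute value becomes a single space.
print(repr(three_lines.get("title")))
print(one_line.get("title") == three_lines.get("title"))

# Without a DTD, leading and trailing spaces are kept as written.
padded = ET.fromstring('<play title=" Love\'s Labours Lost "/>')
print(repr(padded.get("title")))
```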
10 Comments
Comments are written in XML by starting them with <!-- and
ending them with -->. The string -- may not appear
within a comment, i.e. the first occurrence of -- after the opening
<!-- must be the beginning of the closing -->.
In particular, this means that comments cannot be nested.
An unusual aspect of comments in XML, as compared to other languages, is that they are not automatically discarded or treated as white space, but rather are a proper part of the document. An XML processor is allowed (although not required) to process comments as well. However, designing an XML document to place significant information in comments in order to make use of this 'feature' would be extremely bad practice, particularly since the meaning of the document would depend on optional and hence non-predictable behaviour in the processing program.
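The optional nature of comment handling shows up in Python's standard library, used here as an assumed example: `xml.dom.minidom` keeps comments as nodes in the tree, whereas `xml.etree.ElementTree` discards them by default.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

source = (
    "<employee><!-- Due to retire soon. -->"
    "<firstname>Joe</firstname></employee>"
)

# minidom keeps the comment as the first child of the employee element.
first = minidom.parseString(source).documentElement.firstChild
print(first.nodeType == first.COMMENT_NODE, repr(first.data))

# ElementTree's default parser silently drops it.
tags = [child.tag for child in ET.fromstring(source)]
print(tags)

# The string -- inside a comment makes the document ill-formed.
try:
    ET.fromstring("<employee><!-- one -- two --></employee>")
    double_hyphen_rejected = False
except ET.ParseError:
    double_hyphen_rejected = True
print(double_hyphen_rejected)
```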
11 Processing Instructions
Sometimes, some information about a document is not really part of the document's content or structure, but rather is information intended to tell other tools how to process the document. An example of this might be specifying a line width parameter to a tool which printed XML documents. Processing instructions are provided for this purpose.
Processing instructions begin with <? and end with
?>. One use is to specify a 'style sheet' to be used in
displaying the XML document:
<?xml-stylesheet type="text/css" href="mystyle.css"?>
Although processing instructions look as if they have attributes, after the
name of the processing instruction, the rest of the content up to the closing
?> is quite arbitrary. (The name of a processing instruction
cannot be xml in any combination of lower and upper case letters.)
An XML processor can either ignore any processing
instructions or have an independent method of interpreting and obeying them. In
other words, an XML document will be well-formed regardless of the content of a
processing instruction.
Processing instructions can appear throughout the document anywhere that a comment can appear, including before the root element begins or after it ends.
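As a sketch of how a processor exposes a processing instruction, the code below parses the style sheet example with Python's `xml.dom.minidom` (an assumed choice); the module root element is borrowed from the earlier examples. The parser hands over a name (the target) and the rest as one uninterpreted string.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="mystyle.css"?><module/>'
)

# The processing instruction appears as a node before the root element.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)

# The target is the name; the data is plain text, not a set of attributes.
print(pi.target)
print(pi.data)
```

Any program that wants the type or href values has to pick them out of the data string itself.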
12 XML Declaration
All XML documents should have an XML declaration, although it's not an error if they don't. An XML declaration must begin at the first character in the document, i.e. no white space, comments or anything else can occur before it. A sample XML declaration is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
An XML declaration looks like a processing instruction, and has the same
syntax and purpose, namely to give extra information about the XML document and
how it should be processed. Strictly it's not a processing instruction, because
it begins with the special reserved name xml. As with processing
instructions, what look like attributes are technically not.
- The version information is required, and identifies (unsurprisingly) the version of XML that is used in the document. Currently, the only versions available are 1.0 and 1.1. There is almost no reason ever to use any version other than 1.0. (The problems that come with the change to version 1.1 are significant and are discussed in detail by Harold (2003).)
- The encoding declaration is optional. It specifies the way that characters are encoded in the document. Character coding is a complex subject, and unless you fully understand it, it's best to leave out this declaration. The default is then UTF-8 (one of the standards for encoding Unicode characters). Since US-ASCII (with 128 characters) is a subset of UTF-8, the default is correct for any document that contains only US-ASCII characters. This is not true of ISO-8859-1 (with 256 characters): its characters above 127 are encoded differently in UTF-8, so a document saved in ISO-8859-1 that uses them needs encoding="ISO-8859-1". Note that any Unicode character can always be entered via the character reference mechanism.
- The standalone declaration is optional. It indicates whether the document can be fully processed by itself or whether some external document declared in this XML document must also be processed. The default is no, which is almost always what is needed. See the XML 1.0 standard (W3C 2006) for further details.
Thus the usual XML declaration that should appear at the start of every XML document is just:
<?xml version="1.0"?>
This makes it clear both to a human reader and to an XML processor that the document is indeed meant to be XML.
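Two effects of the declaration can be seen in a sketch that assumes Python's `xml.etree.ElementTree` as the processor: the declaration is only accepted at the very start of the document, and the encoding declaration tells the parser how to turn bytes into characters. The helper `parses` is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if data is accepted as a well-formed document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# The declaration must be the very first thing in the document.
print(parses(b'<?xml version="1.0"?><employee/>'))
print(parses(b' <?xml version="1.0"?><employee/>'))

# With an encoding declaration, ISO-8859-1 bytes are decoded correctly:
# the byte 0xFC is the character U+00FC.
declared = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<familyname>H\xfcber</familyname>"
)
decoded = ET.fromstring(declared).text
print(decoded == "H\u00fcber")

# Without one, the parser assumes UTF-8, where a lone 0xFC byte is invalid.
print(parses(b"<familyname>H\xfcber</familyname>"))
```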
References/Bibliography
See the other handouts for this module, which are available online.
Harold, Elliotte Rusty (2003). Effective XML: 50 Specific Ways to Improve Your XML. Addison Wesley. ISBN 0-321-15040-6.
W3C (2006). The Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C. http://www.w3.org/TR/2006/REC-xml-20060816/.
Footnotes
[1] This document is based on an original by Alan Sexton © 2007, modified by permission.
[2] Although it's good practice to have an XML declaration; see §12.
[3] This is a slight oversimplification with respect to the 'angle bracket' characters, but is a good rule to follow.
[4] However, since the name of a play is likely to be data rather than meta-data, it's almost certainly poor style.
[5] If attributes are only used for meta-data, it's unlikely that white space will be needed in the first place.