SSC1, 2010/2011, XML

XML in this module

XML was already introduced in notes on XML from Information and the Web in the first year.

In this module, we consider XML as a textual notation for tree-shaped data structures. Such data types are ubiquitous in computing. They are sometimes called "algebraic" data types. In Java, they can be represented by objects having references to other objects. In C, one would use structures and pointers for the same purpose. Modern functional language such as CAML have a datatype construct with built-in pattern matching; and this is so useful it may appear in C# in the future.

XHTML

XHTML is the modern standard for Web pages, superseding the messy HTML 4.01 Transitional. It represents a web page as a set of nested boxes of different kinds. The Web browser parses the XHTML into a tree and renders it as nested boxes according to a CSS file.

See the notes on XHTML from Information and the Web.

XML, trees, and data types

Note that the following situations are isomorphic. Isomorphic means, roughly, that we can translate back and forth while preserving all relevant structure. (Another example of an isomorphism is given by serializing a Java object and reading it back into memory.)

  1. Ordered trees with labels at the node. For example, the root of the tree has label A, its left child has label B and its right child label C. C has an only child with label D, and D itself contains some text.
  2. A sequence of nested boxes of different kinds. For example, a box of kind A contains a box of kind B followed by a box of kind C. The latter contains a single box of kind D. More concretely, the outer box could be an XHTML list environment, and both B and C could be list items.
  3. A fragment of XML defining an element A containing elements B and C:
    <A>
        <B>
        </B>
        <C>
            <D>
            Fnord
            </D>
        </C>
    </A>
    
  4. An expression in fully bracket prefix form. This is the syntax of the programming language Lisp, where it is called an S-expression. For example
     
         (A (B) (C (D "Fnord"))) 
    
  5. A Java object of class A, where we have the class definitions:
    class A { B b; C c; }
    
    class B { }
    
    class C { D d; }
    
    class D { String fnord; } 
    This style of representing trees with objects is called the Composite Pattern. Note that this pattern uses very precise typing. For arbitrary XML structures, we have to use a more permissive tree representation, where all nodes have the same type and a list of children. The information whether a node is an A or B would then not be part of the class, but would be kept in a variable.

Well-formed XML

An XML file is legal or well-formed if it meets certain syntactic conditions. The most important of these conditions is that all the tags match. If the tags do not match, we do not even get a tree structure without ambiguity. In that case, the XML parser has to reject the input as gibberish.

Valid XML

If an XML file is well-formed, it may still not have the structure we expect it to have. For instance, do we want an element C to have a D child?

In XML, a Document Type Definition (DTD) imposes extra structure on XML files. XML is called valid if it has the required structure. An XML parser can check these conditions automagically. This means we have to do less manual input validation. We also get a better fit between XML and the data structures we can construct from it.

The following DTD describes our XML example. Note the similarity to the types in the Composite Pattern.

<!ELEMENT A (B,C)>
<!ELEMENT B EMPTY>
<!ELEMENT C (D)>
<!ELEMENT D (#PCDATA)>

The constructs for specifying the children of an element in a DTD are based on regular expressions. One has sequential composition, alternative, and repetition, written as the Kleene star *.

DTD have limitations. More powerful alternatives include XML Schema and Relax NG. The latter has a reasonable syntax based on grammars and regular expressions.

Tree parser for XML: DOM

Some parsers contruct a tree of objects in memory for the whole XML input file. We navigate the tree by calling methods that return the list of children and other information at each node. This is standard Java tree programming.

Pull parser for XML: StAX

When keeping the whole tree in memory is too inefficient, one can use a "pull" parser that only parses nodes as they are needed. StAX represent the outcome of parsing as a stream of events. For example, a streaming parser could tell us that it has seen the beginning of an A element, followed by the beginning of a B element, followed by the end of a B element, followed by the beginning of a C element, and so on and so forth.

The major difference in programming with a tree parser and a streaming parser is how to handle state.