Introduction to XHTML
Peter Coxhead [1]
Contents
1 XHTML
2 Strictly Conforming XHTML 1.0
3 XHTML Compatibility Issues
References/Bibliography
1 XHTML
XHTML is basically just an 'XML-ized' version of HTML. It supports all the essential HTML elements and attributes. However, because it is in XML syntax, it must be well-formed. This stricter syntax is much more easily processed by tools, e.g. search engines, semantic web tools, screen scrapers for price comparisons, etc. One important constraint to remember is that since XML names are case sensitive, all XHTML element and attribute names must be in lower case.
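To illustrate why well-formedness matters to tools, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree parser to test a few fragments. The helper name is_well_formed and the sample fragments are assumptions made for this illustration; the parser checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample fragments, chosen only for illustration.
samples = {
    "closed and lower case": "<p>Hello<br /></p>",
    "unclosed br (tolerated in HTML)": "<p>Hello<br></p>",
    "mismatched case": "<P>Hello</p>",
    "unquoted attribute value": "<p class=note>Hello</p>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first fragment is accepted. Note that a fragment such as <P>Hello</P> would pass this test, because it is well-formed XML; it is the XHTML DTD, not the XML parser, that requires lower-case names.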
There are a number of different versions of the XHTML standard:
- XHTML 1.0 Strict is the same as HTML 4.01 Strict, but follows XML syntax rules. In particular, it does not allow the use of presentational elements such as center, font or strike. Presentational attributes such as align, bgcolor and background are similarly banned. Furthermore, it does not allow frames or applets.
- XHTML 1.0 Transitional extends XHTML 1.0 Strict to allow some common deprecated elements and attributes to be used. When starting with a legacy HTML page that you wish to translate to XHTML, it is often best to make it work as an XHTML 1.0 Transitional page first (which mostly involves making it into well-formed XML) before moving on to make the changes necessary to turn it into XHTML 1.0 Strict (which often involves replacing presentational aspects with CSS styling).
- XHTML 1.0 Frameset adds frames to XHTML 1.0 Transitional but removes some of the legacy items that are deprecated in HTML 4.01 Strict.
- XHTML 1.1, Module-based XHTML, is a newer version of XHTML which is basically XHTML 1.0 Strict, but rewritten in a modularized form making much use of advanced DTD inclusions, overriding, conditional constructs and namespaces. The end result is a specification and set of DTDs that can be mixed and matched with other XML languages such as MathML, SVG and others.
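Each of these versions is identified by the public identifier in a document's DOCTYPE declaration (described in Section 2). As a rough sketch, the following Python function reports which version a document claims to be. The public identifiers are those published in the W3C XHTML 1.0 and XHTML 1.1 specifications; the function name xhtml_version and the regular expression are assumptions made for this illustration, and the code does not check that the document actually conforms to the version it names.

```python
import re
from typing import Optional

# Public identifiers from the W3C XHTML 1.0 and XHTML 1.1 specifications,
# mapped to the version names used in the list above.
PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "XHTML 1.0 Frameset",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}

# Captures the quoted public identifier of an html DOCTYPE declaration.
DOCTYPE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"')


def xhtml_version(source: str) -> Optional[str]:
    """Return the XHTML version named by the DOCTYPE, or None if unknown."""
    match = DOCTYPE.search(source)
    if match is None:
        return None
    return PUBLIC_IDS.get(match.group(1))


example = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)
print(xhtml_version(example))
```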
Only XHTML 1.0 Strict will be covered in this module. My selection of core XHTML constructs is given in "Summary of Core XHTML".
There are many XHTML references online. Miroslav Nic's site can be recommended; the Wikipedia entry for XHTML has a number of links to specifications, references and online XHTML validators. The XHTML 1.0 standard is quite readable. See the References/Bibliography section below.
2 Strictly Conforming XHTML 1.0
The XHTML 1.0 specification introduced a new type of correctness to add to XML's well-formedness and validity: strictly conforming. This involves meeting a number of requirements (paraphrased or copied from the XHTML 1.0 standard):
- It must be valid according to one of the XHTML 1.0 DTDs.
- The root element must be html.
- The root element must contain an XML namespace declaration for the XHTML namespace. The XHTML namespace is defined to be "http://www.w3.org/1999/xhtml". An example root element is:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
- There must be a DOCTYPE declaration in the document prior to the root element. For XHTML 1.0 Strict this must be:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- An internal DTD subset, if one is present, must not be used to override any parameter entities in the XHTML DTD.
Thus a simple example of a strictly conforming XHTML 1.0 Strict document is the following:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://example.org/">example.org</a>.</p>
  </body>
</html>
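Some of the requirements listed above can be checked mechanically. The following is a minimal sketch using Python's standard xml.dom.minidom module; the function name conformance_problems and the wording of its messages are assumptions made for this illustration. It checks the root element, the namespace and the DOCTYPE only. It does not validate the document against the DTD, which the standard library cannot do, so an empty result does not by itself prove strict conformance.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
STRICT_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"
STRICT_SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"


def conformance_problems(source: str) -> list:
    """List the structural requirements that `source` fails (empty if none).

    Raises xml.parsers.expat.ExpatError if `source` is not well-formed.
    Validity against the DTD is NOT checked.
    """
    doc = minidom.parseString(source)
    problems = []

    root = doc.documentElement
    if root.tagName != "html":
        problems.append("root element is not html")
    if root.namespaceURI != XHTML_NS:
        problems.append("root element is not in the XHTML namespace")

    doctype = doc.doctype
    if doctype is None:
        problems.append("no DOCTYPE declaration")
    elif (doctype.publicId, doctype.systemId) != (STRICT_PUBLIC_ID, STRICT_SYSTEM_ID):
        problems.append("DOCTYPE is not the XHTML 1.0 Strict one")

    return problems


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Virtual Library</title></head>
  <body><p>Moved to <a href="http://example.org/">example.org</a>.</p></body>
</html>"""

print(conformance_problems(SAMPLE))
print(conformance_problems("<html><head><title>t</title></head><body></body></html>"))
```

The first call reports no problems; the second reports the missing namespace declaration and the missing DOCTYPE.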
3 XHTML Compatibility Issues
When a web server sends a document to a client, e.g. a browser, it also sends information on what kind of document it is. There are several ways in which this information can be presented, but the trigger is often the document's extension. XHTML can be 'served' (i.e. presented) to a browser in two quite distinct modes:
- As if it were just a variant of HTML, corresponding to the extensions '.html' or '.htm'.
- As if it were XML, or in particular the specific XHTML subset of XML, corresponding to the extensions '.xml' or '.xhtml'.
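In practice the server conveys the document kind in the HTTP Content-Type header. The following Python sketch shows how a server might map the extensions mentioned above to media types. The media types themselves (text/html, application/xhtml+xml, application/xml) are the registered ones; the table, the function names and the fallback value are assumptions made for this illustration, and real servers are configured in their own ways.

```python
from pathlib import PurePosixPath

# Extension-to-media-type table mirroring the two serving modes above.
# The mapping is an illustrative assumption about server configuration.
CONTENT_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".xml": "application/xml",
}


def content_type_for(filename: str) -> str:
    """Return the media type a server might send for `filename`."""
    suffix = PurePosixPath(filename).suffix.lower()
    return CONTENT_TYPES.get(suffix, "application/octet-stream")


def serving_mode(filename: str) -> str:
    """Classify `filename` as served in 'HTML' mode, 'XML' mode or 'other'."""
    media_type = content_type_for(filename)
    if media_type == "text/html":
        return "HTML"
    if media_type.endswith("xml"):
        return "XML"
    return "other"


for name in ("index.html", "index.xhtml", "data.xml"):
    print(name, content_type_for(name), serving_mode(name))
```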
In the first mode, browsers typically handle XHTML in much the same way that they handle HTML. They do not enforce well-formedness, validity or conformity to any greater degree than they do with normal HTML (i.e. very little!).
In the second mode, browsers should enforce all three of these requirements. However, XHTML served as XML/XHTML is currently poorly supported by web browsers. In particular, Microsoft browsers up to Version 7 of Internet Explorer do not support it at all. Hence, for compatibility reasons, it is usually best to write XHTML so that it can be interpreted as pure HTML and put it into documents with the extension '.html'. This can be done without losing the advantages of writing proper XHTML and is now the recommended approach for all web page development.
However, it does require writing a somewhat 'dumbed down' version of XHTML. A full list of guidelines will be found in Appendix C of the XHTML 1.0 standard. Some of the most important are briefly described here:
- Remove processing instructions and the XML declaration. Some browsers will take the XML declaration to mean that this is really an XML document and not an HTML (or XHTML) document at all and therefore render the page incorrectly. The XML declaration is optional in XML documents (although its absence means that the character encoding must be UTF-8, UTF-16 or a subset thereof).
- Write empty elements with a space after the element name. Many user agents will not correctly handle empty elements written as, e.g., <br/> (possibly mistaking the name of the element as br/) or as <br></br> (possibly interpreting this as two independent br elements). The safe way to write such elements is as <br />.
- Never use the shortcut form for empty elements that do not have to be empty. Elements like the paragraph element p or the script element can be empty, although they usually are not. If you need to use an empty paragraph, never shorten it to <p />. Instead always leave it as <p></p> (although some validators will then complain about an empty paragraph).
- Try to avoid using internal style sheets or scripts that contain markup characters, such as < or &. It is common in HTML to enclose internal style sheets and scripts in comments in order to make them available to the style and script processors while not interfering with a legacy browser's rendering of the HTML code. However, XML parsers are allowed to silently discard XML comments. Hence internal scripts and styles may be discarded before their processors get to see them, so that scripts that use the above mentioned characters may not be processed correctly by XML parsers. The problem is explored further in Summary of Core XHTML, under the heading Validity.
- Be careful with anchor names. In HTML, the a element with a name attribute is used to identify the destination of a 'fragment' (a URL with a # followed by some name at the end). This allows linking to a location within a document rather than just to the document as a whole. Other elements can have a name attribute for the same purpose. In XHTML, the use of such a name attribute is deprecated. Instead you should use an id attribute. However, for compatibility with all browsers, you need to provide both a name and an id attribute with the same identifier value. Use only the more restricted XML syntax for identifier names (see Introduction to XML, §1).
- Be careful with boolean attributes, i.e. HTML attributes that are either switched on or off. The short (minimized) form that is common in HTML is to write only the attribute name for true values and to leave out the entire attribute for false values. Thus, for example, instead of ismap="ismap", it has been more common to write only ismap. This is, of course, not allowed in XML, where every attribute must have a value. However, some browsers will not correctly handle the full form of boolean attributes, which makes ensuring compatibility difficult.
- Always use the ampersand entity (&amp;) instead of an ampersand character in character data and attribute values. Many browsers allow an & character to be interpreted as a simple & if it does not appear to be the beginning of an entity reference. This is not allowed at all in XHTML.
- If you need to escape a single quote character, use &#39; or &#x27; instead of &apos;. The apos character entity is predefined in XML but not in HTML. The numeric character entities will work in both.
- Be careful with white space characters. Some characters that are treated as white space in HTML documents are not in XML documents. For example, in HTML, the formfeed character is treated as white space; in XHTML it is illegal. The only white space characters allowed in XML (and hence XHTML) are space, newline, return and tab.
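Several of the guidelines above are simple enough to check with pattern matching. The following Python sketch flags some of them. The function name compatibility_warnings, the message texts and the regular expressions are assumptions made for this illustration; the set of empty elements is taken from the XHTML 1.0 Strict DTD. Regular expressions do not understand comments, CDATA sections or script content, so this is a rough aid and not a substitute for a validator.

```python
import re

# Elements declared EMPTY in the XHTML 1.0 Strict DTD.
EMPTY_ELEMENTS = {
    "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param",
}

# A tag written in the XML shortcut form, e.g. <br/>, <br /> or <p />.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*)/>")

# An ampersand that does not start a named or numeric character reference.
BARE_AMPERSAND = re.compile(r"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)")


def compatibility_warnings(source: str) -> list:
    """Return warnings for markup that may confuse HTML-only user agents."""
    warnings = []
    if "<?" in source:
        warnings.append("contains an XML declaration or processing instruction")
    for match in SELF_CLOSING.finditer(source):
        name, rest = match.group(1), match.group(2)
        if name not in EMPTY_ELEMENTS:
            warnings.append(f"<{name} /> should be written <{name}></{name}>")
        elif not rest.endswith(" "):
            warnings.append(f"<{name}/> needs a space before '/>'")
    if BARE_AMPERSAND.search(source):
        warnings.append("bare '&' should be written '&amp;'")
    if "&apos;" in source:
        warnings.append("'&apos;' should be written '&#39;'")
    if "\f" in source:
        warnings.append("form feed is not legal white space in XML")
    return warnings


risky = '<?xml version="1.0"?><p>Fish & chips<br/>it&apos;s<p /></p>'
safe = "<p>Fish &amp; chips<br />it&#39;s<span></span></p>"

for message in compatibility_warnings(risky):
    print(message)
print(compatibility_warnings(safe))
```

When generating markup from Python, the standard library function html.escape follows the same advice for quotes: it replaces a single quote with the numeric reference &#x27; rather than &apos;.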
References/Bibliography
See the other handouts for this module, which are available online.
Goodman, Danny (2007). Dynamic HTML: The Definitive Reference. O'Reilly. ISBN 0-596-52740-3. [This is an excellent and detailed reference manual for browser compatibility of all kinds of HTML, CSS, JavaScript, and DOM related issues. However, it is purely a reference and not suitable for learning these topics in the first instance.]
Nic, Miroslav (2002). XHTML 1.0 reference with examples. http://www.zvon.org/xxl/xhtmlReference/Output/index.html. [Excellent online reference for XHTML.]
W3C (2002). XHTML 1.0 The Extensible HyperText Markup Language (Second Edition). http://www.w3.org/TR/xhtml1/.
Wikipedia (undated). Wikipedia Entry for XHTML. http://en.wikipedia.org/wiki/XHTML.
Footnotes
[1] This document is based on an original by Alan Sexton © 2007, modified by permission.