Introduction to XML

XML?

Extensible
XML is extensible in the sense that you are free to invent new elements ad hoc. If you need a new element, you just invent it and use it. We are our own bosses here.
Markup
XML is a markup language, where markup means tags that annotate the data. Markup in this context gives the marked-up data a certain sense of context and meaning. The markup aspect may be the first step towards assigning semantics to the data. There is, however, no necessity in that.
Language
XML is a meta language: a language used to create other languages, or rather, other dialects of itself. We call them implementations; elsewhere they are often called XML applications or vocabularies. XHTML is one example; others are RSS, XSLT, etc. XML itself is defined as a simplified subset of SGML, the mother of all meta languages.
Figure 39.1. A Bit of Language History

XML, What for?

You might argue that XML is a markup language for data where the language does not suggest any particular use of these data.

To compare: HTML5 is a markup language for data where the language strongly suggests usage by a browser for human reading.

Any browser will happily display an XML file.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright NML, 2010. http://deformation.org -->
<canon>
  <book ref="Arn10" mycanon="all">
    <title>Grundbog i Web udvikling</title>
    <authors>
      <author>
        <firstname>Bror</firstname>
        <lastname>Arnfast</lastname>
      </author>
      <author>
        <firstname>Niels Müller</firstname>
        <lastname>Larsen</lastname>
      </author>
    </authors>
    <publisher>
      <name>Academica</name>
      <year>2010</year>
      <place>Aarhus, Dk</place>
    </publisher>
    <comments>yeah</comments>
  </book>
  <book ref="Llo08" mycanon="mdu,all">
    <title>Learn How to Build Web Sites the Right Way from Scratch</title>
    <edition>2</edition>
    <authors>
      <author>
        <firstname>Ian</firstname>
        <lastname>Lloyd</lastname>
      </author>
    </authors>
    <publisher>
      <name>SitePoint</name>
      <year>2008</year>
      <place>Collingwood, VIC, Australia</place>
    </publisher>
    <comments>1. sem: Chapters 1-7</comments>
  </book>
  <book ref="Mrq73" mycanon="all">
    <title>Skemalægning ved numerisk simulation</title>
    <authors>
      <author>
        <firstname>Hans</firstname>
        <lastname>Marqvardsen</lastname>
      </author>
    </authors>
    <publisher>
      <name>IMPOS</name>
      <year>1973</year>
      <place>Lyngby, Dk</place>
    </publisher>
    <comments>
      Licentiatafhandlinger ved IMSOR nr 18.
      LCCN: 80458284 
    </comments>
  </book>
</canon>

See the parsed result in your browser

If nothing in the XML file suggests otherwise, the XML will be displayed as the tree structure it is. We shall later dive into how we may suggest other representations.
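
The tree that the browser shows can also be walked by a program. The sketch below is only an illustration, written in Python with the standard library module xml.etree.ElementTree. The shortened document and the helper name outline are assumptions made for this example; they are not part of the course files.

```python
import xml.etree.ElementTree as ET

# A shortened version of the canon document shown above.
CANON = """<canon>
  <book ref="Llo08" mycanon="mdu,all">
    <title>Learn How to Build Web Sites the Right Way from Scratch</title>
    <authors>
      <author>
        <firstname>Ian</firstname>
        <lastname>Lloyd</lastname>
      </author>
    </authors>
  </book>
</canon>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(CANON)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The indentation of each printed line mirrors the nesting depth of the element, which is the same information the browser's default view conveys.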

Figure 39.2. An XML File Rendered by a Browser

XML, Close Up

<?xml version="1.0" encoding="utf-8"?>
<card>
	<name>Niels Müller Larsen</name>
	<title>Lektor, MSc</title>
	<email>nml@eaaa.dk</email>
	<uri>deformation.org, eaaa.eu, x15.dk</uri>
	<phone>+45 7228 6317</phone>
	<logo uri="nml.png"/>
</card>

See the parsed result in your browser

Structure

Look at XML from a structural viewpoint. It is a tree-structured format reminiscent of HTML5.

Let us first see an element:

<para>element content</para>

This is a start tag:

<para>

This is the corresponding end tag:

</para>

Whatever comes in between is the element content. In DOM speak, an element is one kind of node. Sometimes elements are empty, that is, they have no content. In those cases you may combine the start and end tags into one, like this:

<para/>

From XHTML you may know some compulsorily empty elements, such as <br /> or <img ... />.
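
To see the parts of an element from a program's point of view, here is a small sketch in Python using the standard library parser xml.etree.ElementTree. The variable names are our own choices for the illustration.

```python
import xml.etree.ElementTree as ET

# An element with content: start tag, content, end tag.
para = ET.fromstring("<para>element content</para>")
tag_name = para.tag   # the name used in the start and end tags
content = para.text   # whatever comes in between

# An empty element, written in the combined form and in the long form.
short_form = ET.fromstring("<para/>")
long_form = ET.fromstring("<para></para>")

# Both spellings serialize identically, so the parser treats them alike.
same_element = ET.tostring(short_form) == ET.tostring(long_form)

print(tag_name, repr(content), short_form.text, same_element)
```

For this parser the combined tag and the start/end pair without content are the same thing; which one you write is a matter of taste.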

Attributes

Occasionally you may have a situation like:

<book>
    <id>42</id>
    <title>Hitchhikers Guide to the Galaxy</title>
    ...
</book>

Another XML writer might write:

<book id="42">
    <title>Hitchhikers Guide to the Galaxy</title>
    ...
</book>

Is there a difference? Structurally, yes. Semantically, probably not. The id="42" is called an attribute. Attributes have names, such as id, and they must have values, such as 42. They are also known from XHTML; we reuse the example <img src="x.png" />.
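
The structural difference shows up as soon as a program reads the data. The following is a minimal sketch in Python with xml.etree.ElementTree. The elided content of the two book examples is simply left out, and the variable names are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# The id written as a child element.
as_element = ET.fromstring(
    "<book><id>42</id><title>Hitchhikers Guide to the Galaxy</title></book>"
)

# The id written as an attribute.
as_attribute = ET.fromstring(
    '<book id="42"><title>Hitchhikers Guide to the Galaxy</title></book>'
)

# Same information, but reached in two different ways.
id_from_element = as_element.findtext("id")   # text of the <id> child
id_from_attribute = as_attribute.get("id")    # value of the id attribute

print(id_from_element, id_from_attribute)
```

In both cases the value arrives as a string; the reader of the document must know which of the two forms the writer chose.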

XML Declaration

An XML document usually starts with the XML declaration as its first line:

<?xml version="1.0" encoding="UTF-8"?>

This start makes sense, although in XML 1.0 it is optional. The sense comes from the fact that the user, a program, then knows what kind of document it is and how it is encoded.
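
One place where the declaration visibly matters is the encoding. The sketch below, in Python with xml.etree.ElementTree, stores the same text as ISO-8859-1 bytes with and without a declaration. The sample text and variable names are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

text = "<name>Niels Müller Larsen</name>"
declaration = '<?xml version="1.0" encoding="ISO-8859-1"?>'

# The same content stored as ISO-8859-1 bytes,
# once with and once without the declaration.
with_declaration = (declaration + text).encode("iso-8859-1")
without_declaration = text.encode("iso-8859-1")

# With the declaration the parser knows how to decode the bytes.
name = ET.fromstring(with_declaration).text

try:
    ET.fromstring(without_declaration)
    undeclared_ok = True
except ET.ParseError:
    # Without a declaration the parser assumes UTF-8, and the
    # ISO-8859-1 byte for "ü" is not valid UTF-8.
    undeclared_ok = False

print(name, undeclared_ok)
```

A document saved as UTF-8 would parse without the declaration; the declaration becomes necessary when another encoding is used.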

Entities

Now and then, such as in this course material, one may need to write characters that are themselves part of the markup syntax, or perhaps characters that you don't have on your keyboard. For these purposes we have character entities. To show a tag literally you write:

&lt;br /&gt;

which is displayed as <br />. Written without entities it would be a break element, empty and only circumstantially visible. Or perhaps you need a copyright sign, several blanks (white space), and a bullet, "©   •". As code this would look like: &copy;&nbsp;&nbsp;&nbsp;&bull;

Every entity has a numeric equivalent, a character reference, such as © = &#169;

XML has 5 predefined entities that are always there, and legal. All others have to be declared in a DTD, either inside the document or in a file it references.

The 5 are:

&quot; "
&amp; &
&apos; '
&lt; <
&gt; >

If your book does not describe these, look them up in the Wikipedia article "List of XML and HTML character entity references".
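
The difference between the predefined entities and all the others can be observed with a parser. The following is a sketch in Python using xml.etree.ElementTree and xml.sax.saxutils.escape from the standard library. The sample strings and variable names are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities and numeric references need no DTD.
para = ET.fromstring("<p>&#169; &lt;br /&gt; &amp; &quot;&apos;</p>")
decoded = para.text

# A named entity such as &copy; is not predefined in XML,
# so without a DTD declaring it the parser rejects the document.
try:
    ET.fromstring("<p>&copy;</p>")
    copy_accepted = True
except ET.ParseError:
    copy_accepted = False

# Going the other way: escape markup characters before writing text.
escaped = escape("<br /> & more")

print(decoded, copy_accepted, escaped)
```

The numeric form &#169; is therefore the safe choice when no DTD is available.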

XML, The Constitutional Rules

We distinguish between two levels of good-quality XML. From your (X)HTML experience you may recall the ghost of validation. There we validate in order to achieve a higher level of interoperability, a big word for uniform behavior across browsers, so that no particular browser is forced on the user. In XML, before we validate, we must have well-formedness; otherwise the document is not XML:

  • A document has one single root element
  • Tags are properly nested
  • All elements are properly closed
  • All values of attributes are enclosed in quotes
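
Each of the rules above can be tried out against a parser. The sketch below uses Python's xml.etree.ElementTree, which rejects anything that is not well-formed. The helper name is_well_formed and the sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "well-formed": '<card><name lang="da">Niels</name></card>',
    "two root elements": "<a/><b/>",
    "badly nested tags": "<a><b></a></b>",
    "unclosed element": "<a><b></a>",
    "unquoted attribute value": "<a id=42/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(label, "->", "accepted" if ok else "rejected")
```

Only the first sample is accepted; each of the others breaks exactly one of the four rules.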

Validity is one step further up the ladder. To be valid a document must:

  • Be well-formed
  • Conform to a certain syntactical specification

The latter means validation against a DTD, a Document Type Definition, or against an XML Schema. In this course we, reluctantly, restrict ourselves to DTDs only.

An XML document must be well-formed; it may be valid. Validity is an added quality that gives the programmer an extra layer of protection against bad data. It works much like NOT NULL and CHECK constraints in a database CREATE TABLE statement.
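
To make the distinction concrete: a document can pass the parser and still contain bad data. Python's xml.etree.ElementTree checks well-formedness only and does not validate against a DTD, so the sketch below uses a hand-written check as a stand-in for what a DTD would enforce. The rule itself (every book needs a title element and an id attribute) and all names are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for this example; a real DTD would declare them.
REQUIRED_CHILDREN = ("title",)
REQUIRED_ATTRIBUTES = ("id",)


def problems(book):
    """Return a list of rule violations for one <book> element."""
    found = []
    for name in REQUIRED_CHILDREN:
        if book.find(name) is None:
            found.append("missing element: " + name)
    for name in REQUIRED_ATTRIBUTES:
        if book.get(name) is None:
            found.append("missing attribute: " + name)
    return found


good = ET.fromstring(
    '<book id="42"><title>Hitchhikers Guide to the Galaxy</title></book>'
)
# Well-formed, so the parser accepts it, but it breaks both rules.
bad = ET.fromstring("<book><author>Anonymous</author></book>")

good_problems = problems(good)
bad_problems = problems(bad)

print(good_problems, bad_problems)
```

Both documents are XML; only the first would count as valid under the assumed rules, in the same way a row can be syntactically fine and still violate a NOT NULL constraint.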