eXtensible Markup Language
XML is a meta markup language, meaning it tells you what form the markup takes, not what markup is allowed.
"Specific" markup languages are called applications of XML. Examples would be things like
and thousands more. XML is used EVERYWHERE.
XML was designed to be
The official specification is at http://www.w3.org/TR/REC-xml. There is a nice "annotated version" of an older spec at http://www.xml.com/axml/testaxml.htm.
<people>
<person social="235432099">
<name>
<first>Seán</first>
<last>Mchunu</last>
</name>
<job>Teacher</job>
<job salaried="no">Clerk</job>
<birthdate>1975-06-22</birthdate>
<married spouse="355641111"/>
<picture src="http://smchunu.name/me.jpg" width="60" height="80"/>
<birthplace>
<city>Los Angeles</city>
<country>us</country>
</birthplace>
</person>
</people>
When viewed as a character sequence, an XML document has
Character Data
Markup: start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions. (More on these later.)
XML documents are always Unicode character sequences. If you have a wimpy text editor, you can always use character references, for example:
<greeting>&#1055;&#1088;&#1080;&#x432;&#x435;&#1090;, &#x41c;&#1080;&#x440;!</greeting>
Note how you can use hex or decimal for the codepoints.
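To see that hex and decimal character references really do denote the same kind of thing, here is a small sketch using Python's standard xml.etree.ElementTree parser. The helper to_char_refs is a hypothetical name introduced only for this illustration; it does not escape markup characters.

```python
import xml.etree.ElementTree as ET

# The greeting from the text, mixing decimal (&#1055;) and hex (&#x432;) references.
SOURCE = ("<greeting>&#1055;&#1088;&#1080;&#x432;&#x435;&#1090;, "
          "&#x41c;&#1080;&#x440;!</greeting>")

greeting = ET.fromstring(SOURCE)
print(greeting.text)  # → Привет, Мир!

def to_char_refs(text):
    """Replace every non-ASCII character with a hex character reference.

    Hypothetical helper for illustration; '<' and '&' are left untouched.
    """
    return "".join(c if ord(c) < 128 else "&#x%x;" % ord(c) for c in text)

# Round trip: the ASCII-only form parses back to the same characters.
ascii_only = "<greeting>" + to_char_refs(greeting.text) + "</greeting>"
print(ascii_only.isascii())
print(ET.fromstring(ascii_only).text == greeting.text)
```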
The document defines a structured object:

This shows elements and attributes. Note the difference between elements and tags.
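One way to get a feel for the element/attribute/tag distinction is to load a cut-down copy of the example into a tree and poke at it. This is only a sketch using Python's xml.etree.ElementTree; the trimmed document below is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# A shortened version of the people document from the text.
DOC = """<people>
  <person social="235432099">
    <name><first>Seán</first><last>Mchunu</last></name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <married spouse="355641111"/>
  </person>
</people>"""

root = ET.fromstring(DOC)
person = root.find("person")

# Attributes belong to an element and form a name -> value mapping.
print(person.attrib)

# Elements are nodes in the tree. The <job> and </job> tags are only the
# markup that delimits each element in the character sequence.
jobs = [job.text for job in person.findall("job")]
print(jobs)

# An empty-element tag still yields an element; it just has no content.
married = person.find("married")
print(married.get("spouse"), len(married), married.text)
```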
There are actually 7 kinds of nodes
More on these later.
A document is well-formed if
Documents that are not well-formed should be rejected by a processing program.
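As a rough illustration of "rejecting" a document, the sketch below leans on Python's xml.etree.ElementTree, which raises ParseError for input that is not well-formed. The function name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<a><b>ok</b></a>"))   # properly nested
print(is_well_formed("<a><b>bad</a></b>"))  # overlapping elements
print(is_well_formed("<a/><b/>"))           # more than one root element
```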
Find the XML grammar in the spec, or here.
A few of the grammar rules:
document ::= prolog element Misc*
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S ::= (#x20 | #x9 | #xD | #xA)+
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
Eq ::= S? '=' S?
VersionNum ::= '1.0'
Misc ::= Comment | PI | S
element ::= EmptyElemTag | STag content ETag
STag ::= '<' Name (S Attribute)* S? '>'
Attribute ::= Name Eq AttValue
ETag ::= '</' Name S? '>'
content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
.
.
.
This means an XML document
- Starts with an optional XML declaration
- Then has an optional Document Type Declaration
- Then has a single Element
- Spaces, comments, and processing instructions can appear sprinkled throughout (but only in certain places)
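The overall shape described above can be observed with Python's xml.dom.minidom, which keeps comments and the document type declaration as nodes. This is a sketch only: the system identifier people.dtd is a placeholder that is never fetched, and the KINDS table is a local convenience, not part of any API.

```python
from xml.dom import minidom, Node

DOC = """<?xml version="1.0"?>
<!-- a comment in the prolog -->
<!DOCTYPE people SYSTEM "people.dtd">
<people/>
<!-- a comment after the single root element -->"""

# Local lookup table for readable output (not part of minidom).
KINDS = {
    Node.COMMENT_NODE: "comment",
    Node.DOCUMENT_TYPE_NODE: "doctype",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "pi",
}

doc = minidom.parseString(DOC)
kinds = [KINDS[node.nodeType] for node in doc.childNodes]
print(kinds)

# Exactly one of the top-level nodes is an element: the document element.
print(doc.documentElement.tagName)
```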
As with all non-trivial languages, there are some things you can't express in a context-free grammar:
An element must have either (1) a start AND an end tag, or (2) an empty-element tag only (i.e., no optional tags)
The name in an end tag must match the name in the corresponding start tag (i.e., no overlapping elements)
Attributes must have the form name=value
Attribute values must be quoted
No '<' characters are allowed in attributes
No external entities are allowed in attributes
Attributes in an element must be unique
Entity references can only refer to entities that have been declared (however the five entities &lt; &gt; &apos; &quot; &amp; are pre-declared for you)
No isolated markup characters are allowed in text
< ==> &lt;
& ==> &amp;
]]> ==> ]]&gt;
Entity references cannot contain the name of an unparsed entity
Entity references cannot be directly or indirectly recursive
Parameter entity references can only appear in a DTD
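The rule about isolated markup characters is usually handled by escaping text before it goes into a document. Here is a minimal sketch using escape and quoteattr from Python's xml.sax.saxutils; the sample strings and the note element are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_text = "if a < b && c then ]]> done"
raw_attr = 'say "hi" & <bye>'

# escape() replaces &, < and > with the predefined entity references.
escaped = escape(raw_text)
print(escaped)  # → if a &lt; b &amp;&amp; c then ]]&gt; done

# quoteattr() escapes the value and also chooses and adds the quotes.
doc = "<note text=" + quoteattr(raw_attr) + ">" + escaped + "</note>"

# The parser undoes the escaping, so the original strings come back.
note = ET.fromstring(doc)
print(note.text == raw_text, note.get("text") == raw_attr)
```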
Example of an XML declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
If present, it must be the first thing in the document: no whitespace or comments may precede it. Only a Unicode byte-order mark may come before it, and that is an encoding signature rather than part of the document's content. That way a processor can guess the encoding well enough to get to the encoding declaration.
BOM present
-----------
00 00 fe ff    UTF-32BE (1234)
ff fe 00 00    UTF-32LE (4321)
00 00 ff fe    UTF-32 weird (2143)
fe ff 00 00    UTF-32 weird (3412)
fe ff 00 3c    UTF-16BE
ff fe 3c 00    UTF-16LE
ef bb bf       UTF-8

No BOM, gets you started, though
--------------------------------
00 00 00 3c    UTF-32
3c 00 00 00    "
00 00 3c 00    "
00 3c 00 00    "
00 3c 00 3f    UTF-16BE
3c 00 3f 00    UTF-16LE
3c 3f 78 6d    UTF-8, Latin-1, ASCII, etc.
4c 6f a7 94    EBCDIC
other
Encoding could be utf-8, utf-16, iso-10646-UCS-2, iso-10646-UCS-4, iso-8859-1, ..., iso-8859-15, iso-2022-jp, Shift_JIS, EUC-JP, ...
Standalone must be "no" (or omitted) whenever the document refers to entities that are externally declared.
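The byte patterns listed above translate fairly directly into a lookup on the first few bytes. The following is a simplified sketch, not a complete detector: sniff_encoding is a hypothetical name, the two unusual UTF-32 byte orders are left out, and for the ASCII-compatible case a real processor would go on to read the encoding declaration.

```python
# Longer signatures come first so that ff fe 00 00 (UTF-32LE) is not
# mistaken for ff fe (UTF-16LE).
WITH_BOM = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]

# Without a BOM, the document is assumed to start with "<?xm".
WITHOUT_BOM = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x3c\x3f\x78\x6d", "UTF-8"),   # ASCII-compatible: read the declaration next
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

def sniff_encoding(data):
    """Guess an encoding family from the leading bytes of a document."""
    for signature, name in WITH_BOM + WITHOUT_BOM:
        if data.startswith(signature):
            return name
    return "UTF-8"  # no BOM and no recognizable declaration

print(sniff_encoding('<?xml version="1.0"?>'.encode("utf-16-le")))
print(sniff_encoding(b"\xef\xbb\xbf<people/>"))
```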
A document type definition (DTD) "explains precisely which elements and entities may appear where in a document, and what those elements' contents and attributes are". A DTD for the example document above:
<!ELEMENT people (person*)>
<!ELEMENT person (name, job*, birthdate, married?, picture?,
birthplace?)>
<!ATTLIST person
social ID #REQUIRED>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT job (#PCDATA)>
<!ATTLIST job
salaried (yes|no) "yes">
<!ELEMENT married EMPTY>
<!ATTLIST married
spouse IDREF #REQUIRED>
<!ELEMENT picture EMPTY>
<!ATTLIST picture
src CDATA #REQUIRED
width CDATA #IMPLIED
height CDATA #IMPLIED>
<!ELEMENT birthplace (city, (state|province)?, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT province (#PCDATA)>
<!ELEMENT country (#PCDATA)>
A document states what DTD it is using in its document type declaration. The DTD can be in another file on the same machine
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "mydts/people.dtd">
<people>...</people>
or at some other URI
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "http://someother.place.com/people.dtd">
<people>...</people>
or embedded directly within the XML document itself
<?xml version="1.0"?>
<!DOCTYPE people [
<!ELEMENT ...>
...
]>
<people>...</people>
or part external and part internal
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "name.dtd" [
<!ELEMENT people (person*)>
]>
<people>...</people>
Element content - #PCDATA, sequences, choice, ?, *, +, ",", |, grouping with parentheses, mixed content, EMPTY, ANY.
Attribute types - CDATA, NMTOKEN, NMTOKENS, Enumeration (in which each value must be a name token), ENTITY, ENTITIES, ID, IDREF, IDREFS, NOTATION.
Attribute defaults - #REQUIRED, #IMPLIED, #FIXED followed by a value, or a literal default value.
A document that conforms to its DTD is valid. A document can be well-formed but not valid. You can write a "validator".
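To give a flavor of what a validator does, note that an element content model is essentially a regular expression over child element names. The sketch below checks only that one aspect for person elements. It is a toy under stated assumptions, not a DTD validator: the regex is a hand translation of the person content model, and PERSON_MODEL and children_match are names invented here.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (name, job*, birthdate, married?, picture?, birthplace?)
# into a regex over comma-terminated child names.
PERSON_MODEL = re.compile(
    r"name,(job,)*birthdate,(married,)?(picture,)?(birthplace,)?"
)

def children_match(element, model):
    """True if the element's child names, in order, satisfy the model."""
    sequence = "".join(child.tag + "," for child in element)
    return model.fullmatch(sequence) is not None

good = ET.fromstring(
    "<person><name/><job/><job/><birthdate/><married/></person>"
)
bad = ET.fromstring("<person><birthdate/><name/></person>")

print(children_match(good, PERSON_MODEL))  # children appear in the declared order
print(children_match(bad, PERSON_MODEL))   # well-formed, but name comes too late
```

Both documents are well-formed; only the first satisfies the content model, which is the distinction between well-formed and valid.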
The DTD can also contain entity declarations. In your document, or elsewhere in the DTD, you can make entity references to these entities.
A general entity is defined in the DTD
<!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
and referenced in the document
<footer>This program is &notice;</footer>
A parameter entity is defined in the DTD
<!ENTITY % weekdays "Mo|Tu|We|Th|Fr">
and also referenced in the DTD
<!ATTLIST meeting day (Su|%weekdays;|Sa) "Fr">
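General entity expansion can be watched in action with Python's xml.etree.ElementTree, which expands entities declared in the internal DTD subset. The sketch assumes the copyright sign is written as the hex character reference for U+00A9 and wraps the footer element in its own small document.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE footer [
  <!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
]>
<footer>This program is &notice;</footer>"""

# The parser replaces &notice; with the declared replacement text.
footer = ET.fromstring(DOC)
print(footer.text)  # → This program is Copyright © 2003 Ticketmaster

# Referring to an entity that was never declared is a well-formedness error.
try:
    ET.fromstring("<footer>This program is &undeclared;</footer>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```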
Special attributes:
xml:space - has either the value default or preserve.
xml:lang - identifies the language used in this element.
Mixed Content: If the DTD contains a declaration like
<!ELEMENT A (#PCDATA | B | C)*>
then A has a mixed content model and the content of A contains arbitrary character data with any number of B's and C's mixed in, in any order.
CDATA Sections: A CDATA section is used to "escape" characters which would otherwise be markup. They look like:
<![CDATA[Blah<Blah>Blah]]>
They can't nest — the first "]]>" ends the section!
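Here is a short sketch of both points, using Python's xml.etree.ElementTree: a parser hands the content of a CDATA section back as ordinary character data, and text that itself contains "]]>" has to be split across two sections. The helper to_cdata is a hypothetical name for this illustration.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, < and > are plain characters, not markup.
code = ET.fromstring("<code><![CDATA[Blah<Blah>Blah]]></code>")
print(code.text)  # → Blah<Blah>Blah

def to_cdata(text):
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# This string contains "]]>", so one CDATA section cannot hold it.
tricky = "a[b[0]]>c"
wrapped = to_cdata(tricky)
print(wrapped)

# The two adjacent sections are read back as one run of character data.
again = ET.fromstring("<code>" + wrapped + "</code>")
print(again.text == tricky)
```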
Comments begin with <!-- and end with --> and may not contain -- at all (except as part of the closer).
Example of a processing instruction
<?php ......... ?>
Note the XML declaration is NOT a PI.
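A parser that exposes processing instructions makes this visible. In the sketch below, Python's xml.dom.minidom reports the php instruction as a node with a target and data, while the XML declaration does not show up as a node at all. The sample document is invented for the example.

```python
from xml.dom import minidom, Node

doc = minidom.parseString('<?xml version="1.0"?><?php echo "hi"; ?><page/>')

# The first child of the document is the processing instruction.
pi = doc.firstChild
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)        # the name right after "<?"
print(pi.data.strip())  # everything else, passed through to the application

# Only two top-level nodes exist: the PI and the root element.
node_types = [node.nodeType for node in doc.childNodes]
print(len(node_types))
```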
Often you might want to make a document by mixing content from two or more separate XML applications (e.g. XHTML + MathML + SVG + RDF, say). Element and attribute names may conflict! Namespaces can partition them.
A namespace is really just a URI, but you bind prefixes to it with a namespace declaration. A namespace declaration is NOT an attribute declaration; it only looks like one.
Here is an example, slightly modified from XML in a Nutshell by Harold and Means:
<?xml version="1.0"?>
<htm:html xmlns:htm="http://www.w3.org/1999/xhtml"
xmlns:xlink="http://www.w3.org/1999/xlink">
<htm:head><htm:title>Three Namespaces</htm:title></htm:head>
<htm:body>
<htm:h1>An ellipse and a rectangle</htm:h1>
<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="12cm" height="10cm">
<svg:ellipse rx="110" ry="130"/>
<svg:rect x="4cm" y="1cm" width="3cm" height="6cm"/>
</svg:svg>
<htm:p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</htm:p>
<htm:p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</htm:p>
</htm:body>
</htm:html>
Actually you really want to take advantage of namespace defaulting:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:xlink="http://www.w3.org/1999/xlink">
<head><title>Three Namespaces</title></head>
<body>
<h1>An ellipse and a rectangle</h1>
<svg xmlns="http://www.w3.org/2000/svg" width="12cm" height="10cm">
<ellipse rx="110" ry="130"/>
<rect x="4cm" y="1cm" width="3cm" height="6cm"/>
</svg>
<p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</p>
<p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</p>
</body>
</html>
If you're validating, you can get away without the default specifiers because the DTDs usually sp
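The prefixed and defaulted forms are two spellings of the same names. A quick way to convince yourself is to parse both and compare the expanded names, which Python's xml.etree.ElementTree writes as {namespace-URI}local-name. The two documents below are trimmed-down versions of the examples, and names is a helper invented for this sketch.

```python
import xml.etree.ElementTree as ET

PREFIXED = """<htm:html xmlns:htm="http://www.w3.org/1999/xhtml">
  <htm:body>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg>
  </htm:body>
</htm:html>"""

DEFAULTED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>
  </body>
</html>"""

def names(text):
    """Expanded element names in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]

for name in names(PREFIXED):
    print(name)

# The prefixes are gone after parsing; only the URI and local name remain.
print(names(PREFIXED) == names(DEFAULTED))
```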
XML documents should contain structure, not presentation.
Presentation is specified in a style sheet. Connect a style sheet
to an XML document with the xml-stylesheet
processing instruction. For example:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?xml-stylesheet type="text/css" href="simple.css" ?>
<!DOCTYPE person [
<!ELEMENT person (name,phone*)>
<!ATTLIST person id CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<person id="123456789">
<name>Alice</name>
<phone>8005551212</phone>
<phone>8885551212</phone>
</person>
is connected to this stylesheet
name {
display: block;
font-size: 16pt;
font-weight: bold;
text-align: center;
color: white;
background-color: blue;
}
phone {
display: block;
font-size: 12pt;
text-align: left;
color: black;
background-color: pink;
}
and the result looks like this:

The main style languages are CSS and XSL-FO. You might have to transform your XML before styling it; use XSLT for that.
Need more info? See