XML

XML is the Extensible Markup Language and is the grandchild of GML, the Generalised Markup Language started by IBM in 1969. GML grew into SGML, the Standard Generalised Markup Language. XML and HTML were based on SGML. Current common uses for XML include the documents produced by LibreOffice, OpenOffice, and Microsoft Office. See OASIS Standards for open standards.

I used a simplification of GML for data passed from a large scale online system to a huge data repository. SGML headed in the same direction and was too generic for most uses. XML hit the right combination of simplicity and flexibility for most uses. XML replaced most of the earlier attempts at packaging data for movement between systems.

XML lets you define a scope for your data in a DTD, a Document Type Definition. The DTD defines what you can and cannot include. The DTD is not needed when reading XML data. You can assume the file has the correct range of data and reject the data, or the whole file, when anything else is included. The DTD can be used when creating an XML file to restrict what goes into the file, or a programmer can use the DTD as a specification when writing code that creates a file in the right format.
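The "reject anything else" approach can be sketched without a DTD. The following is a minimal illustration using Python's standard library, chosen only for the example. The ALLOWED set and the function name are hypothetical stand-ins for the rules a DTD would carry, and the check covers element names only, not structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set standing in for a DTD: the only element names allowed.
ALLOWED = {"file", "weights", "items", "item", "name", "weight"}

def unexpected_elements(xml_text):
    """Return the sorted names of elements that are not in ALLOWED."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED})

good = "<items><item><name>statue</name><weight>20</weight></item></items>"
bad = "<items><item><name>statue</name><colour>grey</colour></item></items>"

print(unexpected_elements(good))  # []
print(unexpected_elements(bad))   # ['colour']
```

A reader could reject the whole file when the list is not empty, or drop only the items that contain an unexpected element.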

XML files compress well because they are text. XML is often repetitive, and common compression routines are good at reducing the space used by the repetition.
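A quick way to see the effect is to compress a deliberately repetitive file. This sketch uses Python's gzip module and invented sample data. The exact sizes depend on the data, so none are quoted here.

```python
import gzip

# Invented sample: the same row repeated 1000 times.
row = "<item><name>statue</name><weight>20</weight></item>\n"
xml_text = "<items>\n" + row * 1000 + "</items>\n"

raw = xml_text.encode("utf-8")
packed = gzip.compress(raw)

# Print the size before and after compression.
print(len(raw), len(packed))

# Decompressing gives back exactly the original bytes.
restored = gzip.decompress(packed)
print(restored == raw)  # True
```

Real files are less repetitive than this sample, so expect a smaller saving.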

When you read the XML document, you read it as a document or a transaction type file. A document is read as one whole file, decoded into one XML structure, then used based on the final result of the decoding.
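As a small illustration of the document style, the following Python sketch decodes a whole document into one structure and only then uses the result. The letter document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
document = """<letter>
  <to>Customer</to>
  <body><p>Hello.</p><p>Goodbye.</p></body>
</letter>"""

# The whole document is decoded into one tree before anything is used.
root = ET.fromstring(document)

# Only now is the decoded structure queried.
paragraphs = [p.text for p in root.iter("p")]
print(root.tag, paragraphs)  # letter ['Hello.', 'Goodbye.']
```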

An XML transaction file has lots of rows, or transactions, that can be processed as they are read. You could also have a stream of rows/transactions with no end. The overall file or stream has some XML at the start and end to define the file or stream. Each row/transaction is a little bit of XML in the format indicated by the XML at the start of the file/stream. Each row/transaction might trigger an update to a system.
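The row by row style might look like the following Python sketch, which uses iterparse from the standard library to handle each item as soon as its end tag has been read. The sample data is invented, and an in-memory buffer stands in for a file or network stream.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for a long file or stream (invented data).
stream = io.BytesIO(b"""<file>
<items>
<item><name>statue</name><weight>20</weight></item>
<item><name>door</name><weight>35</weight></item>
</items>
</file>""")

processed = []
for event, element in ET.iterparse(stream, events=("end",)):
    # An "end" event fires when the closing tag of an element has been read.
    if element.tag == "item":
        row = (element.findtext("name"), int(element.findtext("weight")))
        processed.append(row)  # a real system might apply an update here
        element.clear()        # drop the row's children to limit memory use

print(processed)  # [('statue', 20), ('door', 35)]
```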

The decoding continues until there is an error, then either stops or skips the error. For a stop, you have code to restart where the decoding stopped. For a skipped row in a continuous stream, you might put the skipped row in a separate log where a program or a person can manually process the skipped row.
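Both behaviours can be sketched in a few lines of Python. The assumptions here are that a row with well-formed XML but unusable data is skipped and logged, and that damaged XML stops the decoding because the parser cannot continue past it. The process function and the sample rows are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def process(stream):
    """Return (processed rows, skipped rows, error message or None)."""
    processed, skipped = [], []
    try:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag != "item":
                continue
            name = element.findtext("name")
            try:
                processed.append((name, int(element.findtext("weight"))))
            except (TypeError, ValueError):
                # Well-formed XML but unusable data: skip the row and log it.
                skipped.append(ET.tostring(element, encoding="unicode").strip())
            element.clear()
    except ET.ParseError as error:
        # Damaged XML: the parser cannot continue, so stop here.
        return processed, skipped, str(error)
    return processed, skipped, None

data = b"""<items>
<item><name>statue</name><weight>20</weight></item>
<item><name>door</name><weight>heavy</weight></item>
<item><name>gate</name><weight>50</weight></item>
<item><name>broken</name><weight>5</item>
</items>"""

processed, skipped, error = process(io.BytesIO(data))
print(processed)          # [('statue', 20), ('gate', 50)]
print(len(skipped))       # 1
print(error is not None)  # True
```

The number of rows already processed tells restart code where to resume, and the skipped list plays the role of the separate log.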

Before XML

Before XML, people used formats like CSV, Comma Separated Values, but they worked mostly for simple lists. XML works for both simple lists and complicated structures.

CSV-based software often made stupid changes, like replacing the comma with a tab. We ended up with a mess of incompatible "CSV" files.

After XML

The JSON format is promoted as a replacement for XML but JSON works reliably only for simple data. JSON has no attributes and no comments, and it is unforgiving of hand editing. Whitespace between values is ignored, but a single missing quote or stray trailing comma makes the whole file invalid, so the slightest editing can really screw up the JSON data.

The current main use of JSON is to send data from a Web site to a JavaScript-based front end and accept data back on the server. Both JavaScript and PHP have JSON processors that work together.

How does XML work?

If you have created HTML, you have created something similar to XML. HTML uses elements defined by start and end tags, the same as XML. A tag starts with a less than sign, <, and ends with a greater than sign, >. XML uses the same convention because it is an SGML convention.

Like HTML, XML elements have a start and an end tag. In HTML, a paragraph is indicated by <p></p>. A paragraph containing "Hello." would be <p>Hello.</p>. XML uses the same idea of a start and end tag. When you look in files from LibreOffice, you see paragraphs that look like an HTML paragraph with many more options.

XML also accepts empty elements similar to the HTML <br />. Almost everything in HTML is accepted in XML but HTML processing is less strict so that a Web browser is less likely to fail when a Web site provides damaged HTML. An XML processor treats any markup that is not well formed as an error and stops.
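The difference in strictness is easy to demonstrate. In this Python sketch, the is_well_formed helper is a hypothetical name. It reports whether the standard library parser accepts a fragment that a Web browser would happily display.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True when the XML parser accepts the text without error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>Hello.<br /></p>"))  # True
print(is_well_formed("<p>Hello.<br></p>"))    # False: <br> is never closed
print(is_well_formed("<p>Hello."))            # False: <p> is never closed
```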

XML lets you call an element anything so long as the name is consistent within a file. As an example, a paragraph could be <p></p> or <par></par> or <paragraph></paragraph>. When you formally define a file format in a DTD, the element name is locked to whatever is in the DTD.

XML elements can have attributes. If element abc can have a weight attribute, you can write something like <abc weight="5"></abc>. Early versions of HTML were SGML-based and looked like XML with a small amount of non-XML. XHTML was HTML defined as valid XML. HTML5 went back to a looser, SGML-style syntax that is not valid XML. XHTML required something like <abc selected="selected"> while HTML5 allows <abc selected>.
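Reading attributes might look like the following Python sketch, which reuses the abc element from the text. The colour attribute and its fallback value are invented to show a default. The last lines show that the HTML5 short form is rejected by an XML parser.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<abc weight="5" selected="selected"></abc>')

# Attribute values are always text, so convert when a number is needed.
weight = int(element.get("weight"))
print(weight)  # 5

# A missing attribute returns the supplied fallback value.
colour = element.get("colour", "none")
print(colour)  # none

# The HTML5 short form, an attribute with no value, is not valid XML.
try:
    ET.fromstring("<abc selected></abc>")
    short_form_accepted = True
except ET.ParseError:
    short_form_accepted = False
print(short_form_accepted)  # False
```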

What XML does not do

XML does not define what your data does. Assume you have an element that describes the weight of an item. You have the following XML.

<item><name>statue</name><weight>20</weight></item>

What does the 20 in weight mean? Kilograms, grams, tonnes, pounds? You can expand your XML something like the following, with an attribute defining the type of weight.

<item><name>statue</name><weight type="kilograms">20</weight></item>

Adding the attribute "type" works when the file contains many items and each item can have a different type of weight. If every weight is always kilograms, you can leave out the type attribute and define the weight as always kilograms. If you create a DTD, you can put that type of information in your DTD.

You can also add specifications at the start of an XML document or file. The following example is a file full of items. The "weights" element is defined before the items so that the item processor knows what type of weight is used in that file.

<file>
<weights>kilograms</weights>
<items>
<item><name>statue</name><weight>20</weight></item>
<item><name>door</name><weight type="kilograms">20</weight></item>
</items>
</file>
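An item processor for that kind of file could be sketched as follows in Python. It reads the file-level "weights" element first, then lets a "type" attribute on an individual weight override it. The second item's "pounds" value is an assumption added to show the override, and the XML is embedded as a string so the sketch stands alone.

```python
import xml.etree.ElementTree as ET

document = """<file>
<weights>kilograms</weights>
<items>
<item><name>statue</name><weight>20</weight></item>
<item><name>door</name><weight type="pounds">44</weight></item>
</items>
</file>"""

root = ET.fromstring(document)

# The file-level default, read before any item is processed.
default_unit = root.findtext("weights")

items = []
for item in root.iter("item"):
    weight = item.find("weight")
    # A type attribute on the weight overrides the file-level default.
    unit = weight.get("type", default_unit)
    items.append((item.findtext("name"), int(weight.text), unit))

print(items)  # [('statue', 20, 'kilograms'), ('door', 44, 'pounds')]
```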

PHP XML modules

PHP provides several modules for reading XML. You decide to read the XML as a one off document or to process the XML as a long running file or as a stream. There is more than one PHP option for each choice.

The PHP XML Parser module is the most reliable code for small to medium files.

PHP provides the DOM module to read XML as a document, not a long running file or stream.

SimpleXML works for reliable XML sources but fails on errors that XML Parser will tolerate.

XMLReader reads from an XML stream and XMLWriter writes to an XML stream. Each row/transaction is called a node.

XSL reads an XSLT file, a template, then converts an XML file from one XML format to another XML format, based on the XSLT. An XSLT is easy to create for simple conversions and difficult to create for anything else.

PHP uses the libxml library for most XML coding and decoding in the DOM module and most of the other PHP XML modules. You can use the libxml functions directly to inspect XML errors.

Read more about PHP XML modules in PHP XML options.

Read more