There's been a lot of discussion on the xml-dev mailing list recently about the future of XML.  I see a number of different possible directions.  I'll give each of these possible directions a simple name:

  • XML 2.0 - by this I mean something that is intended to replace XML 1.0, but has a high degree of backward compatibility with XML 1.0;
  • XML.next - by this I mean something that is intended to be a more functional replacement for XML, but is not designed to be compatible (however, it would be rich enough that there would presumably be a way to translate JSON or XML into it);
  • MicroXML - by this I mean a subset of XML 1.0 that is not intended to replace XML 1.0, but is intended for contexts where XML 1.0 is, or is perceived as, too heavyweight.

I am not optimistic about XML 2.0. There is a lot of inertia behind XML, and anything that is perceived as changing XML is going to meet with heavy resistance.  Furthermore, backwards compatibility with XML 1.0 and XML Namespaces would limit the potential for producing a clean, understandable language with really substantial improvements over XML 1.0.

XML.next is a big project, because it needs to tackle not just XML but the whole XML stack. It is not something that can be designed by a committee from nothing; there would need to be one or more solid implementations that could serve as a basis for standardization.  Also given the lack of compatibility, the design will have to be really compelling to get traction. I have a lot of thoughts about this, but I will leave them for another post.

In this post, I want to focus on MicroXML. One obvious objection is that there is no point in doing a subset now, because the costs of XML complexity have already been paid.  I have a number of responses to this. First, XML complexity continues to have a cost even when XML parsers and other tools have been written; it is an ongoing cost to users of XML and developers of XML applications. Second, the main appeal of MicroXML should be to those who are not using XML, because they find XML overly complex. Third, many specifications that support XML are in fact already using their own ad-hoc subsets of XML (e.g. XMPP, SOAP, E4X, Scala). Fourth, this argument applied to SGML would imply that XML was pointless.

HTML5 is another major factor. HTML5 defines an XML syntax (i.e. XHTML) as well as an HTML syntax. However, there are a variety of practical reasons why XHTML, by which I mean XHTML served as application/xhtml+xml, isn't common on the Web. For example, IE doesn't support XHTML; Mozilla doesn't incrementally render XHTML.  HTML5 makes it possible to have "polyglot" documents that are simultaneously well-formed XML and valid HTML5.  I think this is potentially a superb format for documents: it's rich enough to represent a wide range of documents, it's much simpler than full HTML5, and it can be processed using XML tools. There's a W3C WD for this. The WD defines polyglot documents in a slightly different way, requiring them to produce the same DOM when parsed as XHTML as when parsed as HTML; I don't see much value in this, since I don't see much benefit in serving documents as application/xhtml+xml.  The practical problem with polyglot documents is that they require the author to obey a whole slew of subtle lexical restrictions that are hard to enforce using an XML toolchain and a schema language. (Schematron can do a bit better here than RELAX NG or XSD.)

So one of the major design goals I have for MicroXML is to facilitate polyglot documents.  More precisely the goal is that a document can be guaranteed to be a valid polyglot document if:

  1. it is well-formed MicroXML, and
  2. it satisfies constraints that are expressed purely in terms of the MicroXML data model.

Now let's look in detail at what MicroXML might consist of. (When I talk about HTML5 in the following, I am talking about its HTML syntax, not its XML syntax.)

  • Specification. I believe it is important that MicroXML has its own self-contained specification, rather than being defined as a delta on existing specifications.
  • DOCTYPE declaration. Clearly the internal subset should not be allowed.  The DOCTYPE declaration itself is problematic. HTML5 requires valid HTML5 documents to start with a DOCTYPE declaration.  However, HTML5 uses DOCTYPE declarations in a fundamentally different way to XML: instead of referencing an external DTD subset which is supposed to be parsed, it tells the HTML parser what parsing mode to use.  Another factor is that almost the only thing that the XML subsets out there agree on is to disallow the DOCTYPE declaration.  So my current inclination is to disallow the DOCTYPE declaration in MicroXML. This would mean that MicroXML does not completely achieve the goal I set above for polyglot documents. However, you would be able to author a <body> or a <section> or an <article> as MicroXML; this would then have to be assembled into a valid HTML5 document by a separate process (albeit a very simple one). It would be great if HTML5 provided an alternate way (using attributes or elements) to declare that an HTML document be parsed in standards mode. Perhaps a boolean "standard" attribute on the <meta> element?
  • Error handling. Many people in the HTML community view XML's draconian error handling as a major problem.  In some contexts, I have to agree: it is not helpful for a user agent to stop processing and show an error, when a user is not in a position to do anything about the error. I believe MicroXML should not impose any specific error handling policy; it should restrict itself to specifying when a document is conforming and specifying the instance of the data model that is produced for a conforming document. It would be possible to have a specification layered on top of MicroXML that would define detailed error handling (as for example in the XML5 specification).
  • Namespaces. This is probably the hardest and most controversial issue. I think the right answer is to take a deep breath and just say no. One big reason is that HTML5 does not support namespaces (remember, I am talking about the HTML syntax of HTML5). Another reason is that the basic idea of binding prefixes to URIs is just too hard; the WHATWG wiki has a good page on this. The question then becomes how MicroXML handles the problems that XML Namespaces addresses. What do you do if you need to create a document that combines multiple independent vocabularies? I would suggest two mechanisms:
    • I would support the use of the xmlns attribute (not xmlns:x, just bare xmlns). However, as far as the MicroXML data model is concerned, it's just another attribute. It thus works in a very similar way to xml:lang: it would be allowed only where a schema language explicitly permits it; semantically it works as an inherited attribute; it does not magically change the names of elements.
    • I would also support the use of prefixes.  The big difference is that prefixes would be meaningful and would not have to be declared.  Conflicts between prefixes would be avoided by community cooperation rather than by namespace declarations.  I would divide prefixes into two categories: prefixes without any periods, and prefixes with one or more periods.  Prefixes without periods would have a lightweight registration procedure (i.e. a mailing list and a wiki); prefixes with periods would be intended for private use only and would follow a reverse domain name convention (e.g. com.jclark.foo). For compatibility with XML tools that require documents to be namespace-well-formed, it would be possible for MicroXML documents to include xmlns:* attributes for the prefixes they use (and a schema could require this). Note that these would be attributes from the MicroXML perspective. Alternatively, a MicroXML parser could insert suitable declarations when it is acting as a front-end for a tool that expects a namespace-well-formed XML infoset.
  • Comments. Allowed, but restricted to be HTML5-compatible; HTML5 does not allow the content of a comment to start with > or ->.
  • Processing instructions. Not allowed. (HTML5 does not allow processing instructions.)
  • Data model.  The MicroXML specification should define a single, normative data model for MicroXML documents. It should be as simple as possible:
    • The model for a MicroXML document consists of a single element.
    • Comments are not included in the normative data model.
    • An element consists of a name, attributes and content.
    • A name is a string. It can be split into two parts: a prefix, which is either empty or ends in a colon, and a local name.
    • Attributes are a map from names to Unicode strings (sequences of Unicode code-points).
    • Content is an ordered sequence of Unicode code-points and elements.
    • An element probably also needs to have a flag saying whether it's an empty element. This is unfortunate, but HTML5 does not treat an empty element as equivalent to a start-tag immediately followed by an end-tag: elements like <br> cannot have an end-tag, and elements that can have content such as <a> cannot use the empty element syntax even if they happen to be empty. (It would be really nice if this could be fixed in HTML5.)
  • Encoding. UTF-8 only. Unicode in the UTF-8 encoding is already used for nearly 50% of the Web. See this post from Google.  XML 1.0 also requires support for UTF-16, but UTF-16 is not in my view used sufficiently on the Web to justify requiring support for UTF-16 but not other more widely used encodings like US-ASCII and ISO-8859-1.
  • XML declaration. Not allowed. Given UTF-8 only and no DOCTYPE declarations, it is unnecessary. (HTML5 does not allow XML declarations.)
  • Names. What characters should be allowed in an element or attribute name? I can see three reasonable choices here: (a) XML 1.0 4th edition, (b) XML 1.0 5th edition or (c) the ASCII-only subset of XML name characters (same in 4th and 5th editions). I would incline to (b) on the basis that (a) is too complicated and (c) loses too much expressive power.
  • Attribute value normalization. I think this has to go.  HTML5 does not do attribute value normalization. This means that it is theoretically possible for a MicroXML document to be interpreted slightly differently by an XML processor than by a MicroXML processor.  However, I think this is unlikely to be a problem in practice.  Do people really put newlines in attribute values and rely on their being turned into spaces?  I doubt it.
  • Newline normalization. This should stay.  It makes things simpler for users and application developers.  HTML5 has it as well.
  • Character references.  Without DOCTYPE declarations, only the five built-in character entities can be referenced. Things could be simplified a little by allowing only hex or only decimal numeric character references, but I don't think this is worthwhile.
  • CDATA sections. I think it is best to disallow these. (HTML5 allows CDATA sections only in foreign elements.) XML 1.0 does not allow the three-character sequence ]]> to occur in content. This restriction becomes even more arbitrary and ugly when you remove CDATA sections, so I think it is simpler just to require > to always be entered using a character reference in content.
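
To make the data model bullet above a little more concrete, here is a minimal sketch in Python. The names (Element, split_name, the empty flag) are illustrative choices of mine, not part of any specification, and text is grouped into str chunks rather than stored as individual code points.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Element:
    """One MicroXML element: a name, attributes, and content."""
    name: str
    # Attributes are a map from names to Unicode strings.
    attributes: Dict[str, str] = field(default_factory=dict)
    # Content is an ordered sequence of text and child elements.
    content: List[Union[str, "Element"]] = field(default_factory=list)
    # Records whether the element used the empty-element syntax (e.g. <br/>).
    empty: bool = False

    def split_name(self) -> Tuple[str, str]:
        """Split the name into (prefix, local); the prefix is '' or ends in ':'."""
        prefix, colon, local = self.name.rpartition(":")
        return prefix + colon, local


# A document is just its single root element; comments are not modelled.
# Note that xmlns is an ordinary attribute here, with no effect on names.
doc = Element("article", {"xmlns": "http://www.w3.org/1999/xhtml"}, [
    Element("p", {}, ["Hello", Element("br", empty=True), "world"]),
])
```

Constraints for polyglot documents could then be written as plain functions over this structure, for example checking that the empty flag is set exactly on elements such as br.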

Here's a complete grammar for MicroXML (using the same notation as the XML 1.0 Recommendation):

# Documents
document ::= (comment | s)* element (comment | s)*
element ::= startTag content endTag
          | emptyElementTag
content ::= (element | comment | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= name s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                 | "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<'|'&')
# Data characters
dataChar ::= char - ('<'|'&'|'>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Comments
comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# Enforce the HTML5 restriction that comments cannot start with '>' or '->'
commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
# As in XML 1.0
commentContentContinue ::= (char - '-') | ('-' (char - '-'))
# Names
name ::= (simpleName ':')? simpleName
simpleName ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
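
As a rough illustration of how small the lexical rules are, the comment and charRef productions above can be transcribed into regular expressions, and the dataChar and attributeValueChar rules into escaping helpers. This is only a sketch in Python: the function names are hypothetical, and the patterns do not enforce the char production (control characters, surrogates, #xFFFE and #xFFFF are not rejected).

```python
import re

# comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# The content may not begin with '>' or '->', and may not contain '--'.
COMMENT = re.compile(r"<!--(?:(?:[^->]|-[^->])(?:[^-]|-[^-])*)?-->")

# charRef ::= decCharRef | hexCharRef | namedCharRef
CHAR_REF = re.compile(r"&#[0-9]+;|&#x[0-9a-fA-F]+;|&(?:amp|lt|gt|quot|apos);")


def is_comment(text: str) -> bool:
    """True if text is exactly one comment allowed by the grammar."""
    return COMMENT.fullmatch(text) is not None


def is_char_ref(text: str) -> bool:
    """True if text is exactly one character reference."""
    return CHAR_REF.fullmatch(text) is not None


def escape_text(text: str) -> str:
    """Escape element content: dataChar excludes '<', '&' and '>'."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(value: str) -> str:
    """Escape a double-quoted attribute value: '<', '&' and '"' are excluded."""
    return value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
```

Under these patterns, a comment such as `<!-->-->` is rejected while `<!---->` is accepted, which matches the two comment productions.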


Will Gilreath said...

Hi James:

I found your blog post interesting. Do you have any examples of the before and after (my apologies for the allusion to the old commercials that talk about losing weight, or getting rid of wrinkles)? I always thought XML 1.0 was "bloated" and needed simplification.

Thanks, my best!

William Gilreath

James Clark said...

See also discussion on xml-dev.

David said...

Another factor is that almost the only thing that the XML subsets out there agree on is to disallow the DOCTYPE declaration

They may be phrased that way but mostly they want to get rid of dtd. I would have thought that allowing a doctype with no internal or external subset would be OK and make things easier wrt html5, so

<!DOCTYPE foo>

would be well formed.

James Clark said...

For some reason, I thought that either an external or internal subset was required, but checking the XML Rec again, I see that it is allowed to have neither. That definitely swings me in the direction of allowing just this kind of DOCTYPE declaration.

One downside is that it complicates the data model. You now need a separate root or document object (unless you require the DOCTYPE declaration).

Sean said...

It's great to see this initiative take shape: http://goo.gl/48w5t.

As for what should be in it (such as doctype) I believe that anything that complicates the data model for MicroXML processors needs to bring serious benefit in order to be worth it.

For me, the most important attribute of an XML 1.0 subset is simple null-transformation through a simple data model. I.e. can I read it in, and write it back out again in a way that does not result in any surprises on the output side.

Uche Ogbuji said...

On first impression, I love this proposal. There are plenty of little bits and bobs to discuss further, but overall, it's weighted marvelously. I do want to point out that the Mozilla incremental parse issue was resolved a few years ago, as available in FF 3.5 and up.

But your point remains: the fact that this problem persisted so long had already done the damage.

John Cowan said...

On paying the price: your SGML parsers and other tools paid the price for the C/C++ ecology, but not elsewhere. It was precisely because Perl and Java and C# and Ruby and Smalltalk hackers were willing to write XML parsers that the full price has been paid for XML. That said, I do think the point that "XML needs to go where no *ML has gone before" is a good one.

I agree about the separate specification. If there is no objection, I'll discuss this with the XML Core WG, and it might become a work item.

I think an empty DOCTYPE declaration should be optional but not part of the data model (the root element name is just leftover cruft from SGML anyhow, where it actually mattered). MicroXML parsers would accept one and verify the name match, and MicroXML generators would generate it or not depending on their parameters, just like single vs. double quotes, whitespace within attributes, etc.

If there are no namespace declarations, then there needs to be a round-trip mapping between MicroXML and namespace-well-formed XML. In TagSoup, an undeclared prefix "foo" is currently mapped to the namespace name "urn:x-prefix:foo". It would be straightforward to write an RFC defining the URN scheme "xmlns-prefix" so that we could use "urn:xmlns-prefix:foo".
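
A rough sketch of that round-trip idea in Python follows. The function name is hypothetical and this is not TagSoup's code; it only shows how xmlns:* declarations could be generated from the prefixes a document uses, with the "urn:x-prefix:" form mentioned above as the default.

```python
import xml.etree.ElementTree as ET


def prefix_declarations(names, scheme="urn:x-prefix:"):
    """Return xmlns:* attributes for every prefix used in the given names."""
    decls = {}
    for name in names:
        prefix, colon, _local = name.partition(":")
        # Unprefixed names need nothing; xml: is predeclared in XML
        # Namespaces, and xmlns:* attributes are themselves declarations.
        if colon and prefix not in ("xml", "xmlns"):
            decls["xmlns:" + prefix] = scheme + prefix
    return decls


# Declare the prefixes used by one element, then hand the result to a
# namespace-aware XML parser.
decls = prefix_declarations(["xf:input", "ref", "xml:lang"])
attrs = "".join(f' {k}="{v}"' for k, v in sorted(decls.items()))
root = ET.fromstring(f'<xf:input{attrs} ref="name"/>')
print(root.tag)  # prints {urn:x-prefix:xf}input
```

Going the other way, a serializer would drop any declaration whose namespace name is just the scheme followed by the prefix.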

According to the HTML5 FAQ, inherently-empty elements may be written using start-tag syntax or empty-tag syntax. I have to think more about the question of emptiness.

I have taken advantage of attribute value normalization for attributes of type list { anyURI }, as an ad hoc folding measure. I agree that this is mostly an edit/display issue, though, and I wouldn't be troubled to see it go.

Note that although 50% of the Web's documents are UTF-8, another 20% are ASCII, which are also UTF-8 by definition. So the true breakdown is: UTF-8 70%, ISO 8859-1 and friends 20%, all else 10%. UTF-16 is about 0.02%.

For simplicity and uniformity, I would disallow > in attribute values as well as character content.

I think there needs to be a mapping to JSON. The simplest approach is just to say that JSON documents are represented in MicroXML using elements named object, array, number, boolean, string, and null. More cleverly, the mapping can say that elements with a json:type attribute of object etc. are mapped to the appropriate JSON values, making JSON an architecture of MicroXML. As a third alternative, use xsi:type and xsi:nil attributes, giving XML Schema access to the JSON types.
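
The simplest of these mappings might look like the following Python sketch. The element names are the ones listed above; carrying each object member's key in a name attribute is purely my assumption for illustration, since the mapping of keys is not settled here.

```python
import json
import xml.etree.ElementTree as ET


def to_element(value):
    """Map a decoded JSON value to an element named after its JSON type."""
    if value is None:
        return ET.Element("null")
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        el = ET.Element("boolean")
        el.text = "true" if value else "false"
        return el
    if isinstance(value, (int, float)):
        el = ET.Element("number")
        el.text = json.dumps(value)
        return el
    if isinstance(value, str):
        el = ET.Element("string")
        el.text = value
        return el
    if isinstance(value, list):
        el = ET.Element("array")
        el.extend([to_element(item) for item in value])
        return el
    if isinstance(value, dict):
        el = ET.Element("object")
        for key, member in value.items():
            child = to_element(member)
            child.set("name", key)  # assumed convention for member keys
            el.append(child)
        return el
    raise TypeError(f"not a JSON value: {value!r}")


root = to_element(json.loads('{"a": [1, true, null], "b": "text"}'))
```

The json:type and xsi:type alternatives would keep application element names and move the type into an attribute instead.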

AngleBracket said...

> Attribute value normalization...
> "Do people really put newlines in attribute
> values and rely on their being turned into spaces?
> I doubt it."

They may not put them in themselves, but pretty much anyone using Emacs/psgml/xxml will find attribute values containing spaces being wrapped to newlines and TABs whether they want it or not. They don't put them there: they get put in for them.

But normalization should be a function of the processing (eg XPath normalize-space()), not a function of the spec.

As you just posted on c.t.t, this is a data-oriented proposal; as a document-head, I'll carry on using XML 1.x, and just generate MicroXML-or-whatever as and when I need to.


Anonymous said...

Let's call it TinyXML to stay in line with SVG. ;)

John Cowan said...

(Feel free to delete the duplicated comment.)

Another point: #xD should be removed from the definition of whitespace, because newline normalization changes explicit CRs into LFs anyway. The only place in which it matters in XML 1.x is in attribute normalization, which isn't present in MicroXML anyhow.

(Personal note: Years ago, when I mentioned to James that #xD never needed to be escaped on output, he showed me this counterexample. This led me to formulate the Law of James Clark.)

Anonymous said...

This is just irritating. Please leave XML as it is today.

Leigh said...

We've had quite a bit of success with XForms implementations in the browser by using XHTML+XForms markup to provide MVC (data binding, logic layer, presentation layer, submission, events, etc) and then in-browser implementations such as JavaScript DOM walking (Ubiquity XForms) or in-browser XSLT PI (AgenceXML XSLTForms). The run-time then leverages the browser components (JavaScript, HTML4, HTML5, SVG, what-have-you) but gives you a clean model-based approach.

Right now this relies on namespaces to separate the vocabularies of XForms and XHTML; either prefixes (xf:input, xf:submission) or default ns (...).

It'd be nice to have a way of authoring mixed-vocabulary documents where this technical approach to converting markup into working code still works. I'm not stuck on namespaces, but concerned that the extensibility I see in HTML5 is limited to "Oh, we like SVG so it's OK". The DTD-based approach outlined here seems to enshrine that conviction further.

Leigh Klotz
Co-Chair W3C Forms Working Group

uche said...

Oh dear. Re: what AngleBracket said, I'm not interested in a data XML. I use JSON for that. I'm also a doc head, and I think XML simplification can be suitable for doc heads as well. But if data-orientation becomes an explicit goal, I worry about the effect on MicroXML, at least from my perspective.

James Clark said...

I agree. I don't see anything data-oriented about MicroXML. The niche I see for MicroXML is a simple format for documents. As you say, JSON already makes a nice, simple format for data.

James Clark said...

@John Cowan
I agree with most of your comments, but I'm not convinced about disallowing > in attributes.

One reason is compatibility with existing XML serializers. My guess is that most existing XML serializers, given an XML document that could be serialized as MicroXML, will create well-formed MicroXML. I suspect disallowing > in attributes would break this. At least this is true of the serializers I've written.

Another reason is compatibility with canonical XML. At the moment any canonical XML document is well-formed MicroXML, provided its infoset is representable as MicroXML. Canonical XML requires the use of > in attributes.

Anonymous said...

In the spirit of your decision on namespaces, why not forego doctypes altogether, and simply have each element specify its type metadata with standard attributes (xmlns, schema/doctype, etc)?

I think it would be beneficial in the sense that this would allow a document to specify certain "chunks" of itself as specific types of metadata, thus making it easier to stitch XML fragments together into larger documents. It would also simplify the parser by removing a special case, at the possible cost of breaking backwards compatibility if you remove the ability to start an element's name with an exclamation mark.

I mean, if you are going to toss away the concept of strict parsing/error-handling, then there's no need to specify a doctype upfront. You are really only using it to provide metadata that's not used for parsing, but rather for validation or to imply intent.

Anonymous said...

It is hard not to sympathize with every item on the list as well as every comment, having experienced these pains and more at one point or another.

But the question is, can this sort of effort become usable? At heart I was always really a fan of the whole XML spec, so "This is just irritating. Please leave XML as it is today." feels at least as valid as other comments.

If you are going to seriously propose a subset of XML with use that is wide-spread enough to be significant, you need to avoid the attitude that people can choose to like your own particular subset, or just live with XML classic as it is today. Or at that point, you are just another person defining a private convenient subset that will not reduce the pain.

You have to make this new subset its own normative/canonical XML, such that anyone can reasonably subscribe to it, with a way of mapping virtually anything down to that level, including even such things as:

Starting with the most unpopular thing that I use frequently: Parameter/general/external entities. The only thing bad about them is the non-standard syntax that seems to make people hate them and the way they are inflexibly married into the processing model. I like what Java communities do as an alternative with ${foo} style substitutions, but it needs to be integrated before the validation is applied and spoiled by unsubstituted values, and this sort of thing needs to be standardized or entities continue to look attractive for this sort of operation. XInclude for external entities is fine if I can get them when I need them, but I have yet to see them supported anywhere I need them, whereas external entities are usually everywhere I need them, i.e. breaking up a monolithic XML configuration file of an arbitrary program without having to rewrite the program.

Namespaces: I don't see how you avoid in-document prefix declarations, but as long as they are kept confined to the root element, are they so bad? It is not uncommon to have a schema that is a combination of specific parts from one namespace and common parts from at least one common library schema. Do I really have to register my organization's common prefix so that I can draw some elements from a common set in a separate module using a prefix, as opposed to having to set a new default xmlns every time I use it or resort to reverse domain verbosity every time I want to draw from the new namespace?

What is so harmful about processing instructions as long as those who do not use them are free to ignore them? It is great that most processors DO ignore them as comments. Languages gain this sort of declarative feature (i.e. Java annotations or XML Schema annotations), because it feels silly to start putting this sort of information into comments everywhere (like in C++) and make comments part of the processing model. How do you get rid of the feature without resolving the need? Is it just the syntax you object to? Are you going to provide us with something to accomplish the same thing, i.e. a standard prefix that is tolerated or ignored by most validators that only want to enforce real structure?

As much as we all tend to hate attribute normalization, who puts things into attributes these days that suffer under normalization? It may make as much sense to eliminate attributes altogether. Why do we need this duality of elements and attributes without any commonly adopted standard of when to use one or the other? Only because of mistakes made in the definition of child elements such as not supporting a shortened form for marking the end of content (I think SGML may have had which is a step in the right direction) and the difficulty of distinguishing (unordered) structural content from sequential content, but those are problems anyway when child content occurs that for other reasons does not fit well into attributes.

Can you define the exact motivation which justifies your decisions and the mapping from more general XML use cases, instead of just citing the common good of some unspecified group, without any detail?

ormaaj said...

I pretty much agree with Anne Kesteren's blog post - if it ain't broke, don't fix it. What problem is solved by defining a subset that does nothing but restrict functionality? Nobody said you have to use or even implement every obscure extension. Document authors are already free to use whatever subset they choose. Of course XML is going to be huge and complicated; it's eXtensible. In a nutshell, I'm all for relaxing hard requirements and deprecating cruft conservatively, but very against imposing limitations on the existence of features that do no harm. The solution is to keep the core "Must" parts of the language small while avoiding "Must Nots", which is what MicroXML seems to be adding a whole lot of.

The only criticism of XML for use in document markup that holds any water is the mandatory well-formedness behavior. Everybody knows that in contexts where the point isn't to forbid the user from accessing their data, the correct behavior would be to warn and optionally try doing something sensible to display what's available. None of that is any reason to throw the baby out with the bath water and would be easy to fix from within XML without breaking applications where strict parsing is critical as a parity check.

Namespaces aren't confusing or unusual. People are used to "import x as y" in Python and Haskell; I think we can handle namespaces. They are an important mechanism for extensibility when delimiting multiple languages. The wrong thing to do would be to rip it out and replace it with another incompatible mechanism which does the exact same thing.

There's also not much point in forcing people not to use doctypes in the XML spec. The more serious problem is that all versions of XHTML up until the most recent working drafts of XHTML+RDFa mandate doctype use for some reason. Being required to at least extend or modify a schema language without namespace support (DTD) in addition to possibly some other more sane language you actually want to use for validation sort of defeats the purpose of modularization as a way of easing customization and extensibility.

Remove processing instructions? Again, if you don't like them, don't use them. I think the addition of <?xml-model?> might help the aforementioned modularization problems similarly to NVDL.

John Cowan said...

ormaaj: MicroXML has just as much extensibility at the instance level as XML 1.x does.

jjc: Okay, I'm convinced by the serialization argument to allow > in attribute values.

Uche: I think MicroXML as outlined is just as suitable for documents as for data, and JSON is too simple for some data. In any case, a suitable mapping between the two would be an unambiguous Good Thing.

Anonymous: The cost of full XML is not to document authors, who can indeed leave out any feature they don't need, but to library, tool, and application programmers, who cannot. There is also a cost in understanding. XML can't do SUBDOCs as SGML could, but the XML definition is something like 10% the size of the SGML definition. I doubt if MicroXML can be defined in five pages, but it would be interesting to try.

jakob said...

You should mention Tim Bray's XML-SW from 2002. Back then some people agreed but nobody cared, so why should it be different today? Either you try to live with XML as great and broken as it is, or you choose some other language. Before XML it was ASN.1 and SGML, yesterday it was XML, today it is JSON and RDF, and the day after tomorrow something else. Every new language promises a revolution, but sooner or later it always evolves into a complex monster and other languages are proposed. This is just evolution.

Peter Rushforth said...

Hi James,

Sorry I haven't followed MicroXML too much till now.

I've recently been made aware that XML is not a hypermedia format - that is, has no hypermedia affordances.

If MicroXML had those, and they were backward compatible with XML, maybe it would solve some issues on the web.

Thanks for thinking about this.