2010-12-18

More on MicroXML

There's been lots of useful feedback to my previous post, both in the comments and on xml-dev, so I thought I would summarize my current thinking.

It's important to be clear about the objectives. First of all, MicroXML is not trying to replace or change XML.  If you love XML just as it is, don't worry: XML is not going away.  Relative to XML, my objectives for MicroXML are:

  1. Compatible: any well-formed MicroXML document should be a well-formed XML document.
  2. Simpler and easier: easier to understand, easier to learn, easier to remember, easier to generate, easier to parse.
  3. HTML5-friendly, thus easing the creation of documents that are simultaneously valid HTML5 and well-formed XML.

JSON is a good, simple, extensible format for data.  But there's currently no good, simple, extensible format for documents. That's the niche I see for MicroXML. Actually, extensible is not quite the right word; generalized (in the SGML sense) is probably better: I mean something that doesn't build in tag names with predefined semantics. HTML5 is extensible, but it's not generalized.

There are a few technical changes that I think are desirable.

  • Namespaces. It's easier to start simple and add functionality later, rather than vice-versa, so I am inclined to start with the simplest thing that could possibly work: no colons in element or attribute names (other than xml:* attributes); "xmlns" is treated as just another attribute. This makes MicroXML backwards compatible with XML Namespaces, which I think is a big win.
  • DOCTYPE declaration.  Allowing an empty DOCTYPE declaration <!DOCTYPE foo> with no internal or external subset adds little complexity and is a huge help on HTML5-friendliness. It should be a well-formedness constraint that the name in the DOCTYPE declaration match the name of the document element.
  • Data model. It's a fundamental part of XML processing that <foo/> is equivalent to <foo></foo>.  I don't think MicroXML should change that, which means that the data model should not have a flag saying whether an element uses the empty-element syntax. This is inconsistent with HTML5, which does not allow these two forms to be used interchangeably. However, I think the goal of HTML5-friendliness has to be balanced against the goal of simple and easy and, in this case, I think simple and easy wins. For the same reason, I would leave the DOCTYPE declaration out of the data model.
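
The data-model point can be illustrated with a minimal Python sketch. The Element class and the to_model helper are hypothetical names, and the standard library's xml.etree.ElementTree (a full XML parser, not a MicroXML parser) stands in for the parsing step. The model has no slot for the tag syntax or for the DOCTYPE, so both spellings of an empty element produce the same value.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    # Hypothetical data model: a name, attributes, and children only.
    # There is deliberately no "used empty-element syntax" flag and no DOCTYPE.
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def to_model(node: ET.Element) -> Element:
    """Convert an ElementTree node into the sketch data model."""
    children: List[Union[Element, str]] = []
    if node.text:
        children.append(node.text)
    for child in node:
        children.append(to_model(child))
        if child.tail:
            children.append(child.tail)
    return Element(node.tag, dict(node.attrib), children)


# One document uses the empty-element tag and a DOCTYPE, the other neither.
empty_tag = to_model(ET.fromstring("<!DOCTYPE foo><foo/>"))
start_end = to_model(ET.fromstring("<foo></foo>"))
print(empty_tag == start_end)
```

Because the comparison is on the model and not on the source text, an application written against this model cannot depend on which form the author typed.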

Here's an updated grammar.

# Documents
document ::= comments (doctype comments)? element comments
comments ::= (comment | s)*
doctype ::= "<!DOCTYPE" s+ name s* ">"
# Elements
element ::= startTag content endTag
          | emptyElementTag
content ::= (element | comment | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= attributeName s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                 | "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<'|'&')
attributeName ::= "xml:"? name
# Data characters
dataChar ::= char - ('<'|'&'|'>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Comments
comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# Enforce the HTML5 restriction that comments cannot start with '-' or '->'
commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
# As in XML 1.0
commentContentContinue ::= (char - '-') | ('-' (char - '-'))
# Names
name ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
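
As a rough check of how little code the lexical rules need, here is a Python sketch of the name, attributeName and char productions. The function names are my own for illustration; the ranges are copied from the grammar above. Note that, unlike XML 1.0, a name cannot contain a colon, so the only prefixed form accepted is an "xml:" attribute name.

```python
# Code point ranges copied from the nameStartChar production.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges allowed by nameChar: digits, "-", ".", #xB7, etc.
NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]
WHITESPACE = {0x9, 0xA, 0xD, 0x20}


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(text):
    """name ::= nameStartChar nameChar*"""
    return (bool(text) and is_name_start_char(text[0])
            and all(is_name_char(ch) for ch in text[1:]))


def is_attribute_name(text):
    """attributeName ::= "xml:"? name"""
    return is_name(text[4:] if text.startswith("xml:") else text)


def is_char(ch):
    """char ::= s | ([#x21-#x10FFFF] - forbiddenChar)"""
    code_point = ord(ch)
    if code_point in WHITESPACE:
        return True
    if not 0x21 <= code_point <= 0x10FFFF:
        return False
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return not (is_surrogate or code_point in (0xFFFE, 0xFFFF))


print(is_name("foo"), is_name("svg:rect"), is_attribute_name("xml:lang"))
```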

42 comments:

Unknown said...

All right.

Given the differences between this and HTML5 (empty element syntax, arbitrary root element in doctype decl), what's the advantage of dropping processing instructions?

The advantage of keeping them is that they reduce the impetus to commit comment abuse. Give everyone a standard way to say "this is processor-specific", and they won't invent:



The various server-side web-page languages that make extensive use of processing instructions to embed code in pages (php et al) won't have to invent mutually-incomprehensible syntaxes (doesn't matter for HTML5; it doesn't see those parts of the page anyway, as a browser-focussed language).

I'd also like to see an example of multiple vocabularies in a single document. Is that still supported, here?

Unknown said...

Well, phooey.

<--#include virtual="/bplate/nav.div" -->

Double-escaping required for input of pointy brackets. Insert in the multiple blank lines in my post above.

Liam R E Quin said...

You may end up wanting to use XML Stylesheet processing instructions.

Your desiderata sound awfully similar to some goals for the Web SGML Working Group (some formal, some only informal), except change XML to SGML, and HTML to... er... HTML. It might be that a microXML would have a significant impact on XML, although it's impossible to guess right now.

John Cowan said...

Here are what I think the objectives for MicroXML should be:

0) Reaffirm the Ten Goals

1) A simple markup language whose syntax is a useful subset of XML syntax

2) A new specification independent of the XML 1.x specification

3) Documents are always well-formed XML, but not necessarily well-formed XML+Namespaces, and not necessarily with exactly the same infoset

4) Documents using the HTML+whatever vocabulary only are valid HTML

5) A single data model (like XPath, not like the Infoset)

6) A context-free mapping to and from XML 1.x+Namespaces

7) A mostly context-free mapping to and from JSON

Slogan: "To boldly go where no XML has gone before."

James Clark said...

Supporting processing instructions (PIs) would significantly complicate the data model. PIs require introducing a document object which has the root element and PIs as children. PIs also mean that elements have three kinds of children rather than two. Comments are not significant and so do not appear in the data model and do not add this kind of complexity.

You can still mix multiple vocabularies by using xmlns.
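
A sketch of how an application might read vocabulary information under this scheme. The Element class and the vocabularies function are hypothetical, and the scoping rule (an xmlns attribute applies to its element and to descendants until overridden, as the default namespace does in XML Namespaces) is an assumption about what the application chooses to do; the MicroXML data model itself just sees an ordinary attribute. The model has only two kinds of children, elements and text.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple, Union


@dataclass
class Element:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def vocabularies(element: Element,
                 inherited: Optional[str] = None
                 ) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (element name, vocabulary URI) pairs in document order.

    "xmlns" is looked up like any other attribute; inheriting it from
    the parent is an application-level convention, not parser magic.
    """
    current = element.attributes.get("xmlns", inherited)
    yield element.name, current
    for child in element.children:
        if isinstance(child, Element):
            yield from vocabularies(child, current)


XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"

doc = Element("html", {"xmlns": XHTML}, [
    Element("body", {}, [
        "A circle: ",
        Element("svg", {"xmlns": SVG}, [Element("circle", {"r": "10"})]),
    ]),
])

for name, uri in vocabularies(doc):
    print(name, uri)
```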

Anonymous said...

"XML is not going away." "easier to parse."

If XML is not going away, XML parsers are here to stay. When the parser is off-the-shelf software, it seems to me that ease of parsing isn't a good reason for having a new format.

Anonymous said...

What about xlink?

Michael Kay said...

I'm inclined to think we need to deliver more added value to make this take off.

One way to do that is to eliminate not only the unwanted features of XML 1.0 but also some of the unwanted restrictions. Let's allow nested comments, multiple top-level elements, unescaped ampersands where unambiguous, unquoted attribute values where the value contains no spaces.

I think it would also be highly desirable to try and solve the "insignificant whitespace" problem: that is, it should be evident from the instance whether whitespace in the content of an element is significant. One approach might be to omit the ">" from the start tag in the case of element-only content: <a <b/> <c/> /a>

Unknown said...

Re: #7 Mostly context free mapping to/from JSON.

I don't see anything here that simplifies the JSON mapping. I won't expound here, but MicroXML is nearly as difficult to map context-free to JSON as XML.

Re: suggestions to diverge from XML compatibility. I can see this as very tempting, but would think really hard about it. As soon as MicroXML becomes *not* parsable by XML parsers, use will become much harder because it will *require* a MicroXML parser or translator. I see this as more of a barrier to entry than a gain from making the markup simpler. At that point, why stop there?

Unknown said...

I think there's a question not being asked - who are the customers for MicroXML? In general I don't think that it's necessarily the creators of XML - while there are a few of us geeks that write XML by hand, there are decent tools out there that simplify that process significantly.

It does make XPath easier, which might be a win if there was a huge percentage of developers who used XPath outside of the context of some other language, but most XPath is performed within the context of other languages - XForms, XSLT, XQuery, Schematron, etc., where knowledge of XML formalisms is a pre-requisite.

Simplification of XML within HTML5? Perhaps it would be easier to relax the rules on namespaces (possibly via some kind of declaration within the embedding HTML5) - but this simply extends the default namespace to include these subordinate schemas.

The syntactic oddities of HTML5 relative to XML aside, this still seems to me less a need to simplify XML than a need to simplify namespaces - which I believe would have a far more beneficial effect on the XML ecosystem.

Anonymous said...

Bonsoir,

MicroXML seems to be an "XML for dummies", an offering to the HTML5/JS/JSON community. It's a fact that many Web developers tend to ignore the advantages of XML technologies (including XPath in DOM, SVG, XML fields in RDBMS, XSLT or XQuery): I'm facing it as a Web tech director and XML teacher, therefore anything that cures the problem is a great move.

Note that many Web developers are also unaware of many HTML details, for instance how browsers process DOCTYPEs or the need to encode query string ampersands in href attributes (true in both HTML and XML, but browsers are too kind). HTML5 writers should take this into account also: HTML5 could simply deprecate the use of the ugly DOCTYPE (SGML makes it mandatory, so what?) and create an html/@version, like most XML vocabularies do; a W3C primer could insist on the encoding of ampersands. Another step in convergence would be to define the &apos; in HTML5. All this would make HTML5 and XML contenders go in the same direction, and toward KISS principles.

I'm comfortable with most of James's proposals but one: the namespace limitation. The "xmlns" attribute does the trick for the inclusion of vocabularies like SVG, but I see an issue with the inclusion of RDFa in HTML pages, e.g. for Google Rich Snippets, which use a namespace (and xmlns:v) for attribute values (I know it's not the main use of namespaces, but it's RDFa's ... and Google's). At a time when RDFa could make the Web 3.0 become real, I wouldn't like to see an initiative that breaks this movement.

James Clark said...

@LaurentLM

I had forgotten about &apos;. Thanks for mentioning it.

Namespaces are a tough problem. My current feeling is that microformats are a better fit for MicroXML than RDFa. If a user is sophisticated enough to cope with RDFa, then they can probably cope with full XML (and so use the XML syntax of HTML5).

David Carlisle said...

Another step in convergence would be to define the &apos; in HTML5.

This is done. (The html5 spec picks up all the entities in xhtml+mathml including apos.)

Rick Jelliffe said...

I think you are talking syntax simplification but thinking API/infoset simplification. Why not just tackle the API/infoset issue head on?

For example, take the issue of PIs. If the problem is that you don't want an API with too many different kinds of nodes, why not just say "PIs and comments are not reported"?

Or, better, "there is a special metadata API call which reports PIs that occur before the first start-tag and after the last end-tag; all other PIs and comments are not reported by the API." This allows <?xml-stylesheet ...?> etc while still allowing a simplicist tree.

You are not really defining uXML except to get a uDOM or uInfoset.

ashmind said...

I do see some of the points, but without namespaces and strict error handling I do not like the idea.

The primary reason for me to use XHTML instead of HTML would be to facilitate validation and post-processing. Not having strict error handling does not make sense in this context.

Unknown said...

MicroXML sounds really good (in fact, at first sight it seems to me what XML should have been). UTF-8 only is a good decision, BTW.

I would also get rid of named character entities. They require parsers to have a long table of names and associated codes and do search and substitution. Use Unicode code point "numbers" and the parser will be simpler.

Actually I'd be tempted to get rid of numerical entities too. UTF-8 allows all characters to be represented, and if you want characters to lose their XML meaning (<,>,",= ... in text), I would precede them with an escape character (such as "\") as is common in other formats.

Unknown said...

Oops. I see the grammar only allows a few named character refs: 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

Not bad, then.

John Cowan said...

David Lee: It's definitely too awkward to try to round-trip MicroXML through JSON, but it's quite reasonable to be able to round-trip JSON through MicroXML in a standard way, defined up front. I think there are three plausible approaches:

1) Define a standard set of elements that represent JSON objects, arrays, numbers, booleans, and null. That's the simplest thing that could possibly work, and I incline to it.

2) Define a standard "json-type" attribute that you can add to arbitrary MicroXML elements to say what their JSON type is. Unmarked elements are strings. Some documents won't be convertible because they break JSON rules (a string in the root, e.g.)

3) Recognize both possibilities simultaneously.
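
To make approach 1 concrete, here is a Python sketch of a JSON round trip through markup. The element names (object, member, array, string, number, true, false, null) and the "name" attribute are hypothetical choices of mine, not a proposed standard, and xml.etree.ElementTree is used only as a convenient stand-in for a MicroXML parser and writer. Object keys go in an attribute because a JSON key can be any string, while an element name cannot.

```python
import json
import xml.etree.ElementTree as ET


def to_element(value):
    """Map a JSON value to an element from a fixed, hypothetical vocabulary."""
    if value is None:
        return ET.Element("null")
    if isinstance(value, bool):  # test bool before int: True is an int in Python
        return ET.Element("true" if value else "false")
    if isinstance(value, (int, float)):
        element = ET.Element("number")
        element.text = json.dumps(value)
        return element
    if isinstance(value, str):
        element = ET.Element("string")
        element.text = value
        return element
    if isinstance(value, list):
        element = ET.Element("array")
        for item in value:
            element.append(to_element(item))
        return element
    if isinstance(value, dict):
        element = ET.Element("object")
        for key, item in value.items():
            member = ET.SubElement(element, "member", {"name": key})
            member.append(to_element(item))
        return element
    raise TypeError("not a JSON value: %r" % (value,))


def from_element(element):
    """Recover the JSON value from the element produced by to_element."""
    tag = element.tag
    if tag == "null":
        return None
    if tag in ("true", "false"):
        return tag == "true"
    if tag == "number":
        return json.loads(element.text)
    if tag == "string":
        return element.text or ""
    if tag == "array":
        return [from_element(child) for child in element]
    if tag == "object":
        return {m.get("name"): from_element(m[0]) for m in element}
    raise ValueError("unexpected element: " + tag)


original = {"title": "MicroXML", "tags": ["xml", "json"], "draft": True,
            "version": 0.8, "comments": 42, "editor": None, "note": ""}
markup = ET.tostring(to_element(original), encoding="unicode")
print(markup)
print(from_element(ET.fromstring(markup)) == original)
```

The sketch ignores strings containing characters that the grammar forbids (such as U+0000) and carriage returns, which an XML parser normalizes; a real mapping would have to say what happens to those.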

John Cowan said...

There's discussion of a MicroXSD on xml-dev. Of course, full XSD and RNG will work fine with MicroXML, but I thought I would put together a MicroRNG as well, as something small enough to be readily packaged with a MicroXML parser. It does simple DTD-style structural validation and not much more, but that's most of what you need at the parser level: the rest can be performed by the application.

Stephen D Green said...

If the feature of MicroXML being able to match or map to JSON is achieved, wouldn't this mean providing a way to type not only an element's text content but also an attribute's value without the need for a schema? JSON types are strings, numbers, booleans, objects, arrays, and null, including variations of the array type. Maybe a reserved type attribute can be used (profiled perhaps) to declare the type of an element's text content within the instance, but how would the same be done for an attribute's value (within the MicroXML instance)?

Stephen D Green said...

If you wanted to write a simple schema (XSD) using MicroXML (you might call such a schema a MicroXSD schema), it seems to me that limiting MicroXML's namespace capabilities to just the xmlns attribute would mean you cannot include globals if there is no way to prefix the names of such elements, groups, attributes, etc in the XSD reference attribute. I guess in many cases it would be the simplest answer just to limit MicroXSD schemas to local elements, attributes and groups so maybe this is not enough of an issue to warrant the added pain of including namespace prefixes in MicroXML. It might be worth investigating and noting what are the main knock-on limitations this limitation in MicroXML places on any implementations of W3C XML Schema, etc.

John Cowan said...

Stephen D. Green: If MicroXML couldn't do more than JSON, there wouldn't be much point in having it. My idea is not to be able to map every MicroXML document to a unique JSON document, but rather:

1) to be able to map every JSON document to a unique MicroXML document and then recover the original JSON;

2) to be able to write MicroXML documents in such a way that they can be "downgraded" to JSON when needed, using the json-type attribute.

The so-called Java reference implementation of JSON provides such a mapping to XML, but the resulting XML isn't even guaranteed to be well-formed.
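
A sketch of what "downgrading" with a json-type attribute might look like in Python. The rules here are my assumptions, not something specified in this thread: an element marked "object" becomes a dictionary keyed by child element names, "array" becomes a list of its children, "number" and "boolean" parse the text content, and an unmarked element is a string. xml.etree.ElementTree again stands in for a MicroXML parser.

```python
import json
import xml.etree.ElementTree as ET


def downgrade(element):
    """Convert an element to a JSON-style value using its json-type attribute."""
    kind = element.get("json-type", "string")  # unmarked elements are strings
    if kind == "object":
        # Assumption: member names are the child element names.
        return {child.tag: downgrade(child) for child in element}
    if kind == "array":
        # Assumption: child element names are ignored inside arrays.
        return [downgrade(child) for child in element]
    if kind == "number":
        return json.loads(element.text)
    if kind == "boolean":
        return element.text.strip() == "true"
    if kind == "null":
        return None
    if kind == "string":
        return element.text or ""
    raise ValueError("unknown json-type: " + kind)


document = """
<person json-type="object">
  <name>Ada</name>
  <age json-type="number">36</age>
  <languages json-type="array"><language>en</language><language>fr</language></languages>
  <active json-type="boolean">true</active>
</person>
"""

value = downgrade(ET.fromstring(document))
print(json.dumps(value))
```

The name of the root element is lost in this direction, and a document with mixed content or repeated child names inside an "object" would not convert cleanly, which fits the point that not every MicroXML document needs a JSON counterpart.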

Unknown said...

John Cowan:
I think a key feature of JSON is that JSON data contains its own metadata to define/declare its datatypes (without the need for a schema). If MicroXML could match this then there would need to be a way to add corresponding datatype-related metadata to the XML, wouldn't there? If all of the JSON data is mapped to MicroXML elements then maybe that can be done using attributes to contain the type-related metadata. Could the MicroXML (mapped from some JSON) contain metadata about the type of an element's text content (without the use of a schema)? Might it need reserved attributes, etc?
But how would the MicroXML instance contain metadata about the type of an attribute's value (without a schema)?

DRRW said...

James, I'd like to see also how we can tie the OASIS CAM template work into this as well.

MicroXML and CAM seems like a very strong match...

We're getting ready to do CAM v2.0 in 2011 - so adding MicroXML support would be a natural.

Thanks, DW

DRRW said...

James, continuing the thought of MicroXML and OASIS CAM - having speed-read through the comments - one thing I have long found troubling is the "let's cram everything into the instance" mentality that rapidly becomes angle bracket complexity. By separating semantics into a template you can dramatically reduce what is being transferred on-the-wire to the bare content essentials and let the template then provide heavy lifting on parsing and interpretation nuances and even repetitive content structures and detail.

Someone mentioned audience for all this - I've been working with the NIEM community with CAM - targeting making everything dramatically simpler to do compared to XML Schema. When you give developers simpler, more intuitive and robust ways to implement XML-based information exchanges - everyone wins.

Also simpler should not mean less capable. The trick here is to deliver simple yet strong functional capability that matches 95% of business information needs; more than an 80:20 approach - but less than 100% of all possible needs; it's that striving to cover off the last 5% that adds 200% of the complexity. It's OK to say - in MicroXML we are just not going to worry about certain aspects of markup complexity - and catalogue those items as "not supported" so people understand the design limitations selected.

Thanks, DW

DRRW said...

Stephen, let's not tie ourselves to XML Schema and XSD! If you remove namespaces from the instances completely and use a CAM template as the (optional) way to add semantic context everyone wins! Then if you need to unravel what amounts to the dictionary side of XML - you reference the CAM template and it tells you contextually the semantics of the item you are interested in using XPath referencing and rules.

This decoupling is a no-brainer in my opinion - and especially as CAM templates let you automatically generate domain dictionary catalogues of components, OWL and HTML5 forms directly - using XSLT transforms - that you cannot do from the XSD equivalent.

Stephen D Green said...

For those who tend to use XSD (W3C XML Schema) with XML and would like to use a very simplified version with MicroXML too, I've produced a cut-down MicroXSD 'schema of schemas'. It constrains XSD schemas for MicroXML to just 'local' elements and attributes (to avoid having to include more than one namespace and to avoid namespace prefixes). Within those constraints, the semantics would be the same as standard XSD. I've used the schema of schemas to generate random MicroXSD schemas using my favorite XML editor and subsequently used these random schemas to generate MicroXML-esque instances without any problems. It all makes MicroXML and a MicroXML stack including profiles for schema validation look quite feasible. 'Local' element and attribute definitions in an XSD schema have the advantage of making the schema look similar to the instances they constrain, I think (a bit like CAM and Examplotron).

DRRW said...

Stephen, now you're talking! Yes - with CAM templates - we're using xslt to write XSD for developers. This is a key feature for NIEM. The raw XSD for NIEM causes most developer tooling to crash - too much recursion - too much complexity. By writing the equivalent from CAM templates in simple XSD syntax - it avoids those issues - and logically the CAM and XSD are equivalent.

So back to where you are at - generating XSD schema for MicroXML from a CAM template using xslt can work nicely - and results in dramatically simpler tasking for developers - because they do not have to be XSD syntax experts.

Since CAM is essentially WYSIWYG XML structurally - the MicroXML instance can plug straight in there - and the XSD be generated automatically. Even seems like cheating at times!

Michael said...

I cannot agree with the first paragraph. XML is indeed a good, simple, extensible format for documents. It has the enormous advantage of having large amounts of software tool infrastructure already in place. It's not perfect for all possible uses, but nothing is.

Kurt, MicroXML currently appears to be designed for people who want to build new parsers and software infrastructure, not for people who are creating, sharing, and reading documents. See for instance the difference in perspectives in the discussion of processing instructions in these comments. I agree that this does not seem like a compelling design rationale.

In the MusicXML world, for instance, UTF-16 encoding and processing instructions are absolutely necessary features. Who cares if these features may complicate life a little bit for parser implementers? The number of parser implementers is totally dwarfed by the number of people reading, writing, and sharing XML documents.

John Cowan said...

Stephen (not D. Green): The obvious answer to the attribute-type issue seems to me to be: Don't put data that needs to be JSON-visible into MicroXML attributes, or if you do, make sure it's string data only.

I've just posted my current thoughts on MicroXML and JSON.

John Cowan said...

On reflection, I think MicroXML should allow prefixed attributes. It's cleaner to have "json:type" than "json-type" in an attribute architecture, because you can have a "xmlns:json" attribute in the MicroXML to carry namespace information for XML processors without special hackery. MicroXML processors just see both "json:type" and "xmlns:json" as ordinary attributes with no magic properties, the same as "xmlns", and leave it up to the application to process them specially if at all.

John Cowan said...

I found an interesting IBM DeveloperWorks article by Parand Darugar called "Abolish XML namespaces?" Most anti-namespace screeds are just whinges, but this one isn't.

I've linked to the section about the two use cases where he considers namespaces actually worthwhile. The first is in identifying a document type, which is what MicroXML @xmlns is for (although it can be used at the top of a subtree as well). The second is namespaced attribute names, and I'll quote it selectively here:

Namespaces have a compelling use in providing unique identifiers for type information. You may have seen XML fragments such as: [...]

<cost xsi:type="xsd:float">29.95</cost>

This conveys that cost has a type, and by type I mean type as defined by XSI, and that the type is float as defined by XSD. The key point here is that you are indeed looking for unique, non-context-related identifiers for each type. You are not combining your document with the XSD or the SOAP encoding document; you are simply referring to particular elements within each specification from your document. The specification need not even be in XML — you are referring to a flat structure, simply a list of types. If you believed that the type structure was hierarchical, you would need to fully qualify the path for the type, with something like:

<cost xsi:type="xsd:/types/simple/float">29.95</cost>

[...]

Perhaps this still isn't a reasonable use case for XML namespaces, but you have glimpsed a certain amount of usefulness. The lesson can be generalized as follows: A method for associating the attributes of elements with external reference points might have value [italics in original]. The element itself does not need a namespace, but its attributes might.

Stephen D Green said...

Thinking about John's last comment, it seems to me that what makes namespaces useful to attributes is the potential use of foreign attributes. Attributes defined as part of a vocabulary do not need any additional namespace so they do not need a prefix (just a convention). Where something like 1 benefits from namespaces is when the attributes and maybe attribute values are from a foreign namespace, of course. Maybe would welcome, I think, a convention which says that such attributes need not be explicitly allowed by the vocabulary/namespace of the elements to be allowed in a MicroXML instance; the convention might be that they can just be ignored if their foreign namespace is not recognised in relation to the instance's namespace. But is that safe?

DRRW said...

Steve, ignoring foreign namespace attributes works great! Typically injection of attributes is a great way to pass processing instructions and directives that are not part of the data content. We use this in CAM for example to inject error, warning and informational parsing results. So we'd expect any handler to simply strip these out after digesting and acting on them.

Stephen D Green said...

The problem, I think, with namespaces for attributes involving prefixes is that this introduces the concept of QName and I thought one requirement for MicroXML is that it allows the XML to be treated very much like just text. A QName introduces the logic that the prefix part MUST match a prefix in a namespace declaration, doesn't it? That seems to be too much for something intended to be so simple.

John Cowan said...

QName introduces the logic that the prefix part MUST match a prefix in a namespace declaration, doesn't it?

Yes for XML with Namespaces, no for plain XML, and whatever we like for MicroXML. My proposal is to allow "foo:" in attribute names, but not to require "xmlns:foo" attributes to match them. If such an attribute is present, great; your document is XML with Namespaces compatible. In any case, attribute names are still just strings in the data model, whether they contain ":" or not.
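
The proposal above can be sketched in Python as follows. Attribute names are plain strings in the model whether or not they contain a colon; checking that each prefix has a matching "xmlns:" attribute in scope is an optional, separate step that tells you whether the document is also compatible with XML Namespaces. The Element class, the undeclared_prefixes function and the example URI are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Union


@dataclass
class Element:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)  # names are just strings
    children: List[Union["Element", str]] = field(default_factory=list)


def undeclared_prefixes(element: Element,
                        in_scope: FrozenSet[str] = frozenset()) -> Set[str]:
    """Return attribute prefixes with no matching xmlns:prefix attribute in scope.

    An empty result means the prefixed attributes are usable as-is by an
    XML Namespaces processor; a non-empty one is still fine for MicroXML.
    """
    declared = set(in_scope) | {"xml"}  # "xml" never needs a declaration
    for name in element.attributes:
        if name.startswith("xmlns:"):
            declared.add(name[len("xmlns:"):])

    missing: Set[str] = set()
    for name in element.attributes:
        prefix, colon, _local = name.partition(":")
        if colon and prefix != "xmlns" and prefix not in declared:
            missing.add(prefix)
    for child in element.children:
        if isinstance(child, Element):
            missing |= undeclared_prefixes(child, frozenset(declared))
    return missing


bare = Element("cost", {"json:type": "number"}, ["29.95"])
declared = Element("cost", {"json:type": "number",
                            "xmlns:json": "http://example.org/json"}, ["29.95"])

print(undeclared_prefixes(bare))
print(undeclared_prefixes(declared))
```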

DRRW said...

Steve, who says you need to declare namespaces?! If something has a ns: prefix to its name - then assume it is processing instructions or other semantics, separate from the data content, otherwise ignore. Simple.

Stephen D Green said...

John, David,
Yes, but wasn't the beauty of allowing the xmlns="foo" namespace attribute that it preserved compatibility with XML Namespaces? Allowing prefixes in attributes, worthy though its purpose may be, seems to me to be very costly if it breaks compatibility.

Stephen D Green said...

My previous concern that prefixes require declarations ("xmlns:foo") is answered, I now see, by the above MicroXML limit to just the special (according to the XML Namespaces spec) prefix "xml:" which the said spec says doesn't need a declaration. Maybe we need a list of all the possible xml: values and uses (or a subset of them?) and requirements for implementing them (like not rejecting them when mixed with another vocabulary/model).

John Cowan said...

I have released MicroLark 0.8, a parser/writer/tree-model package for MicroXML written in Java. It implements MicroXML as specified in this post, with the addition of prefixed attribute elements (it allows, but does not require, declarations of those prefixes).

John Cowan said...

I've put together a very preliminary MicroXML draft specification.

Anonymous said...

James, how would MicroXML fit in with Efficient XML Interchange (EXI)?