It's important to be clear about the objectives. First of all, MicroXML is not trying to replace or change XML. If you love XML just as it is, don't worry: XML is not going away. Relative to XML, my objectives for MicroXML are:
- Compatible: any well-formed MicroXML document should be a well-formed XML document.
- Simpler and easier: easier to understand, easier to learn, easier to remember, easier to generate, easier to parse.
- HTML5-friendly: easier to create documents that are simultaneously valid HTML5 and well-formed XML.
JSON is a good, simple, extensible format for data. But there's currently no good, simple, extensible format for documents. That's the niche I see for MicroXML. Actually, extensible is not quite the right word; generalized (in the SGML sense) is probably better: I mean something that doesn't build in tag names with predefined semantics. HTML5 is extensible, but it's not generalized.
There are a few technical changes that I think are desirable.
- Namespaces. It's easier to start simple and add functionality later, rather than vice-versa, so I am inclined to start with the simplest thing that could possibly work: no colons in element or attribute names (other than xml:* attributes); "xmlns" is treated as just another attribute. This makes MicroXML backwards compatible with XML Namespaces, which I think is a big win.
- DOCTYPE declaration. Allowing an empty DOCTYPE declaration <!DOCTYPE foo> with no internal or external subset adds little complexity and is a huge help with HTML5-friendliness. It should be a well-formedness constraint that the name in the DOCTYPE declaration match the name of the document element.
- Data model. It's a fundamental part of XML processing that <foo/> is equivalent to <foo></foo>. I don't think MicroXML should change that, which means that the data model should not have a flag saying whether an element uses the empty-element syntax. This is inconsistent with HTML5, which does not allow these two forms to be used interchangeably. However, I think the goal of HTML5-friendliness has to be balanced against the goal of simple and easy and, in this case, I think simple and easy wins. For the same reason, I would leave the DOCTYPE declaration out of the data model.
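To make the three points above concrete, here is a minimal sketch in Python of what such a data model and its checks might look like. The names `Element`, `check_names`, and `check_doctype` are hypothetical and chosen only for illustration; nothing here is a defined MicroXML API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """Hypothetical MicroXML element: a name, attributes, ordered children.

    There is deliberately no flag recording whether the empty-element
    syntax was used, and no slot for the DOCTYPE declaration.
    """
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def check_names(element: Element) -> None:
    """Reject colons in names, except for the prefix of xml:* attributes."""
    if ":" in element.name:
        raise ValueError(f"colon in element name: {element.name!r}")
    for attr in element.attributes:
        # "xmlns" needs no special case: it is just another attribute.
        local = attr[4:] if attr.startswith("xml:") else attr
        if ":" in local:
            raise ValueError(f"colon in attribute name: {attr!r}")
    for child in element.children:
        if isinstance(child, Element):
            check_names(child)


def check_doctype(doctype_name: str, root: Element) -> None:
    """The DOCTYPE name must match the document element's name."""
    if doctype_name != root.name:
        raise ValueError(
            f"DOCTYPE name {doctype_name!r} does not match "
            f"document element {root.name!r}"
        )


# <foo/> and <foo></foo> would both be represented by the same value.
assert Element("foo") == Element("foo", {}, [])

root = Element("html", {"xmlns": "http://www.w3.org/1999/xhtml", "xml:lang": "en"})
check_names(root)            # passes: no colons outside the xml: prefix
check_doctype("html", root)  # passes: names match
```

Because the DOCTYPE name is only checked and never stored, a parser built along these lines could discard it once the constraint has been verified.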
Here's an updated grammar.
    # Documents
    document ::= comments (doctype comments)? element comments
    comments ::= (comment | s)*
    doctype ::= "<!DOCTYPE" s+ name s* ">"

    # Elements
    element ::= startTag content endTag
              | emptyElementTag
    content ::= (element | comment | dataChar | charRef)*
    startTag ::= '<' name (s+ attribute)* s* '>'
    emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
    endTag ::= '</' name s* '>'

    # Attributes
    attribute ::= attributeName s* '=' s* attributeValue
    attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                     | "'" ((attributeValueChar - "'") | charRef)* "'"
    attributeValueChar ::= char - ('<'|'&')
    attributeName ::= "xml:"? name

    # Data characters
    dataChar ::= char - ('<'|'&'|'>')

    # Character references
    charRef ::= decCharRef | hexCharRef | namedCharRef
    decCharRef ::= '&#' [0-9]+ ';'
    hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
    namedCharRef ::= '&' charName ';'
    charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

    # Comments
    comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
    # Enforce the HTML5 restriction that comments cannot start with '>' or '->'
    commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
    # As in XML 1.0
    commentContentContinue ::= (char - '-') | ('-' (char - '-'))

    # Names
    name ::= nameStartChar nameChar*
    nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6]
                    | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF]
                    | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                    | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
                    | [#x10000-#xEFFFF]
    nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7
               | [#x0300-#x036F] | [#x203F-#x2040]

    # White space
    s ::= #x9 | #xA | #xD | #x20

    # Characters
    char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
    forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
    surrogateChar ::= [#xD800-#xDFFF]
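As a rough illustration of how small this grammar is, the `name` and `charRef` productions can be transcribed almost directly into Python regular expressions. This is only a sketch: the helper names `is_name` and `is_char_ref` are my own, and the code covers just these two productions, not a full parser.

```python
import re

# nameStartChar, written as the inside of a regex character class.
NAME_START = (
    "A-Za-z_"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# nameChar adds digits, ".", "-", #xB7 and two combining/connector ranges.
NAME_CHAR = NAME_START + "0-9.\\-\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")

# charRef ::= decCharRef | hexCharRef | namedCharRef
CHAR_REF_RE = re.compile(
    r"&#[0-9]+;|&#x[0-9a-fA-F]+;|&(?:amp|lt|gt|quot|apos);"
)


def is_name(text: str) -> bool:
    """True if text matches the 'name' production (note: no colons)."""
    return NAME_RE.fullmatch(text) is not None


def is_char_ref(text: str) -> bool:
    """True if text matches the 'charRef' production."""
    return CHAR_REF_RE.fullmatch(text) is not None


assert is_name("html") and is_name("_a-b.c1")
assert not is_name("1abc") and not is_name("svg:rect")
assert is_char_ref("&#x3C;") and is_char_ref("&amp;")
assert not is_char_ref("&nbsp;")  # only the five predefined names exist
```

Note that the colon simply never appears in either character class, which is how the grammar enforces the "no colons in names" rule; the `xml:` prefix is handled separately by the `attributeName` production.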