2008-10-17

XML 1.0 5th edition

Rather late in the day, I sent a comment in on the proposed XML 1.0 5th Edition. For some background, read Norman Walsh, John Cowan, David Carlisle and Henry Thompson.

There is a real problem here, and it's partly my fault. If we had had a bit more foresight ten years ago, we would have made the 1st edition of XML 1.0 say what is now being proposed for the 5th Edition. I know that the XML Core WG are trying to do the right thing, but I really don't think this is a good idea.

I think you've got to look at the impact of the change not just on XML 1.0 but on the whole universe of specs that are built on top of XML 1.0.  In an ideal world, all the specs that refer to XML 1.0 would have carefully chosen whether to make a dated or an undated reference to XML 1.0, and would have done so consistently and with a full consideration of the consequences of the choice.  In practice, I don't believe this has happened.  Indeed, before the 5th edition, I believe very few people would have considered that XML might make a fundamental change to its philosophy about which characters were allowed in names while still keeping the same version number.

Even W3C specs don't get this right. In particular, XML Namespaces 1.0 gets completely broken by this (as my comment explains).

Now you can argue that the breakage and chaos that the 5th edition would cause is due to bugs in the specs that reference XML 1.0. But that doesn't make the breakage any less real.

I also have the rather heretical view that the benefits of the change are small.  In terms of Unicode support, what's vitally important is that any Unicode character is allowed in attribute values and character data.  And XML 1.0 has always supported that.  This change is just about the Unicode characters allowed in element and attribute  names (and entity names and processing instruction targets).

I see relatively little use of non-ASCII characters in element and attribute names.  A user who is technical enough to deal with raw XML markup can deal with ASCII element/attribute names.  For less technical users who want to see element/attribute names in their native language, using native language markup is not a good solution, because it only allows a document or schema to be localized for a single language. An XML editor can provide a much better solution by supporting schema annotations that allow an element or attribute to be given friendly names in multiple languages.  So a Thai user editing a document using the schema can work with Thai element/attribute names, and an English user working with the same document can see English names.

This is just following basic I18N principles of storing/exchanging information in a language neutral form, and then localizing it when you present it to a particular user. (This is the same reason why it's perfectly OK from an I18N perspective for XML Schema Datatypes just to support one specific non-localized format for dates/times.)

Perhaps this is part of the reason why there was so little enthusiasm for XML 1.1, and why there seems to be little interest in doing the 5th edition change as an XML 1.2.

One case where I can see real value in adopting  more permissive rules for names is in XML Schema Datatypes, because this relates to character data and attribute values.  But it seems like you could easily fix this, without any of the problems that the 5th edition would cause, by introducing a couple of new datatypes into XML Schema.