2008-10-17

XML 1.0 5th edition

Rather late in the day, I sent in a comment on the proposed XML 1.0 5th Edition. For some background, read Norman Walsh, John Cowan, David Carlisle and Henry Thompson.

There is a real problem here, and it's partly my fault. If we had had a bit more foresight ten years ago, we would have made the 1st edition of XML 1.0 say what is now being proposed for the 5th Edition. I know that the XML Core WG are trying to do the right thing, but I really don't think this is a good idea.

I think you've got to look at the impact of the change not just on XML 1.0 but on the whole universe of specs that are built on top of XML 1.0.  In an ideal world, all the specs that refer to XML 1.0 would have carefully chosen whether to make a dated or an undated reference to XML 1.0, and would have done so consistently and with a full consideration of the consequences of the choice.  In practice, I don't believe this has happened.  Indeed, before the 5th edition, I believe very few people would have considered that XML might make a fundamental change to its philosophy about which characters were allowed in names while still keeping the same version number.

Even W3C specs don't get this right. In particular, XML Namespaces 1.0 gets completely broken by this (as my comment explains).

Now you can argue that the breakage and chaos that the 5th edition would cause is due to bugs in the specs that reference XML 1.0. But that doesn't make the breakage any less real.

I also have the rather heretical view that the benefits of the change are small.  In terms of Unicode support, what's vitally important is that any Unicode character is allowed in attribute values and character data.  And XML 1.0 has always supported that.  This change is just about the Unicode characters allowed in element and attribute names (and entity names and processing instruction targets).
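To make the scope of the change concrete, here is a rough Python sketch of the name rule as I read it in the proposed 5th Edition. The ranges are transcribed from the NameStartChar and NameChar productions of the draft and should be checked against the spec before anyone relies on them; the function names are my own and are purely illustrative.

```python
# Code point ranges transcribed from the proposed 5th Edition productions
# NameStartChar and NameChar (an assumption to verify against the spec).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Extra characters allowed after the first one: "-", ".", digits, etc.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name_5e(text):
    """Check a string against the (hypothetical transcription of the) 5e Name rule."""
    return (
        bool(text)
        and is_name_start_char(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

Under these ranges an ASCII name such as `price` is a name, `1price` is not (a digit cannot start a name), and a name written in a script added to Unicode after version 2.1, such as Ethiopic, is also accepted. Character data and attribute values are not affected by any of this.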

I see relatively little use of non-ASCII characters in element and attribute names.  A user who is technical enough to deal with raw XML markup can deal with ASCII element/attribute names.  For less technical users who want to see element/attribute names in their native language, using native language markup is not a good solution, because it only allows a document or schema to be localized for a single language. An XML editor can provide a much better solution by supporting schema annotations that allow an element or attribute to be given friendly names in multiple languages.  So a Thai user editing a document using the schema can work with Thai element/attribute names, and an English user working with the same document can see English names.
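Here is a minimal sketch of how an editor might do that. It assumes, purely for illustration, that the friendly names are stored as `xs:documentation` children carrying `xml:lang`; a real editor would more likely define its own annotation vocabulary. The schema fragment, the element name `price` and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical schema: the element name stays ASCII, the labels are localized.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:decimal">
    <xs:annotation>
      <xs:documentation xml:lang="en">Price</xs:documentation>
      <xs:documentation xml:lang="th">ราคา</xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>
"""


def friendly_names(schema_text):
    """Map each declared element name to a {language: label} dictionary."""
    root = ET.fromstring(schema_text)
    labels = {}
    for element in root.iter(f"{{{XS}}}element"):
        name = element.get("name")
        if name is None:
            continue
        per_lang = labels.setdefault(name, {})
        path = f"{{{XS}}}annotation/{{{XS}}}documentation"
        for doc in element.findall(path):
            lang = doc.get(XML_LANG)
            if lang and doc.text:
                per_lang[lang] = doc.text.strip()
    return labels


def display_name(labels, name, lang):
    """Return the label for the user's language, else the raw markup name."""
    return labels.get(name, {}).get(lang, name)


labels = friendly_names(SCHEMA)
thai_label = display_name(labels, "price", "th")
english_label = display_name(labels, "price", "en")
```

The stored document keeps a single language-neutral name, and each user sees whichever label matches their language, falling back to the raw name when no label exists.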

This is just following basic I18N principles of storing/exchanging information in a language neutral form, and then localizing it when you present it to a particular user. (This is the same reason why it's perfectly OK from an I18N perspective for XML Schema Datatypes just to support one specific non-localized format for dates/times.)

Perhaps this is part of the reason why there was so little enthusiasm for XML 1.1, and why there seems to be little interest in doing the 5th edition change as an XML 1.2.

One case where I can see real value in adopting more permissive rules for names is in XML Schema Datatypes, because this relates to character data and attribute values.  But it seems like you could easily fix this, without any of the problems that the 5th edition would cause, by introducing a couple of new datatypes into XML Schema.

9 comments:

村田 said...

I was involved in a project of the Japanese government for creating schemas. These schemas represent information interchange between the local governments and the central government.

It was completely impossible to think of English names. My team extensively used Japanese names. I do not think that these names can be translated (unless I am willing to write a paragraph for each name).

I also heard from some doctor that his project uses Japanese tag names since they cannot be translated without changing the meaning.

I also believe that naming is more difficult than composition for non-native speakers. Poor naming is a problem of my schema language proposal (RELAX Core).

MURATA Makoto

barefootliam said...

"A user who is technical enough to deal with raw XML markup can deal with ASCII element/attribute names"

I don't see that technical knowledge of computers and familiarity with the Latin script should be tied together...

I agree with your argument that attribute and element names should be translated for the purposes of a user interface, for the purpose of localisation. However, any base language could be used, as long as you can find people to translate out of it (perhaps with the help of other translations). So I think this is not in any way an argument against the change.

We may have to revise the Namespaces spec, and there are other specs, not only at W3C, that may need to be changed over time. But it would seem fairer to me to say not that the Namespaces spec is completely broken, since certainly all existing documents will continue to work fine, but that there may be an error that requires a minor revision there.

As I said in a private reply to you earlier, like you, I also originally preferred the idea of a 1.2, and was persuaded that it would not get uptake. So I see this as a compromise.

Liam

Rick Jelliffe said...

I probably agree with Tim Bray that a more thorough revision that creates something cohesive has more of a chance than this tinkering. XML 2.0.

But I tend to think that the root problem here is that while XML has version numbers, these only correspond to an overly simplistic policy: fail if you don't understand the version number.

So it would be better for XML to first be improved to a major.minor version system where a version rejects a document with an unknown major version number outright, but attempts to parse documents with higher-value minor versions. This way, an xml 1.0 system would not reject an xml 1.2 document unless there was indeed some name which wasn't allowed by XML 1.0.
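As a rough sketch of that policy in Python (the function name and the choice of 1 as the supported major version are just assumptions for illustration):

```python
import re

# Matches a "major.minor" version string made of ASCII digits only.
_VERSION = re.compile(r"([0-9]+)\.([0-9]+)")


def should_attempt_parse(declared_version, supported_major=1):
    """Decide whether a processor should try to parse a document.

    Reject outright on a malformed version or an unknown major version;
    otherwise attempt the parse whatever the minor version is, and let
    the parser report an error only if it meets a construct it does not
    actually support.
    """
    match = _VERSION.fullmatch(declared_version)
    if match is None:
        return False
    major = int(match.group(1))
    return major == supported_major
```

With this rule `should_attempt_parse("1.2")` is true for a 1.0-era processor, while `should_attempt_parse("2.0")` is false.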

Introducing this change in a new edition, then allowing it to percolate through implementations and deployments for a few years, sets us up to switch to the new naming regime without confusion.

The XML Core WG should have done this years ago: without it, they are in effect abandoning the versioning system entirely by retrofitting substantive changes without the benefit of labels. I don't know why markup people don't see clear labels as the primary tool to escape confusion: it is very bold, if not bizarre.

barefootliam said...

Rick, I agree with you about version numbers. Indeed, so does the XML Core WG, which is why 5e does what you are asking.

As for a 2.0, I think that could be a 10-year project (if you think HTML 5 is simple...), and it's not clear to me that it would get adoption.

If you have the time to read the proposed 5th edition and send comments to the mail address given there, they'll still of course be considered even though it's past the formal deadline. The 5th edition document is listed on http://www.w3.org/TR/

Liam

bio said...

Murata-san: what's wrong with using Romanized Japanese words for tag names? It's common enough for branding and signage in Japan, so why not in XML documents?

Liam: Technical knowledge of computers requires familiarity with Latin script (see: command lines, relational database table and column names, URIs). You can't unilaterally change that fact simply by changing XML.

barefootliam said...

bio, you say that technical knowledge of computers requires familiarity with Latin script. Not everyone agrees with this, although many do. However, use of XML does not in general require "technical knowledge of computers". For example, researchers in the arts and humanities often prefer to use their own language and scientific terms to describe things. Those terms may be understood very well by people in their own domains. And, of course, this change also affects IDs.

We committed to supporting natural-language markup in XML 1.0, not only English. For example, you can already use Chinese characters in XML names, as long as you are careful to stick to characters defined by Unicode 2.1, of course. The question is not whether to continue to support Unicode, but how best to do so. I accept that there is not complete agreement on which way is best, nor on which way will succeed. So far XML 5e seems to be the best compromise.

Thanks,

Liam

村田 said...

bio, the Japanese language has so many homonyms. Thus, roma-ji does not always make sense. Moreover, roma-ji is ugly and hard to read. I much prefer 市町村職員共済組合 to shichouson-shokuin-kyousai-kumiai. (This name appears in one of the schemas I created.)

Richard Ishida said...

"In terms of Unicode support, what's vitally important is that any Unicode character is allowed in attribute values and character data. And XML 1.0 has always supported that. This change is just about the Unicode characters allowed in element and attribute names (and entity names and processing instruction targets)."

There are three other issues that I think are not mentioned here, but are highly relevant.

[1] This change is also about the Unicode characters allowed in id values, for example in an anchor element in XHTML. These, in my mind, are much more likely to occur in non-Latin scripts, especially as IDN and IRIs become more prevalent. These are also constructs that are used by much less technically-minded authors. For example, I added an Ethiopic id value to the blog I point to below, and now the page doesn't validate.

[2] We have already designed XML in such a way that people are allowed to use non-ASCII element and attribute names and id values. That cat is out of the bag. We have allowed it, and we must expect that some people will want to take the spec at its word. What we are talking about in the 5th edition is more along the lines of avoiding arbitrary discrimination towards speakers of languages written with scripts that didn't make it into Unicode version 2.1, i.e. speakers of languages written in Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

[3] Staying with the idea that we have already allowed non-ASCII names and values, and people will use that feature, we need to be aware that individual characters have recently been added to Unicode blocks that existed in Unicode 2.1, such as Chinese, Cyrillic, Devanagari, Tamil, Bengali, Malayalam, and the like. In many cases these characters will see common use in languages that use these scripts today. This means that people who do take the XML spec at its word and exercise their right to use non-ASCII characters in things like ids, will either have difficulty understanding why one name works fine but another doesn't, or will have learned to write a somewhat stilted version of their language (at best).

The blog post I referred to above is at http://rishida.net/blog/?p=135.

Joshua said...

I probably agree with Tim that a thorough revision that creates something cohesive has more of a chance than this tinkering. XML 2.0. Thanks for your comments.