One of my New Year’s resolutions is to blog more. I don’t expect I’ll have much more success with this than I usually do with my New Year’s resolutions, but at least I can make a start.
I have been continuing to have a dialog with some folks at Microsoft about M. This has led me to do a lot of thinking about what is good and bad about the XML family of standards.
The standard I found it hardest to reach a conclusion about was XML Namespaces. On the one hand, the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide. Yet, every step in the process that led to the current situation with XML Namespaces seems reasonable.
- We need a way to do distributed extensibility (somebody should be able to choose a name for an element or attribute that won’t conflict with anybody else’s name without having to check with some central naming authority).
- The one true way of naming things on the Web is with a URI.
- XML is supposed to be human readable/writable so we can’t expect people to put URIs in every element/attribute name, so we need a shorter human-friendly name and a way to bind that to a URI.
- Bindings need to nest so that XML Namespace-generating processes can stream, and so that one document can easily be embedded in another.
- XML Namespace processing should be layered on top of XML 1.0 processing.
- Content and attribute values can contain strings that represent element and attribute names; these strings should be handled uniformly with names that the XML parser recognizes as element and attribute names.
I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name). As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.
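The duality can be seen in a few lines of Python. This is only a minimal sketch using the standard library's `xml.etree.ElementTree`; the namespace URI `http://example.org/ns` and the element name `item` are made-up placeholders, not anything from a real vocabulary.

```python
import xml.etree.ElementTree as ET

# Two documents that differ only in the prefix the author happened to choose.
# The namespace URI is a hypothetical placeholder.
doc_a = '<a:item xmlns:a="http://example.org/ns">x</a:item>'
doc_b = '<b:item xmlns:b="http://example.org/ns">x</b:item>'

# ElementTree reports names in "{uri}local" form, so the prefix is gone
# and only the semantically significant pair (URI + local name) remains.
tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag

print(tag_a)           # {http://example.org/ns}item
print(tag_a == tag_b)  # True
```

What appears in the text (`a:item`, `b:item`) and what the application sees (`{http://example.org/ns}item`) are different things, and every layer of tooling has to translate between them.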
The need for this duality stemmed from the use of URIs for names. As far as I remember, there was actually no discussion in the XML WG on this point when we were doing XML Namespaces: it was treated as axiomatic that URIs were the right thing to use here. But this is where I believe XML Namespaces went wrong.
From a purely practical point of view, the argument for naming namespaces with URIs is that you can do a GET on the URI and get something human- or machine-readable back that tells you about the semantics of the namespace. I have two responses to this:
- This is a capability that is occasionally useful, but it’s not that useful. The utility here is of a completely different order of magnitude compared to the disutility that results from the prefix/URI duality. Of course, if you are a RDF aficionado, you probably disagree.
- You can make names resolvable without using URIs. For example, a MIME-type X/Y can be made resolvable by having a convention that it maps to http://www.iana.org/assignments/media-types/X/Y; or, if you have a dotted DNS-style name (e.g. org.example.bar.foo), you can use DNS TXT records to make it resolvable.
From a more theoretical point of view, I think the insistence on URIs for namespaces is paying insufficient attention to the distinction between instances of things and types of things. The Web works as well as it does because there is an extraordinarily large number of instances of things (i.e. Web pages) and a relatively very small number of types of things (i.e. MIME types). Completely different considerations apply to naming instances and naming types: both the scale and the goals are completely different. URIs are the right way to name instances of things on the Web; it doesn’t follow that they are the right way to name types of things.
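The second response can be sketched as two small functions. This is an illustration only: the function names are hypothetical, the MIME-type URL pattern is simply the convention suggested above, and treating the fully reversed dotted name as the DNS name to query for TXT records is an assumption (no DNS lookup is performed here).

```python
def mime_type_url(mime_type: str) -> str:
    """Map a MIME type X/Y to a URL using the convention suggested above."""
    top, sep, sub = mime_type.partition("/")
    if not (top and sep and sub):
        raise ValueError(f"not a MIME type of the form X/Y: {mime_type!r}")
    return f"http://www.iana.org/assignments/media-types/{top}/{sub}"


def dns_lookup_name(dotted_name: str) -> str:
    """Reverse a dotted name (org.example.bar.foo) into a DNS-order name.

    Assumption: the whole name, reversed, is what one would query for
    TXT records. The post does not specify the exact mapping.
    """
    return ".".join(reversed(dotted_name.split(".")))


print(mime_type_url("text/html"))              # http://www.iana.org/assignments/media-types/text/html
print(dns_lookup_name("org.example.bar.foo"))  # foo.bar.example.org
```

In both cases the name stays short and opaque in the document, and resolution is a convention applied only when somebody actually wants to look the name up.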
I also have a (not very well substantiated) feeling that using URIs for namespaces tends to increase coupling between XML documents and their processing. An example is that people tend to assume that you can determine the XML schema for a document just by looking at the namespace URI of the document element.
What lessons can we draw from this?
For XML, what is done is done. As far as I can tell, there is zero interest amongst major vendors in cleaning up or simplifying XML. I have only two small suggestions, one for XML language designers and one for XML tool vendors:
- For XML language designers, think whether it is really necessary to use XML Namespaces. Don’t just mindlessly stick everything in a namespace because everybody else does. Using namespaces is not without cost. There is no inherent virtue in forcing users to stick xmlns=”…” on the document element.
- For XML vendors, make sure your tool has good support for documents that don’t use namespaces. For example, don’t make the namespace URI be the only way to automatically find a schema for a document.
What about future formats? First, I believe there is a real problem here and a format should define a convention (possibly with some supporting syntax) to solve the problem. Second, a solution that involves a prefix/URI duality is probably not a good approach.
Third, a purely registry-based solution imposes centralization in situations where there’s no need. On the other hand, a purely DNS-based solution puts all extensions on the same level, when in reality from a social perspective extensions are very different: an extension that has been standardized or has a public specification is very different from an ad hoc extension used by a single vendor. It’s good if a technology encourages cooperation and coordination.
My current thinking is that a blend of registry- and DNS-based approaches would be nice. For example, you might have something like this:
- names consist of one or more components separated by dots;
- usually names consist of a single component, and their meaning is determined contextually;
- names consisting of multiple components are used for extensions; the initial component must be registered (the registration process can be as lightweight as adding an entry to a wiki, like WHATWG does for HTML5 rel values);
- there is a well-known URI for each registered initial component;
- one registered initial component is “dns”: the remaining components are a reversed DNS name (Mark Nottingham had an ID like this for MIME types); there’s some way of resolving such a name into a URI.
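The rules in this list can be sketched as a small classifier. Everything concrete here is an assumption made for illustration: the `REGISTRY` contents, the `registry.example` URIs, the `ParsedName` type and the `parse_name` function are all hypothetical, and the list above does not say how a “dns” name resolves to a URI, so that step is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical registry: registered initial component -> well-known URI.
REGISTRY = {
    "dns": "https://registry.example/dns",
    "acme": "https://registry.example/acme",
}


@dataclass(frozen=True)
class ParsedName:
    kind: str                          # "contextual", "registered" or "dns"
    components: Tuple[str, ...]
    registry_uri: Optional[str] = None
    dns_name: Optional[str] = None     # only set when kind == "dns"


def parse_name(name: str) -> ParsedName:
    parts = tuple(name.split("."))
    if not all(parts):
        raise ValueError(f"empty component in name: {name!r}")
    if len(parts) == 1:
        # Single component: meaning is determined by context.
        return ParsedName("contextual", parts)
    initial = parts[0]
    if initial not in REGISTRY:
        raise ValueError(f"unregistered initial component: {initial!r}")
    if initial == "dns":
        # Remaining components are a reversed DNS name; un-reverse them.
        dns_name = ".".join(reversed(parts[1:]))
        return ParsedName("dns", parts, REGISTRY[initial], dns_name)
    return ParsedName("registered", parts, REGISTRY[initial])


print(parse_name("title").kind)                    # contextual
print(parse_name("acme.widget").registry_uri)      # https://registry.example/acme
print(parse_name("dns.org.example.foo").dns_name)  # foo.example.org
```

Note that nothing in this scheme needs a prefix binding: the name written in the document is the name the application sees.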
Some other people’s thinking on this that I’ve found helpful: Mark Nottingham, Jeni Tennison, Tim Bray (and the rest of that xml-dev thread).
25 comments:
Firstly, I think the biggest mistake was layering namespaces on top of XML 1.0. The layering is wrong: unique naming is far too fundamental to be an optional extra feature, and the semantic overloading of attributes is one of the causes of the unnecessary complexity.
Secondly, I think the notion that every unique identifier should be a URI is misguided. Identifiers only need to be unique within a symbol space, and namespaces could have been defined within their own symbol space. The use of URIs masked a fundamental confusion/disagreement about whether or not it was meaningful to dereference a namespace name, which surfaced later in the "relative URI" debacle that still causes interoperability problems. Namespaces could have been named like Java packages, and p:local could then have been taken as a keyboard shortcut for an element named com.jclark.ns.local - no QNames needed in the data model.
The final cause of the trouble was use of prefixes in content. Or perhaps more fundamentally, the absence of a data model that stated clearly whether or not prefixes were significant. A data model in which prefixes were clearly and unambiguously only an input shortcut would have removed many of the problems, at the cost of requiring an application-level mechanism to be used by vocabularies that needed XPath expressions or QNames as data values.
I'm not sure URIs as such are the real problem. I think the mistakes that were made are one you don't mention and one you do.
The mistake you mention is that namespaces are used in XML content, not just in XML names. This is a particular problem in XPath, but it shows up in a lot of other contexts. I'm afraid we're stuck with that.
However, what really pushed namespaces over the top in complexity was the late and unnecessary decision to make namespace mappings element scoped instead of document scoped. The ability to override namespace mappings within a document--especially non-default mappings--is unnecessary but adds a huge amount of effort to namespace processing. If there were only one namespace URI per prefix in the entire document, and if those prefixes were declared in the document header rather than willy nilly throughout the document, you'd lose no expressivity but make processing vastly simpler.
Elliotte, My point 4 that "Bindings need to nest" was my (rather cryptic) way of referring to what you describe as making "namespace mappings element scoped".
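A minimal sketch of what element scoping permits, using Python's standard `xml.etree.ElementTree`; the two namespace URIs and the element names are made-up placeholders. The same prefix `p` means different things at different depths of one document:

```python
import xml.etree.ElementTree as ET

# The prefix "p" is rebound on the inner element, so its meaning depends
# on where in the tree it is used. URIs are hypothetical placeholders.
doc = (
    '<p:outer xmlns:p="http://example.org/one">'
    '<p:inner xmlns:p="http://example.org/two"/>'
    '<p:sibling/>'
    '</p:outer>'
)

root = ET.fromstring(doc)
tags = [el.tag for el in root.iter()]  # document order, root first

for tag in tags:
    print(tag)
# {http://example.org/one}outer
# {http://example.org/two}inner
# {http://example.org/one}sibling
```

A processor therefore has to keep a stack of in-scope bindings for every element rather than a single document-wide table.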
James, how would you suggest that XML documents can be associated with schemas (of any flavour) without using namespaces? In truth, a lot of users need some unambiguous way to make this association, and it would be a problem if different vendors provided different methods of making the association. In practice, you would need both Xerces and .NET to implement the same mechanism in order to get it the necessary traction.
That said, did you have some thoughts about this? I must confess to not being much of a RELAX NG user, so I'm not familiar with the mechanisms that work there.
Thanks, Cheers, Tony.
James, you wrote: "There is no inherent virtue in forcing users to stick xmlns='...' on the document element."
I am surprised by this statement James. To get the benefits of NVDL and compound documents, wouldn't you want to encourage people to use namespaces, not discourage them?
Hi James,
There may be no inherent virtue in sticking xmlns="..." at the root of a document, but I think there is virtue in having all names in a namespace.
I find it not uncommon to have to work with multiple vocabularies at once, processing inputs in several different schemas designed for similar purposes by entirely different individuals. The ability to know that the "p" I have in my left hand is the "p" from vocabulary A and not the one from vocabulary B *greatly* simplifies the task of processing that "p".
James Clark blogging again! What a happy New Year!
We all know that namespaces need to be used with XML to avoid name collisions (e.g., they help us distinguish between person:name and organization:name, which is fundamentally required in XML design). I find nothing wrong with the XML namespace architecture; it serves its intended purpose very well.
If I'm designing an XML vocabulary, I usually like to associate a namespace with it, as it gives me the feeling that the vocabulary belongs to me (or to the domain for which it is designed), and it also gives me the freedom to choose the same local names (for XML elements and attributes) as any other XML vocabulary that may exist.
It seems like this has turned into an over-/under-constrained problem (and not being sure which is part of the breakdown along with decontextualized examination of the problem).
Even with DNS-related schemes like com.orcmid.LLC.pa.pn.ALPHA there is the desire for a mapping to resolvable URIs (to access JavaDoc, for example, and obsolescence warnings too perhaps).
The aspect that strikes me as the most deficient is the inability to have certainty with regard to schemas. Along with that, there are problems about being able to host components with different schemas in documents governed by other primary schemas and have that work easily with whatever the schema mechanism is. Repurposing seems to be unnecessarily complicated and the tolerance of namespace abuse (oh, not that schema, this one, and these other interestingly-different semantics too, nudge, nudge, wink, wink) appears to be hopeless.
Since we have already abdicated the use of "." in Names, I hazard the observation that currently QNames can have at most one ":", leaving a marvelous prospect for living dangerously in muddied waters.
It is great to see your promise of more posting.
My current thinking is that a blend of registry- and DNS-based approaches would be nice.
That has its own crop of problems. It bakes the political history of a concept into its name – witness the mess of X- headers in email. Why do some headers carry this prefix and others not? There is no substantial reason – only accidents of history.
XML Namespace processing should be layered on top of XML 1.0 processing
This didn't exactly happen. XML says attributes are unordered, and indeed many XML processors change attribute order at will. However, the XML Namespaces spec says that xmlns attributes must come first. It seems a curious exception to XML for no good reason.
Perhaps the original mistake was to allow unordered elements in XML.
Actually, the XML Namespaces Rec does not require xmlns attributes to come first (which adds an additional complication for implementations).
Happy New Year, James!
I very much agree with you.
It seems the HTML 5 crowd also dislikes namespace syntax, and has devised that when an SVG element occurs, it is to be placed automatically by the processor into an SVG namespace.
Thinking about how one might specify this behaviour extensibly, rather than hard-wiring a set of URIs (and prefixes!), led me to the "Unobtrusive Namespaces" and "Imaginary Namespaces" proposals, at http://www.barefootliam.org/xml/20091111-unobtrusive-namespaces if you are interested.
One of the most frequently asked questions I hear is, Why doesn't my XPath work, and the most common answer is that the document places all the elements in some namespace, often just because it's "Best Practice" and not because it's needed. Maybe we (W3C, but also the wider XML community) need to go on a massive namespace purge, and make clearer to people that you don't need to wear an identifying name-tag in your own home.
Liam
[oops, looks like it didn't work first time, sorry if you get this twice]
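The "Why doesn't my XPath work" question from the comment above can be reproduced in a few lines. This is only a sketch using the limited XPath support in Python's standard `xml.etree.ElementTree`; the namespace URI, the element names and the prefix `d` are made-up placeholders.

```python
import xml.etree.ElementTree as ET

# A document that puts everything in a default namespace.
doc = '<doc xmlns="http://example.org/ns"><title>Hi</title></doc>'
root = ET.fromstring(doc)

# The path a newcomer would naturally write finds nothing, because the
# element's real name is {http://example.org/ns}title, not title.
no_ns = root.find("title")

# The query only works once a prefix is bound for the query itself,
# even though the document never uses a prefix at all.
ns = {"d": "http://example.org/ns"}
with_ns = root.find("d:title", ns)

print(no_ns)         # None
print(with_ns.text)  # Hi
```

The prefix `d` exists only in the query; it has to be invented by the person writing the path, which is exactly the step people miss.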
Oh sorry. It's actually the xpath spec that says "The namespace nodes are defined to occur before the attribute nodes". And the c14n spec says something else.
James, I like a lot your "usually names consist of a single component, and their meaning is determined contextually;".
In the CAM toolkit and templates approach we are using dictionaries now to facilitate simple content assembly (download from sourceforge.net). So this means your XML instances can be very simple, and logic can go back to the template to resolve semantic details. The snag is XSD Schema is using namespaces for all kinds of ugly devices around import and such. With the dictionary approach life is simpler again. And the dictionary format we are using will open as an Excel spreadsheet so it is human accessible! Novel concept...
We still have more to do and learn in 2010. Aligning more to the CEFACT CCTS work is one; more content assembly methods with the expander tool is another. We already have some dictionary functions using CCTS. Also NDR (naming and design rules) checks in the CAM toolkit to help people make better schema. We also have a renamer utility - primarily to make SQL names match NIEM.gov NDR conventions - but I'm also planning to make it capable of removing namespaces from schema in two modes: remove, i.e. this:myname becomes this_myname, or strip, where this:myname is now simply myname. Then the structure hierarchy, use context and the dictionary template define the semantics and sourcing.
We can definitely do this all better within the existing confines of XML - and I totally agree - ONLY use namespaces when you have to - not as a default.
Happy simpler XML in 2010 from the CAM team!
Liam,
As a part of XML theory and practice, I believe that XML namespaces are quite useful. They are the core concept on which XML vocabularies are designed (e.g. for XSLT, XSD, RELAX NG etc. it's useful, and necessary, to know that xsl:element belongs to the XSLT language, because the "xsl" prefix is bound to the XSLT namespace). But I agree that using namespaces out of proportion in end user XML documents is not a good idea. If namespaces are not required in a given problem circumstance, they shouldn't be used, as it unnecessarily confuses readers of the XML document, and may also cause pain while debugging (the XPath example you gave :)). But each problem description is different! I think, in general, XML namespaces are a great part of XML.
regards
James, part 2 here of thoughts - what we also need is a check list for schema designers of when namespaces are required / desired and when they are not.
For example in CAM templates the only place we have found you need to use namespaces is to embed overlay markup into markup, where that markup itself is handling instructions that are not part of the original content structure. So parsing logic can transparently distinguish the original and act on the instructions.
e.g.
original has as: instructions inserted as attributes or elements:
type=stuffTypeDef
Contrariwise, there are examples of people using namespaces where they are likely superfluous. And of course inline namespaces compared to globally defined ones obviously incur ugly overhead, as you already noted.
I think there were two namespace design mistakes:
1) Not adding markup to distinguish qnames in content (thereby making automatic handling of prefix declarations a minefield) and
2) Not making sure DTDs and namespaces don't conflict.
Fixing either would have broken XML 1.0 so a new version of XML should have been created for namespaces. On the other hand, I'm quite content with the use of URIs to identify namespaces. We don't need yet another scheme to look things up.
I agree with the idea to integrate namespace in the XML spec.
Why ? Just because it will force people to be coherent with the namespace idea.
The best example is in the RDF/OWL standard family.
I think it took me too much time to understand that:
- sometimes they were using namespaces in the original sense
- and at other times they were using syntactic short names that follow the same uri:name model but no longer refer to a namespace.
Pierre
I think that 4 and 6 in your first list as well as namespace declarations as attributes (rather than PIs) are causing problems.
In LMNL we had an opportunity to redo namespaces from scratch. Here are the decisions we made:
We kept URIs and prefixes, but require a biunique URI-prefix mapping throughout a document (except for the empty prefix; see below).
Namespace declarations can appear anywhere that whitespace can appear (with a few exceptions) and have the unique syntax [!ns prefix="IRI"].
The scope of a namespace declaration is from the end of the declaration to the end of the document, so it's possible to introduce them other than at the top, but you have to pay attention to what prefixes are already declared.
Declarations of the empty prefix are of the form [!ns "IRI"]. If the empty prefix is undeclared, names without prefixes are in the reserved namespace http://lmnl.net/namespaces/innominate. If a declaration has been seen for the empty prefix, all future names without prefixes are in that namespace. For consistency, then, you should declare the empty prefix at the top or not at all.
The lmnl prefix need not be declared; its URI is always http://lmnl.net/namespaces/lmnl.
I think the real issue with namespaces is that they couple processing and semantics of XML in a way that wasn't strictly necessary, and which added significant overhead.
For the most part, XML defines a way to encode a document, but says very little about the actual interpretation. Namespaces added significantly to the burden of syntax interpretation (witness the DOM specification additions), but still don't really do much with semantic interpretation: that burden has always, and still does, remain with whatever is processing the XML. You can accomplish much the same thing by externalizing the bindings, which is essentially what you end up doing when you process the document anyway.
Another way to say it is that namespaces try to enforce policy, and that actually fights against one of the basic principles of XML.
It's a shame that you no longer appear to believe in the virtues of namespaces. I think the problem with namespaces, as with any tool, is not so much the specification itself as how it's used in practice. For example, relying on prefixes rather than the actual URIs. Also, a more abstract convention for namespace URIs may be in order, for example magnet links for namespaces rather than HTTP URIs.
I think the problem with namespaces being layered is that they are useful for defining a standard vocabulary for things like comments, which could have simply had a namespace rather than a completely distinct syntax. And they could in turn have been layered on top of the basic specification. That would have been a better layering than layering the namespacing. - Isn't hindsight great?
Sadly that's not the case. The same could be said for several other features like processing instructions. Do you really need a special syntax for these?
Also, is human-readability a benefit or an issue? Human-readability leads to things like attributes, which leads to things like the XML Information Set, which treats attributes as something distinct from elements. Really, there should be only elements, and editors should have ways to represent simple elements in a smart way. Also, attributes lead down the path of micro-formats, which completely abandon the elegant underlying XML data model.
Again, the benefit of hindsight...
Having worked on namespaces for quite some time, I believe that this will need an out-of-the-box approach. I wonder if there is any work being done on a service to provide namespaces? I am myself trying to work in this direction and am looking for like-minded people. My email address is cloucomp@rediffmail.com