2010-01-02

XML Namespaces

One of my New Year’s resolutions is to blog more.  I don’t expect I’ll have much more success with this than I usually do with my New Year’s resolutions, but at least I can make a  start.

I have been continuing to have a dialog with some folks at Microsoft about M.  This has led me to do a lot of thinking about what is good and bad about the XML family of standards.

The standard I found it most hard to reach a conclusion about was XML Namespaces.  On the one hand, the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide.  Yet, every step on the process that led to the current situation with XML Namespaces seems reasonable.

  1. We need a way to do distributed extensibility (somebody should be able to choose a name for an element or attribute that won’t conflict with anybody else’s name without having to check with some central naming).
  2. The one true way of naming things on the Web is with a URI.
  3. XML is supposed to be human readable/writable so we can’t expect people to put URIs in every element/attribute name, so we need a shorter human-friendly name and a way to bind that to a URI.
  4. Bindings need to nest so that XML Namespace-generating processes can stream, and so that one document can easily be embedded in another.
  5. XML Namespace processing should be layered on top of XML 1.0 processing.
  6. Content and attribute values can contain strings that represent element and attribute names; these strings should be handled uniformly with names that the XML parser recognizes as element and attribute names.

I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name).  As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.

The need for this duality stemmed from the use of URIs for names. As far as I remember, there was actually no discussion in the XML WG on this point when we were doing XML Namespaces: it was treated as axiomatic that URIs were the right thing to use here. But this is where I believe XML Namespaces went wrong.

From a purely practical point of view, the argument for naming namespaces with URIs is that you can do a GET on the URI and get something human- or machine-readable back that tells you about the semantics of the namespace.  I have two responses to this:

  • This is a capability that is occasionally useful, but it’s not that useful.  The utility here is of a completely different order of magnitude compared to the disutility that results from the prefix/URI duality.  Of course, if you are a RDF aficionado, you probably disagree.
  • You can make names resolvable without using URIs.  For example, a MIME-type X/Y can be made resolvable by having a convention that it http://www.iana.org/assignments/media-types/X/Y; or, if you have a dotted DNS-style name (e.g. org.example.bar.foo), you can use DNS TXT records to make it resolvable.

From a more theoretical point of view, I think the insistence on URIs for namespaces is paying insufficient attention to the distinction between instances of things and types of things.  The Web works as well as it does because there is an extraordinarily large number of instances of things (ie Web pages) and a relatively very small number of types of things (ie MIME types).  Completely different considerations apply to naming instances and naming types: both the scale and the goals are completely different.  URIs are the right way to name instances of things on the Web; it doesn’t follow that they are the right way to name types of things.

I also have a (not very well substantiated) feeling that using URIs for namespaces tends to increase coupling between XML documents and their processing.  An example is that people tend to assume that you can determine the XML schema for a document just by looking at the namespace URI of the document element.

What lessons can we draw from this?

For XML, what is done is done.  As far as I can tell, there is zero interest amongst major vendors in cleaning up or simplifying XML. I have only two small suggestions, one for XML language designers and one for XML tool vendors:

  • For XML language designers, think whether it is really necessary to use XML Namespaces. Don’t just mindlessly stick everything in a namespace because everybody else does.  Using namespaces is not without cost. There is no inherent virtue in forcing users to stick xmlns=”…” on the document element.
  • For XML vendors, make sure your tool has good support for documents that don’t use namespaces.  For example, don’t make the namespace URI be the only way to automatically find a schema for a document.

What about future formats?  First, I believe there is a real problem here and a format should define a convention (possibly with some supporting syntax) to solve the problem. Second, a solution that involves a prefix/URI duality is probably not a good approach.

Third, a purely registry-based solution imposes centralization in situations where there’s no need. On the other hand, a purely DNS-based solution puts all extensions on the same level, when in reality from a social perspective extensions are very different: an extension that has been standardized or has a public specification is very different from an ad hoc extension used by a single vendor.  It’s good if a technology encourages cooperation and coordination.

My current thinking is that a blend of registry- and DNS-based approaches would be nice.  For example, you might have something like this:

  • names consist of one or more components separated by dots;
  • usually names consist of a single component, and their meaning is determined contextually;
  • names consisting of multiple components are used for extensions; the initial component must be registered (the registration process can be as lightweight as adding an entry to a wiki, like WHATWG does HTML5 for rel values);
  • there is a well-known URI for each registered initial component;
  • one registered initial component is “dns”: the remaining components are a reversed DNS name (Mark Nottingham’s had a ID like this for MIME types); there’s some way of resolving such a name into a URI.

Some other people’s thinking on this that I’ve found helpful: Mark Nottingham, Jeni Tennison, Tim Bray (and the rest of that xml-dev thread).