
2010-12-13

MicroXML

There's been a lot of discussion on the xml-dev mailing list recently about the future of XML.  I see a number of different possible directions.  I'll give each of these possible directions a simple name:

  • XML 2.0 - by this I mean something that is intended to replace XML 1.0, but has a high degree of backward compatibility with XML 1.0;
  • XML.next - by this I mean something that is intended to be a more functional replacement for XML, but is not designed to be compatible (however, it would be rich enough that there would presumably be a way to translate JSON or XML into it);
  • MicroXML - by this I mean a subset of XML 1.0 that is not intended to replace XML 1.0, but is intended for contexts where XML 1.0 is, or is perceived as, too heavyweight.

I am not optimistic about XML 2.0. There is a lot of inertia behind XML, and anything that is perceived as changing XML is going to meet with heavy resistance.  Furthermore, backwards compatibility with XML 1.0 and XML Namespaces would limit the potential for producing a clean, understandable language with really substantial improvements over XML 1.0.

XML.next is a big project, because it needs to tackle not just XML but the whole XML stack. It is not something that can be designed by a committee from nothing; there would need to be one or more solid implementations that could serve as a basis for standardization.  Also given the lack of compatibility, the design will have to be really compelling to get traction. I have a lot of thoughts about this, but I will leave them for another post.

In this post, I want to focus on MicroXML. One obvious objection is that there is no point in doing a subset now, because the costs of XML complexity have already been paid.  I have a number of responses to this. First, XML complexity continues to have a cost even when XML parsers and other tools have been written; it is an ongoing cost to users of XML and developers of XML applications. Second, the main appeal of MicroXML should be to those who are not using XML, because they find XML overly complex. Third, many specifications that support XML are in fact already using their own ad-hoc subsets of XML (eg XMPP, SOAP, E4X, Scala). Fourth, this argument applied to SGML would imply that XML was pointless.

HTML5 is another major factor. HTML5 defines an XML syntax (ie XHTML) as well as an HTML syntax. However, there are a variety of practical reasons why XHTML, by which I mean XHTML served as application/xhtml+xml, isn't common on the Web. For example, IE doesn't support XHTML; Mozilla doesn't incrementally render XHTML.  HTML5 makes it possible to have "polyglot" documents that are simultaneously well-formed XML and valid HTML5.  I think this is potentially a superb format for documents: it's rich enough to represent a wide range of documents, it's much simpler than full HTML5, and it can be processed using XML tools. There's a W3C WD for this. The WD defines polyglot documents in a slightly different way, requiring them to produce the same DOM when parsed as XHTML as when parsed as HTML; I don't see much value in this, since I don't see much benefit in serving documents as application/xhtml+xml.  The practical problem with polyglot documents is that they require the author to obey a whole slew of subtle lexical restrictions that are hard to enforce using an XML toolchain and a schema language. (Schematron can do a bit better here than RELAX NG or XSD.)

So one of the major design goals I have for MicroXML is to facilitate polyglot documents.  More precisely the goal is that a document can be guaranteed to be a valid polyglot document if:

  1. it is well-formed MicroXML, and
  2. it satisfies constraints that are expressed purely in terms of the MicroXML data model.
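
To make the goal concrete, here is a minimal check using Python's standard library (a sketch only: html.parser is lenient rather than validating, so this demonstrates processability by both toolchains, not HTML5 validity):

import xml.etree.ElementTree as ET
from html.parser import HTMLParser

doc = ('<html xmlns="http://www.w3.org/1999/xhtml"><head>'
       '<title>Test</title></head><body><p>Hello</p></body></html>')

ET.fromstring(doc)        # parses as well-formed XML: no exception raised
HTMLParser().feed(doc)    # also consumable by an HTML parser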

Now let's look in detail at what MicroXML might consist of. (When I talk about HTML5 in the following, I am talking about its HTML syntax, not its XML syntax.)

  • Specification. I believe it is important that MicroXML has its own self-contained specification, rather than being defined as a delta on existing specifications.
  • DOCTYPE declaration. Clearly the internal subset should not be allowed.  The DOCTYPE declaration itself is problematic. HTML5 requires valid HTML5 documents to start with a DOCTYPE declaration.  However, HTML5 uses DOCTYPE declarations in a fundamentally different way to XML: instead of referencing an external DTD subset which is supposed to be parsed, it tells the HTML parser what parsing mode to use.  Another factor is that almost the only thing that the XML subsets out there agree on is to disallow the DOCTYPE declaration.  So my current inclination is to disallow the DOCTYPE declaration in MicroXML. This would mean that MicroXML does not completely achieve the goal I set above for polyglot documents. However, you would be able to author a <body> or a <section> or an <article> as MicroXML; this would then have to be assembled into a valid HTML5 document by a separate process (albeit a very simple one). It would be great if HTML5 provided an alternate way (using attributes or elements) to declare that an HTML document be parsed in standards mode. Perhaps a boolean "standard" attribute on the <meta> element?
  • Error handling. Many people in the HTML community view XML's draconian error handling as a major problem.  In some contexts, I have to agree: it is not helpful for a user agent to stop processing and show an error, when a user is not in a position to do anything about the error. I believe MicroXML should not impose any specific error handling policy; it should restrict itself to specifying when a document is conforming and specifying the instance of the data model that is produced for a conforming document. It would be possible to have a specification layered on top of MicroXML that would define detailed error handling (as for example in the XML5 specification).
  • Namespaces. This is probably the hardest and most controversial issue. I think the right answer is to take a deep breath and just say no. One big reason is that HTML5 does not support namespaces (remember, I am talking about the HTML syntax of HTML5). Another reason is that the basic idea of binding prefixes to URIs is just too hard; the WHATWG wiki has a good page on this. The question then becomes how MicroXML handles the problems that XML Namespaces addresses. What do you do if you need to create a document that combines multiple independent vocabularies? I would suggest two mechanisms:
    • I would support the use of the xmlns attribute (not xmlns:x, just bare xmlns). However, as far as the MicroXML data model is concerned, it's just another attribute. It thus works in a very similar way to xml:lang: it would be allowed only where a schema language explicitly permits it; semantically it works as an inherited attribute; it does not magically change the names of elements.
    • I would also support the use of prefixes.  The big difference is that prefixes would be meaningful and would not have to be declared.  Conflicts between prefixes would be avoided by community cooperation rather than by namespace declarations.  I would divide prefixes into two categories: prefixes without any periods, and prefixes with one or more periods.  Prefixes without periods would have a lightweight registration procedure (ie a mailing list and a wiki); prefixes with periods would be intended for private use only and would follow a reverse domain name convention (e.g. com.jclark.foo). For compatibility with XML tools that require documents to be namespace-well-formed, it would be possible for MicroXML documents to include xmlns:* attributes for the prefixes they use (and a schema could require this). Note that these would be attributes from the MicroXML perspective. Alternatively, a MicroXML parser could insert suitable declarations when it is acting as a front-end for a tool that expects a namespace-well-formed XML infoset.
  • Comments. Allowed, but restricted to be HTML5-compatible; HTML5 does not allow the content of a comment to start with > or ->.
  • Processing instructions. Not allowed. (HTML5 does not allow processing instructions.)
  • Data model.  The MicroXML specification should define a single, normative data model for MicroXML documents. It should be as simple as possible (a sketch of this model follows the list):
    • The model for a MicroXML document consists of a single element.
    • Comments are not included in the normative data model.
    • An element consists of a name, attributes and content.
    • A name is a string. It can be split into two parts: a prefix, which is either empty or ends in a colon, and a local name.
    • Attributes are a map from names to Unicode strings (sequences of Unicode code-points).
    • Content is an ordered sequence of Unicode code-points and elements.
    • An element probably also needs to have a flag saying whether it's an empty element. This is unfortunate, but HTML5 does not treat an empty element as equivalent to a start-tag immediately followed by an end-tag: elements like <br> cannot have an end-tag, and elements that can have content, such as <a>, cannot use the empty-element syntax even if they happen to be empty. (It would be really nice if this could be fixed in HTML5.)
  • Encoding. UTF-8 only. Unicode in the UTF-8 encoding is already used for nearly 50% of the Web. See this post from Google.  XML 1.0 also requires support for UTF-16, but in my view UTF-16 is not used sufficiently on the Web to justify requiring support for it while not requiring support for more widely used encodings like US-ASCII and ISO-8859-1.
  • XML declaration. Not allowed. Given UTF-8 only and no DOCTYPE declarations, it is unnecessary. (HTML5 does not allow XML declarations.)
  • Names. What characters should be allowed in an element or attribute name? I can see three reasonable choices here: (a) XML 1.0 4th edition, (b) XML 1.0 5th edition or (c) the ASCII-only subset of XML name characters (same in 4th and 5th editions). I would incline to (b) on the basis that (a) is too complicated and (c) loses too much expressive power.
  • Attribute value normalization. I think this has to go.  HTML5 does not do attribute value normalization. This means that it is theoretically possible for a MicroXML document to be interpreted slightly differently by an XML processor than by a MicroXML processor.  However, I think this is unlikely to be a problem in practice.  Do people really put newlines in attribute values and rely on their being turned into spaces?  I doubt it.
  • Newline normalization. This should stay.  It makes things simpler for users and application developers.  HTML5 has it as well.
  • Character references.  Without DOCTYPE declarations, only the five built-in character entities can be referenced. Things could be simplified a little by allowing only hex or only decimal numeric character references, but I don't think this is worthwhile.
  • CDATA sections. I think it best to disallow them. (HTML5 allows CDATA sections only in foreign elements.) XML 1.0 does not allow the three-character sequence ]]> to occur in content. This restriction becomes even more arbitrary and ugly when you remove CDATA sections, so I think it is simpler just to require > to always be entered using a character reference in content.
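
Before getting to the grammar, here is the data model above rendered as a small Python sketch (the class and field names are invented):

from dataclasses import dataclass

@dataclass
class Element:
    name: str            # prefix (empty or ending in ':') plus local name
    attributes: dict     # map from names to Unicode strings
    content: list        # ordered mix of strings and child Elements
    empty: bool = False  # the unfortunate HTML5-compatibility flag

# A MicroXML document is then just a single Element:
doc = Element('p', {'xml:lang': 'en'}, ['Hello ', Element('br', {}, [], empty=True)])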

Here's a complete grammar for MicroXML (using the same notation as the XML 1.0 Recommendation):

# Documents
document ::= (comment | s)* element (comment | s)*
element ::= startTag content endTag
          | emptyElementTag
content ::= (element | comment | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= name s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                 | "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<'|'&')
# Data characters
dataChar ::= char - ('<'|'&'|'>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Comments
comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# Enforce the HTML5 restriction that comments cannot start with '>' or '->'
commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
# As in XML 1.0
commentContentContinue ::= (char - '-') | ('-' (char - '-'))
# Names
name ::= (simpleName ':')? simpleName
simpleName ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]

2010-11-24

XML vs the Web

Twitter and Foursquare recently removed XML support from their Web APIs, and now support only JSON.  This prompted Norman Walsh to write an interesting post, in which he summarised his reaction as "Meh". I won't try to summarise his post; it's short and well worth reading.

From one perspective, it's hard to disagree.  If you're an XML wizard with a decade or two of experience with XML and SGML before that, if you're an expert user of the entire XML stack (eg XQuery, XSLT2, schemas), if most of your data involves mixed content, then JSON isn't going to be supplanting XML any time soon in your toolbox.

Personally, I got into XML not to make my life as a developer easier, nor because I had a particular enthusiasm for angle brackets, but because I wanted to promote some of the things that XML facilitates, including:

  • textual (non-binary) data formats;
  • open standard data formats;
  • data longevity;
  • data reuse;
  • separation of presentation from content.

If other formats start to supplant XML, and they support these goals better than XML, I will be happy rather than worried.

From this perspective, my reaction to JSON is a combination of "Yay" and "Sigh".

It's "Yay", because for important use cases JSON is dramatically better than XML.  In particular, JSON shines as a programming language-independent representation of typical programming language data structures.  This is an incredibly important use case and it would be hard to overstate how appallingly bad XML is for this. The fundamental problem is the mismatch between programming language data structures and the XML element/attribute data model of elements. This leaves the developer with three choices, all unappetising:

  • live with an inconvenient element/attribute representation of the data;
  • descend into XML Schema hell in the company of your favourite data binding tool;
  • write reams of code to convert the XML into a convenient data structure.

By contrast, with JSON, especially in a dynamic programming language, you can get a reasonable in-memory representation just by calling a library function.
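
To make the contrast concrete, a small Python illustration (the payload is invented):

import json
import xml.etree.ElementTree as ET

payload = '{"user": {"name": "Jack", "followers": [1, 2, 3]}}'
data = json.loads(payload)            # one library call, natural data structures
print(data['user']['followers'])      # [1, 2, 3]

# The XML equivalent parses into an element tree, after which you still have
# to write the mapping to the data structure you actually wanted:
doc = ET.fromstring('<user><name>Jack</name>'
                    '<followers><id>1</id><id>2</id><id>3</id></followers></user>')
followers = [int(e.text) for e in doc.find('followers')]
print(followers)                      # [1, 2, 3]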

Norman argues that XML wasn't designed for this sort of thing. I don't think the history is quite as simple as that. There were many different individuals and organisations involved with XML 1.0, and they didn't all have the same vision for XML. The organisation that was perhaps most influential in terms of getting initial mainstream acceptance of XML was Microsoft, and Microsoft was certainly pushing XML as a representation for exactly this kind of data. Consider SOAP and XML Schema; a lot of the hype about XML and a lot of the specs built on top of XML for many years were focused on using XML for exactly this sort of thing.

Then there are the specs. For JSON, you have a 10-page RFC, with the meat being a mere 4 pages. For XML, you have XML 1.0, XML Namespaces, XML Infoset, XML Base, xml:id, XML Schema Part 1 and XML Schema Part 2. Now you could actually quite easily take XML 1.0, ditch DTDs, add XML Namespaces, xml:id, xml:base and XML Infoset and end up with a reasonably short (although more than 10 pages), coherent spec. (I think Tim Bray even did a draft of something like this once.) But in 10 years the W3C and its membership have not cared enough about simplicity and coherence to take any action on this.

Norman raises the issue of mixed content. This is an important issue, but I think the response of the average Web developer can be summed up in a single word: HTML. The Web already has a perfectly good format for representing mixed content. Why would you want to use JSON for that?  If you want to embed HTML in JSON, you just put it in a string. What could be simpler? If you want to embed JSON in HTML, just use <script> (or use an alternative HTML-friendly data representation such as microformats). I'm sure Norman doesn't find this a satisfying response (nor do I really), but my point is that appealing to mixed content is not going to convince the average Web developer of the value of XML.

There's a bigger point that I want to make here, and it's about the relationship between XML and the Web.  When we started out doing XML, a big part of the vision was about bridging the gap from the SGML world (complex, sophisticated, partly academic, partly big enterprise) to the Web, about making the value that we saw in SGML accessible to a broader audience by cutting out all the cruft. In the beginning XML did succeed in this respect. But this vision seems to have been lost sight of over time to the point where there's a gulf between the XML community and the broader Web developer community; all the stuff that's been piled on top of XML, together with the huge advances in the Web world in HTML5, JSON and JavaScript, have combined to make XML be perceived as an overly complex, enterprisey technology, which doesn't bring any value to the average Web developer.

This is not a good thing for either community (and it's why part of my reaction to JSON is "Sigh"). XML misses out by not having the innovation, enthusiasm and traction that the Web developer community brings with it, and the Web developer community misses out by not being able to take advantage of the powerful and convenient technologies that have been built on top of XML over the last decade.

So what's the way forward? I think the Web community has spoken, and it's clear that what it wants is HTML5, JavaScript and JSON. XML isn't going away but I see it being less and less a Web technology; it won't be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

In the short-term, I think the challenge is how to make HTML5 play more nicely with XML. In the longer term, I think the challenge is how to use our collective experience from building the XML stack to create technologies that work natively with HTML, JSON and JavaScript, and that bring to the broader Web developer community some of the good aspects of the modern XML development experience.

2010-02-12

A tour of the open standards used by Google Buzz

The thing I find most attractive about Google Buzz is its stated commitment to open standards:

We believe that the social web works best when it works like the rest of the web — many sites linked together by simple open standards.

So I took a bit of time to look over the standards involved. I’ll focus here on the standards that are new to me.

One key design decision in Google Buzz is that individuals in the social web should be identifiable by email addresses (or at least strings that look like email addresses).  On balance I agree with this decision: although it is perhaps better from a purist Web architecture perspective to use URIs for this, I think email addresses work much better from a UI perspective.

Google Buzz therefore has some standards to address the resulting discovery problem: how to associate metadata with something that looks like an email address. There are two key standards here:

  • XRD. This is a simple XML format developed by the OASIS XRI TC for representing metadata about a resource in a generic way. This looks very reasonable and I am happy to see that it is free of any XRI cruft. It seems quite similar to RDDL.
  • WebFinger. This provides a mechanism for getting from an email address to an XRD file.  It’s a two-step process based on HTTP.  First you do an HTTP GET of an XRD file from a well-known URI constructed using the domain part of the email address (the well-known URI follows the Defining Well-Known URIs and host-meta Internet Drafts). This per-domain XRD file provides (amongst other things) a URI template that tells you how to construct a URI for an email address in that domain; dereferencing this URI will give you an XRD representation of metadata related to that email address.  There are also noises about a JSON serialization, which makes sense: JSON seems like a good fit for this problem.
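
Here is the two-step process sketched in Python (illustrative only: the lrdd link relation and {uri} template variable are my reading of the drafts, and real code would percent-encode the substituted URI and handle errors):

import urllib.request
import xml.etree.ElementTree as ET

XRD_NS = '{http://docs.oasis-open.org/ns/xri/xrd-1.0}'

def webfinger(email):
    domain = email.split('@', 1)[1]
    # Step 1: fetch the per-domain XRD from the well-known URI
    host_meta = ET.parse(urllib.request.urlopen(
        'https://' + domain + '/.well-known/host-meta'))
    # Step 2: find the URI template and substitute the account's identifier
    for link in host_meta.iter(XRD_NS + 'Link'):
        if link.get('rel') == 'lrdd' and link.get('template'):
            url = link.get('template').replace('{uri}', 'acct:' + email)
            return ET.parse(urllib.request.urlopen(url))
    return None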

One of the many interesting things you can do with such a discovery mechanism is to associate a public key with an individual.  There’s a spec called Magic Signatures that defines this.  Magic Signatures correctly eschews all the usual X.509 cruft, which is completely unnecessary here; all you need is a simple RSA public key.  My one quibble would be that it invents its own format for public keys, when there is already a perfectly good standard format for this: the DER encoding of the RSAPublicKey ASN.1 structure (defined by RFC 3447/PKCS#1), as used by eg OpenSSL.

Note that for this to be secure, WebFinger needs to fetch the XRD files in a secure way, which means either using SSL or signing the XRD file using XML-DSig; in both these cases it is leveraging the existing X.509 infrastructure. The key architectural decision here is to use the X.509 infrastructure to establish trust at the domain level, and then to use Web technologies to extend that chain of trust from the domain to the individual. From a deployment perspective, I think this will work well for things like Gmail and Facebook, where you have many users per domain.  The challenge will be to make it work well for things like Google Apps for your Domain, where the number of users per domain may be few.  At the moment, Google Apps requires the domain administrator only to set up some DNS records.  The problem is that DNS isn’t secure (at least until DNSSEC is widely deployed).  Here’s one possible solution: the user’s domain (e.g. jclark.com) would have an SRV record pointing to a host in the provider’s domain (e.g. foo.google.com); the XRD is fetched using HTTP, but is signed using XML-DSig and an X.509 certificate for the user’s domain.  The WebFinger service provider (e.g. Google) would take care of issuing these certificates, perhaps with flags to limit their usage to WebFinger (Google already verifies domain control as part of the Google Apps setup process). The trusted roots here might be different from the normal browser-vendor-determined HTTPS roots.

The other part of Magic Signatures is billed as a simpler alternative to XML-DSig which also works for JSON. The key idea here is to avoid the whole concept of signing an XML information item and thus avoid the need for canonicalization.  Instead you sign a byte sequence, which is encoded in base64 as the content of an XML element (or as a JSON string).  I don’t agree with the idea of always requiring base64 encoding of the content to be signed: that seems to unnecessarily throw away many of the benefits of a textual format.  Instead, when the byte sequence that you are signing is representing a Unicode string, you should be able to represent the Unicode string directly as the content of an XML element or as a JSON string, using the built-in quoting mechanisms of XML (character references/entities and CDATA sections) or JSON. The Unicode string that results from XML or JSON parsing would be UTF-8 encoded before the standard signature algorithm is applied. A more fundamental problem with Magic Signatures is that it loses the key feature of XML-DSig (particularly with enveloped signatures) that applications that don’t know or care about signing can still understand the signed data, simply by ignoring the signature.  I completely sympathize with the desire to avoid the complexity of XML-DSig, but I’m unconvinced that Magic Signatures is the right way to do so. Note that XRD has a dependency on XML-DSig, but it specifies a very limited profile of XML-DSig, which radically reduces the complexity of XML-DSig processing. For JSON, I think i

There are also standards that extend Atom. The simplest are just content extensions:

  • Atom Activity Extensions provides semantic markup for social networking activities (such as "liking" something or posting something). This makes good sense to me.
  • Media RSS Module provides extensions for dealing with multimedia content. These were originally designed by Yahoo for RSS. I don't yet understand how these interact with existing Atom/AtomPub mechanisms for multimedia (content/@src, link).

There are also protocol extensions:

  • PubSubHubbub provides a scalable way of getting near-realtime updates from an Atom feed. The Atom feed includes a link to a “hub”.  An aggregator can then register with the hub to be notified when a feed is updated. When a publisher updates a feed, it pings the hub and the hub then updates all the aggregators that have registered with it.  This is intended for server-based aggregators, since the hub uses HTTP POST to notify aggregators. (A sketch of the subscription request follows this list.)
  • Salmon makes feed aggregation two-way.  Suppose user A uses only social networking site X and user B uses only social networking site Y. If user A wants to network with B, then typically either A has to join Y or B has to join X.  This pushes the world in the direction of having one dominant social network (i.e. Facebook). In the long-term I don’t think this is a good thing.  The above extensions solve part of the problem. X can expose a profile for A that links to an Atom feed, and Y can use this to provide B with information about A. But there’s a problem.  Suppose B wants to comment on one of A’s entries.  How can Y ensure that B’s comment flows back to X, where A can see it?  Note that there may be another user C on another social networking site Z that may want to see B’s comment on A’s entry. The basic idea is simple: the Atom feed for A exposed by X links to a URI to which comments can be posted.  The heavy lifting of Salmon is done by Magic Signatures.  Signing the Atom entries is the key to allowing sites to determine whether to accept comments.
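
A sketch of the PubSubHubbub subscription request mentioned above (the hub.* parameter names are from the PubSubHubbub 0.x drafts as I recall them; the URLs are invented):

import urllib.parse
import urllib.request

def subscribe(hub, topic, callback):
    body = urllib.parse.urlencode({
        'hub.mode': 'subscribe',
        'hub.topic': topic,        # the Atom feed to watch
        'hub.callback': callback,  # where the hub will POST updates
        'hub.verify': 'sync',
    }).encode('ascii')
    return urllib.request.urlopen(hub, data=body)

subscribe('https://hub.example.com/',
          'https://example.com/feed.atom',
          'https://aggregator.example.net/notify')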

Google seems to be planning to use the Open Web Foundation (OWF) for some of these standards.  Although the OWF’s list of members includes many names that I recognize and respect, I don’t really understand why we need the OWF. It seems very similar to the IETF in its emphasis on individual participation.  What was the perceived deficiency in the IETF that motivated the formation of the OWF?

2010-01-02

XML Namespaces

One of my New Year’s resolutions is to blog more.  I don’t expect I’ll have much more success with this than I usually do with my New Year’s resolutions, but at least I can make a start.

I have been continuing to have a dialog with some folks at Microsoft about M.  This has led me to do a lot of thinking about what is good and bad about the XML family of standards.

The standard I found it hardest to reach a conclusion about was XML Namespaces.  On the one hand, the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide.  Yet, every step in the process that led to the current situation with XML Namespaces seems reasonable.

  1. We need a way to do distributed extensibility (somebody should be able to choose a name for an element or attribute that won’t conflict with anybody else’s name without having to check with some central naming authority).
  2. The one true way of naming things on the Web is with a URI.
  3. XML is supposed to be human readable/writable so we can’t expect people to put URIs in every element/attribute name, so we need a shorter human-friendly name and a way to bind that to a URI.
  4. Bindings need to nest so that XML Namespace-generating processes can stream, and so that one document can easily be embedded in another.
  5. XML Namespace processing should be layered on top of XML 1.0 processing.
  6. Content and attribute values can contain strings that represent element and attribute names; these strings should be handled uniformly with names that the XML parser recognizes as element and attribute names.

I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name).  As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.

The need for this duality stemmed from the use of URIs for names. As far as I remember, there was actually no discussion in the XML WG on this point when we were doing XML Namespaces: it was treated as axiomatic that URIs were the right thing to use here. But this is where I believe XML Namespaces went wrong.

From a purely practical point of view, the argument for naming namespaces with URIs is that you can do a GET on the URI and get something human- or machine-readable back that tells you about the semantics of the namespace.  I have two responses to this:

  • This is a capability that is occasionally useful, but it’s not that useful.  The utility here is of a completely different order of magnitude compared to the disutility that results from the prefix/URI duality.  Of course, if you are an RDF aficionado, you probably disagree.
  • You can make names resolvable without using URIs.  For example, a MIME type X/Y can be made resolvable by having a convention that it maps to http://www.iana.org/assignments/media-types/X/Y; or, if you have a dotted DNS-style name (e.g. org.example.bar.foo), you can use DNS TXT records to make it resolvable.
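
For instance, the MIME-type convention is nothing more than string concatenation (a trivial sketch in Python):

def mime_type_url(mime_type):
    # the convention suggested above: map type X/Y onto IANA's registry layout
    return 'http://www.iana.org/assignments/media-types/' + mime_type

print(mime_type_url('application/xml'))
# http://www.iana.org/assignments/media-types/application/xml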

From a more theoretical point of view, I think the insistence on URIs for namespaces is paying insufficient attention to the distinction between instances of things and types of things.  The Web works as well as it does because there is an extraordinarily large number of instances of things (ie Web pages) and a relatively small number of types of things (ie MIME types).  Completely different considerations apply to naming instances and naming types: both the scale and the goals are completely different.  URIs are the right way to name instances of things on the Web; it doesn’t follow that they are the right way to name types of things.

I also have a (not very well substantiated) feeling that using URIs for namespaces tends to increase coupling between XML documents and their processing.  An example is that people tend to assume that you can determine the XML schema for a document just by looking at the namespace URI of the document element.

What lessons can we draw from this?

For XML, what is done is done.  As far as I can tell, there is zero interest amongst major vendors in cleaning up or simplifying XML. I have only two small suggestions, one for XML language designers and one for XML tool vendors:

  • For XML language designers, think whether it is really necessary to use XML Namespaces. Don’t just mindlessly stick everything in a namespace because everybody else does.  Using namespaces is not without cost. There is no inherent virtue in forcing users to stick xmlns="…" on the document element.
  • For XML vendors, make sure your tool has good support for documents that don’t use namespaces.  For example, don’t make the namespace URI be the only way to automatically find a schema for a document.

What about future formats?  First, I believe there is a real problem here and a format should define a convention (possibly with some supporting syntax) to solve the problem. Second, a solution that involves a prefix/URI duality is probably not a good approach.

Third, a purely registry-based solution imposes centralization in situations where there’s no need. On the other hand, a purely DNS-based solution puts all extensions on the same level, when in reality from a social perspective extensions are very different: an extension that has been standardized or has a public specification is very different from an ad hoc extension used by a single vendor.  It’s good if a technology encourages cooperation and coordination.

My current thinking is that a blend of registry- and DNS-based approaches would be nice.  For example, you might have something like this:

  • names consist of one or more components separated by dots;
  • usually names consist of a single component, and their meaning is determined contextually;
  • names consisting of multiple components are used for extensions; the initial component must be registered (the registration process can be as lightweight as adding an entry to a wiki, as the WHATWG does for HTML5 rel values);
  • there is a well-known URI for each registered initial component;
  • one registered initial component is “dns”: the remaining components are a reversed DNS name (Mark Nottingham had an ID like this for MIME types); there’s some way of resolving such a name into a URI.
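
A sketch of how a processor might classify names under this scheme (the registry here is a stand-in dict; per the above, in reality it would be a wiki):

REGISTRY = {'html', 'dns'}   # hypothetical registered initial components

def classify(name):
    parts = name.split('.')
    if len(parts) == 1:
        return ('contextual', None)        # meaning determined by context
    if parts[0] == 'dns':
        # remaining components are a reversed DNS name
        return ('dns', '.'.join(reversed(parts[1:])))
    if parts[0] in REGISTRY:
        return ('registered', parts[0])    # look up its well-known URI
    return ('unregistered', None)

print(classify('foo'))                   # ('contextual', None)
print(classify('dns.com.example.foo'))   # ('dns', 'foo.example.com')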

Some other people’s thinking on this that I’ve found helpful: Mark Nottingham, Jeni Tennison, Tim Bray (and the rest of that xml-dev thread).

2009-01-17

RELAX NG and xml:id

One part of the vision underlying RELAX NG is that validation should not be monolithic: it is not necessary or desirable to have one schema language that can handle every possible kind of validation you might want to do; it is better instead to have multiple specialized languages, each of which does one kind of validation really well. Consistent with this vision, RELAX NG provides only grammar-based validation. There's no implicit claim that other kinds of validation aren't useful and important.

One kind of validation that is clearly useful and important and that can't be done by grammars is checking of cross-references. One possibility is to use Schematron for this. The designers of RELAX NG anticipated that there would be a little schema language specialized to this, which would be created as part of the ISO DSDL effort (as part 6); this wouldn't be a million miles from the kind of thing that XSD provides with xs:key/xs:unique/xs:keyref. Unfortunately this hasn't happened yet.

Since DTDs provide ID/IDREF checking and we wanted people to be able to move easily from DTDs to RELAX NG, we felt we had to provide some transitional support for ID/IDREF checking while awaiting the ultimate "right" solution. We therefore provided a separate, optional spec called RELAX NG DTD Compatibility. Amongst other things, this defines a way in which RELAX NG processors can optionally provide DTD-compatible ID/IDREF checking based on the datatypes of attributes declared in the schema. Note that this can't be handled by the XSD datatypes library for RELAX NG, because assignment of types in the schema to values in the instance is not part of the RELAX NG model of validation.

When defining RELAX NG DTD compatibility, we took a fairly hard line about being DTD-compatible. In particular, we made it a requirement that you should be able to generate a DTD subset from the RELAX NG schema that would perform the same type assignment that the process defined by the spec would perform. This creates some problems when you use DTD Compatibility in conjunction with wildcards (which of course aren't a DTD feature). For example:

start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }

will get an error about conflicting ID-types for p/@id.  This is because the schema allows <p> to contain a <p> element with an id attribute that doesn't have type ID. Instead you would have to write:

start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * - p { attribute * { text }*, (any|text)* }

Several years after the DTD compatibility spec was finished, the W3C came out with the xml:id Recommendation. The spec mentions RELAX NG in a non-normative appendix and encourages authors "to declare attributes named xml:id with the type xs:ID". Now on the face of it, this seems pretty reasonable advice.  Unfortunately, from the point of view of the RELAX NG DTD Compatibility spec it's precisely the wrong thing to do.  For example, this

start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:NCName }
any = element * { attribute * { text }*, (any|text)* }

will work perfectly with RELAX NG with or without DTD compatibility. The XML processor does the xml:id checking, and RELAX NG can ignore ID/IDREFs. But if instead you follow the xml:id Recommendation's suggestion and do:

start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }

a RELAX NG validator that implements RELAX NG DTD compatibility will give you an error about conflicting ID-types for p/@xml:id. You might think you could do

start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * - xml:id { text }*, id?, (any|text)* }

but that won't work either, because although you can now write a DTD subset that does equivalent type assignment for p, you can't do it for the other elements.

(The xml:id Recommendation also says in the RELAX NG section that "A document that uses xml:id attributes that have a declared type other than xs:ID will always generate xml:id errors.". I don't see why: the xml:id processor is quite likely to be part of the XML parser, which doesn't know anything about RELAX NG, nor does RELAX NG know anything about xml:id.)

Back when the RELAX NG DTD Compatibility spec came out, I implemented support for the ID/IDREF checking part of DTD Compatibility in Jing.  I also decided to make Jing enforce this by default. There's a -i switch to turn it off. Before xml:id came along, this seemed to work OK: if a schema author specifies ID/IDREF in a RELAX NG schema then they usually want ID/IDREFs to be checked, and RELAX NG DTD Compatibility was the only thing that could do this checking. With xml:id this no longer works well: if you

  • use xml:id
  • declare xml:id attributes as type xsd:ID in the RELAX NG schema
  • use wildcards in your RELAX NG schema
  • don't use any special options to Jing

you are very likely to get an error from Jing.

At first, my plan was simply to change Jing not to enforce DTD Compatibility by default. However, Alex Brown pointed out that this isn't completely satisfactory: people who are coming from DTDs and aren't using xml:id lose the sensible ID/IDREF checking that they might reasonably expect to happen by default. So now I'm thinking that a better solution might be to add two boolean options to Jing, both of which would be enabled by default.

The first option would be to make it a warning rather than an error if the schema does not use ID/IDREF in a DTD-compatible way. (If the schema is DTD-compatible, then duplicate IDs or IDREFs to non-existent IDs would still be errors.)

The second option would tell Jing to be "xml:id aware". This would have several effects.

  • It would require attributes named xml:id to be declared with type xsd:ID (or with the ID type from the datatype library defined by the DTD compatibility spec). This isn't strictly necessary, but it would seem to minimize confusion and be in keeping with the spirit of the xml:id Recommendation. It's slightly tricky to decide what this means with various unusual RELAX NG wildcards. It is obvious that attribute xml:id { text } is an error.  But the following are not all obvious to me:
    • attribute xml:id|id { text }
    • attribute * { text }
    • attribute xml:* { text }
    • attribute *|xml:id { text }
  • When checking whether you can generate an equivalent DTD subset, xml:id attributes would be ignored. In the terms defined by the RELAX NG DTD Compatibility spec, you would ignore xml:id attributes when determining whether the schema is compatible with the ID/IDREF feature.
  • When checking uniqueness of IDs, and when checking IDREFs, an attribute named xml:id would always be treated as an ID attribute.

It might also be a good idea to revise the RELAX NG DTD compatibility spec to be xml:id aware in this way.

2008-12-14

Saxon performance

Michael Kay has posted a link to a new paper of his on Saxon performance. Interesting not just for XQuery/XSLT performance but also for XML performance in general.

2008-11-23

Some thoughts on the Oslo Modeling Language

Microsoft recently introduced Oslo.  Microsoft seems to have designed Oslo to replace some of things it now uses XML for.  Since Microsoft have been one of the biggest supporters of XML, I think it's worth looking at what they've come up with.

Overview of M

The key part of Oslo is the "M" language, which Microsoft calls a "modeling language".  This integrates quite a broad range of functionality:

  • an abstract data model (analogous to the XML Infoset); M uses the term "value" to describe an instance of this abstract data model
  • a syntax for writing values (analogous to XML 1.0)
  • a type system which provides language constructs for describing constraints on values (analogous to XML Schema)
  • language constructs for querying values (analogous to XQuery)

I guess this is a similar range of functionality to SQL (although there are no constructs for doing updates yet).

I'm going to give a brief overview of M as I understand it.  This is based purely on the spec. Please comment if I've misunderstood anything.

Let's start with the abstract data model. There are three kinds of value:

  • simple values
  • collections
  • entities

Simple values are what you would expect (Unicode strings, numbers, booleans, dates etc). An important simple value is null, which is distinct from any other simple value.

A collection is a bag of values: it is unordered, it can have duplicates and it can contain arbitrary values.

An entity is a map from Unicode strings to values; each string/value pair is called a field.

A key feature of the abstract data model is that it's a graph rather than a tree.  When the value of a field is an entity, the field conceptually holds a reference to that entity.  I believe the same goes for collections, but I haven't completely grokked how identity works in M.

Simple values have the sort of syntax you would expect:

  • "foo" is a Unicode string (M calls string values Text)
  • 123 is an integer
  • true is the boolean true value
  • 2008-11-22 is a date
  • null is the null value

A collection is written in curly brackets with items separated by commas: { 1, 2, 3 } is a collection of three integers.

An entity uses a {field-name = field-value} syntax: { x = 2, y = 3 } is an entity with two fields. If the field name isn't an identifier, you have to surround it in square brackets: { [x coordinate] = 2, [y coordinate] = 3}.

You can label an entity or a collection and then use that label to specify a reference to that entity or collection. You label a collection just by putting an identifier before the opening curly brace.  So

  { jack { name = "Jack", spouse = jill }, jill { name = "Jill", spouse = jack } }

would be a collection of two entities, where the spouse field of each entity is a reference to the other entity, and

   loop { loop }

would denote a collection with a single member, which is a reference to itself.

A type in M can be thought of as being the collection of the values that conform to the type, so one way to specify a type is just to explicitly specify that collection. So, we could define a Boolean type like this:

  type Logical { true, false }

In fact, M has predefined types corresponding to each kind of simple value.  For entities, you specify a type by specifying the fields. So if we define Point like this:

  type Point { x; y; }

then any entity that has both an x field and a y field is a Point.  Note that entity types are open: an entity that has a z field as well as an x and a y field would conform to the Point type. You can also specify types for the fields:

type Point {
    x : Number;
    y : Number;
}

Types can be recursive:

type Node {
  left : Node;
  right : Node;
}

Collections can be specified using the * and + operators: Integer* is a collection of zero or more integers. There's also a ? operator, but it doesn't have anything to do with collections: T? is equivalent to the union of T and { null }.

So far, nothing very exciting. What makes M interesting is that it has a rich functional expression language that can be used for describing instances, types and queries. The analogy in the XML world would be the way XPath is used in instances via XPointer, in schema languages like Schematron and XSD 1.1 and in XQuery.

As well as the obvious kinds of expressions on simple values, you can have expressions on collections:

  • C | D is the union of C and D
  • C & D is the intersection of C and D
  • v in C tests whether v is a member of C
  • C <= D tests where C is a subset of D

Most importantly,

  C where E

returns a collection containing those members of C for which E evaluates to true; the keyword value is bound to the member of C for which E is being evaluated.  So

{ 1, 2, 3, 4 } where value % 2 == 0

returns { 2, 4 }.
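
In Python terms, with value playing the role of M's bound keyword, where behaves like a filtered comprehension:

collection = [1, 2, 3, 4]
evens = [value for value in collection if value % 2 == 0]
print(evens)   # [2, 4], matching { 2, 4 } above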

What's really powerful is that expressions can be used on types in a similar way to how they are used on collections.  If we want a type corresponding to even numbers, we can just do:

  type Even : Integer where value % 2 == 0;

You can apply a where expression to an entity type definition to constrain the fields:

type ZeroSum {
   x: Number;
   y: Number;
} where x + y == 0;

You can also do something like:

type Name {
  firstName: Text;
  lastName: Text;
}
type Person {
  dateOfBirth: Date;
} where value in Name;

Fields can have default values, for example:

type Person {
  dateOfBirth : Date? = null;
}

In fact, when the type of a field is nullable (has null as one of its possible values), it automatically gets a default value of null. Similarly, when the type of a field is a collection that may be empty, it automatically gets a default value of an empty collection. This is quite clever: it makes fields declared as ? or * work nicely for both the provider and the consumer; the provider can leave the fields out, as would be natural for the provider, but the consumer always gets a sensible value for the field.

So we've seen how entity types can specify a default value.  But how does an entity get connected with an entity type so that the default value can be created?   Typing in M is structural.  An entity in M doesn't inherently belong to any type other than Entity.  Like a pattern in RELAX NG, a type in M describes a possible shape for a value, which any given value may or may not have, but a value isn't created with any particular type (other than one of the fundamental built in types).

M solves this by having a notion of type "ascription".  An expression "v : T" ascribes the type T to the value v. M is functional, so this doesn't mutate v; rather, it returns a new value that has an augmented view of v as specified by T.  So whereas

  { firstName: "James", lastName: "Clark" }.dateOfBirth

will throw an error,

  ({ firstName: "James", lastName: "Clark" } : Person).dateOfBirth

will return null (assuming Person declares dateOfBirth as nullable). (I would guess type ascription also does coercions on simple values.) Note that this implies that types in M aren't simply collections of values. M also allows entity types to have computed fields, which are virtual fields whose values are computed from the value of other fields.

The last area of M I want to look at is identity.  You can specify what determines the identity for an entity:

type Person {
  firstName: Text;
  lastName: Text;
  dateOfBirth: Date? 
} where identity(firstName, lastName);

This means you can't have two Persons with the same firstName and lastName.  The obvious question is: within what scope?

To answer, we have to look at the top-level layer that M provides.  Values don't exist in isolation.  Everything in M has to exist within a module.  A module is a sort of top-level static entity.  The fields of a module are called "extents".  The scope of an identity constraint is an extent.  So, if you do:

module Employee {
  type Person {
    firstName: Text;
    lastName: Text;
    dateOfBirth: Date? 
  } where identity(firstName, lastName);
  Persons: Person+
}

then it would be an error if within the Persons extent, there were two Person entities with the same firstName and lastName. There's also a way to automatically give a field a unique id:

type Name {
  id : Integer32 = AutoNumber(); 
  firstName: Text;
  lastName: Text;
} where identity id;

I don't really have anything to say about the query part: it's quite close to LINQ.

My thoughts about M

There's quite a lot about M that I like.  Mostly it seems pretty clean.  The type system is powerful. Structural typing is clearly the right approach for something like M.  I like the way that constraints that can be checked statically are seamlessly blended with constraints that will need to be checked dynamically.  Static type checking is good, but there's no need to rub your users' faces in the limitations of your static type checker. (It's interesting to see C# 4.0 moving in a similar direction.)

It's obviously very early days for M, and there's still lots of scope for improvement.  Microsoft's initial implementation of M targets SQL and works a bit like a database.  But clearly there's potential for using something like M in quite different contexts, e.g. for exchanging data on the Web (like what I talked about some time ago with TEDI).  But several aspects of the current design seem to reflect the initial database focus. For example, the current top level wrapper of modules/entities makes sense for a database application of M, but wouldn't work so well if you were using M for exchanging data on the Web.

The spec needs some fleshing out.  Microsoft's current implementation is a long way from implementing the full language.  I am skeptical whether the language as currently specified can be fully implemented.  For example, can you implement the test for whether one type is a subtype of another type so that it works in non-exponential time for two arbitrary types?

I see several major things missing in M, whose absence might be acceptable for a database application of M, but which would be a significant barrier for other applications of M.  Most fundamental is order. M has two types of compound value, collections and entities, and they are both unordered.  In XML, unordered is the poor relation of ordered.  Attributes are unordered, but attributes cannot have structured values. Elements have structure but there's no way in the instance to say that the order of child elements is not significant.  The lack of support for unordered data is clearly a weakness of XML for many applications.  On the other hand, order is equally crucial for other applications.  Obviously, you can fake order in M by having index fields in entities and such like.  But it's still faking it.  A good modeling language needs to support both ordered and unordered data in a first class way.  This issue is perhaps the most fundamental because it affects the data model.

Another area where M seems weak is identity. In the abstract data model, entities have identity independently of the values of their fields.  But the type system forces me to talk about identity in an SQL-like way by creating artificial fields that duplicate the inherent identity of the entity. Worse, scopes for identity are extents, which are flat tables.  Related to this is support for hierarchy. A graph is a more general data model than a tree, so I am happy to have graphs rather than trees. But when I am dealing with trees, I want to be able to say that the graph is a tree (which amounts to specifying constraints on the identity of nodes in the graph), and I want to be able to operate on it as a tree, in particular I want hierarchical paths.

One of the strengths of XML is that it handles both documents and data. This is important because the world doesn't neatly divide into documents and data.  You have data that contains documents and documents that contain data.  The key thing you need to model documents cleanly is mixed text. How are you going to support documents in M?  The lack of support for order is a major problem here, because ordered is the norm for documents.

A related issue is how M and XML fit together. I believe there's a canonical way to represent an M value as an XML document. But if you have data that's in XML how do you express it in M? In many cases, you will want to translate your XML structure into an M structure that cleanly models your data.  But you might not always want to take the time to do that, and if your XML has document-like content, it is going to get ugly.  You might be better off representing chunks of XML as simple values in M (just as in the JSON world, you often get strings containing chunks of HTML).  M should make this easy.  You could solve this elegantly with RELAX NG (I know this isn't going to happen given Microsoft's commitment to XSD, but it's an interesting thought experiment): provide a function that allows you to constrain a simple value to match a RELAX NG pattern expressed in the compact syntax (with the compact syntax perhaps tweaked to harmonize with the rest of M's syntax) and use M's repertoire of simple types as a RELAX NG datatype library.

Finally, there's the issue of standardization.  The achievement of XML in my mind isn't primarily a technical one.  It's a social one: getting a huge range of communities to agree to use a common format.  Standardization was the critical factor in getting that agreement.  XML would not have gone anywhere as a single vendor format. It was striking that the talks about Oslo at the PDC made several mentions of open source, and how Microsoft was putting the spec under its Open Specification Promise so as to enable open source implementations, but no mentions of standardization.  I can understand this: if I was Microsoft, I certainly wouldn't be keen to repeat the XSD or OOXML experience. But open source is not a substitute for standardization.

2008-11-17

What's allowed in a URI?

Java 1.4 introduced the java.net.URI class, which provides RFC 2396-compliant URI handling. I thought I should try to fix Jing and Trang to use this. So I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.

It's convenient to begin with XLink.  Section 5.4 requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as

all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732

If we look at 2.4.3 of RFC 2396 (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:

  • control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
  • space       = <US-ASCII coded character 20 hexadecimal>
  • delims      = "<" | ">" | "#" | "%" | <">
  • unwise     = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Section 3 of RFC 2732 (which modifies RFC 2396 to handle IPv6 addresses)  does indeed allow square brackets by removing them from the 'unwise' set.

Putting these all together, we can distinguish the following categories of characters that are allowed by XLink but not allowed by RFC 2396/RFC 2732:

  1. C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
  2. space (#x20)
  3. disallowed ASCII graphic characters, specifically: <>"{}|\^`
  4. delete (#x7F)
  5. non-ASCII Unicode characters, excluding surrogates #x80-#xD7FF, #xE000-#x10FFFF (XML does not allow #xFFFE and #xFFFF)

Looking at the various XML-related specs, things seem to be nicely aligned:

XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). That seems like a bug to me.  Erratum E39 adds the following to the first paragraph of the spec:

For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.

This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how the "Thus, " follows. In any case, XSLT 2.0 gets it right: it references xs:anyURI.

RFC 2396 has been updated by RFC 3986.  This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.

Next we have the IRI spec, RFC 3987. This defines:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

It adds ucschar to the set of unreserved characters and adds iprivate to what's allowed in the query part of an IRI. The characters in my category 5 that are in neither ucschar nor iprivate are as follows:

  • C1 controls: #x80 - #x9F
  • the 66 Unicode noncharacters: #xFDD0 - #xFDEF, and any code point whose bottom 16 bits are FFFE or FFFF
  • Specials: #xFFF0 - #xFFFD; these fall into three groups, unassigned specials (#xFFF0 - #xFFF8), annotation characters (#xFFF9 - #xFFFB) and replacement characters (#xFFFC - #xFFFD)
  • Language tags: #xE0000 - #xE0FFF

I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters.  You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.
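
Incidentally, the noncharacter test at least is mechanical; here's a tiny Java sketch of it (my own illustration, not from any spec):

    public class Noncharacters {
        // True for the 66 Unicode noncharacters: #xFDD0-#xFDEF, plus the
        // last two code points of each of the 17 planes (i.e. any code
        // point whose bottom 16 bits are FFFE or FFFF).
        static boolean isNoncharacter(int codePoint) {
            return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                    || (codePoint & 0xFFFE) == 0xFFFE;
        }
    }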

RFC 3987 also has the following in Section 3.2:

Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`"

Those characters correspond to my categories 2 and 3. Overall, there are a lot of subtle differences between IRIs and what the XML specs currently allow.

Fortunately there is a draft of a new version of the IRI spec. This introduces Legacy Extended IRI (LEIRI) references, whose grammar redefines ucschar as:

   ucschar        = " " / "<" / ">" / '"' / "{" / "}" / "|"
                     / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                     / %xE000-FFFD / %x10000-10FFFF

which exactly corresponds to my categories 1 to 5.

LEIRIs seem like a very useful innovation.  XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.

Finally we are ready to look at java.net.URI. This allows URIs to contain an additional set of "other" characters, which consists of all non-ASCII characters with the exception of:

  • C1 controls (#x80 - #x9F)
  • Characters with a category of Zs, Zl or Zp

This means that if you want to give a LEIRI, such as an XML system identifier, to java.net.URI, you first need to percent-encode any of the following:

  • the following ASCII graphic characters: <>"{}|\^`
  • C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
  • space (#x20)
  • delete (#x7F)
  • C1 controls (#x80 - #x9F)
  • Characters with a category of Zs, Zl or Zp

All except the first can be tested for with Character.isISOControl(c) || Character.isSpaceChar(c). (Note that it's isSpaceChar, which tests for the Zs, Zl and Zp categories, that's wanted here, not the deprecated isSpace, which matches only a handful of ASCII characters.)

Note that you don't want to blindly percent-encode all non-ASCII characters, because that will unnecessarily make IRIs containing non-ASCII characters unintelligible to humans.
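
Putting the whole recipe together, here's a rough sketch in Java of escaping a LEIRI so that java.net.URI will accept it (my own code, not anything from Jing; the class and method names are made up):

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    public class LeiriEscaper {
        private static final String DISALLOWED_GRAPHIC = "<>\"{}|\\^`";

        // True if java.net.URI would reject this code point.
        private static boolean needsEscaping(int cp) {
            return DISALLOWED_GRAPHIC.indexOf(cp) >= 0
                    || Character.isISOControl(cp)  // C0 controls, delete, C1 controls
                    || Character.isSpaceChar(cp);  // Zs (which includes space), Zl, Zp
        }

        static String escape(String leiri) {
            StringBuilder sb = new StringBuilder();
            leiri.codePoints().forEach(cp -> {
                if (needsEscaping(cp)) {
                    // Percent-encode the UTF-8 bytes of the code point.
                    for (byte b : new String(Character.toChars(cp))
                            .getBytes(StandardCharsets.UTF_8)) {
                        sb.append(String.format("%%%02X", b & 0xFF));
                    }
                } else {
                    sb.appendCodePoint(cp);
                }
            });
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // The space becomes %20, but the non-ASCII character is left
            // alone, so the result stays intelligible to humans.
            System.out.println(new URI(escape("http://example.com/sch\u00E9ma/my doc.xml")));
        }
    }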

2008-11-09

Working on Jing and Trang

I've been back to working on Jing and Trang for about a month now. It would be something of an understatement to say that they were badly in need of some maintenance love.

I started a jing-trang project on Google Code to host future development. There are new releases of both Jing and Trang in the downloads section of the project site. These have been out for about 10 days, and there have been a reasonable number of downloads and no reports of any major bugs, so I think these should be fairly solid.  (Interestingly, the number of downloads of Trang has been running at about twice that of Jing.)

It's been 5 years since the last release, so what new and exciting features are there? Well, actually, in the current release, none.  My work for that release was focused on two areas:

  • getting things to work properly with current versions of Java and other dependencies;
  • getting the source code structure and build system into reasonable shape.

The second was a lot of work. The code base for Jing and Trang had evolved over a number of years, incorporating various bits of functionality that were independent of each other to various degrees; its structure only made any sense from a historical perspective.  The current structure is now nicely modular.  I converted my CVS repository to subversion before I started moving things around, so the complete history is available in the project repository. For people who want to stay on the bleeding edge, it's now really easy to check out and build from subversion.

My natural tendencies are much more to the cathedral than to the bazaar, but I'm trying to be more open.  I'm pleased to say that there are already two committers in addition to myself. There's a commercial XML editor called <oXygen/>, which uses Jing and Trang to support RELAX NG. The main guy behind that, George Bina, has made a number of useful improvements. In particular, he upgraded Jing's support for the Namespace Routing Language to its ISO-standardized version, which is called NVDL (you might want to start with this NVDL tutorial rather than the spec).  This is now on the trunk. The other committer is Henri Sivonen, who has been using Jing in his Validator.nu service.

My goals for the next release are:

  • complete support for NVDL (I think the only missing feature is inline schemas)
  • support for the ISO-standardized version of Schematron
  • customizable resource resolution support (so that, for example, you can use XML catalogs)
  • support for the standard JAXP XML validation API (javax.xml.validation); see the sketch after this list
  • more code cleanup
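
To make that JAXP goal concrete, here's a hypothetical sketch of what validating against a RELAX NG schema through javax.xml.validation could look like, assuming a SchemaFactory gets registered for the RELAX NG namespace URI (the file names here are made up):

    import java.io.File;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class JaxpRelaxNgSketch {
        public static void main(String[] args) throws Exception {
            // JAXP identifies the schema language by URI; this call only
            // works once a SchemaFactory is registered for RELAX NG.
            SchemaFactory factory =
                    SchemaFactory.newInstance("http://relaxng.org/ns/structure/1.0");
            Schema schema = factory.newSchema(new File("schema.rng"));
            Validator validator = schema.newValidator();
            // Throws SAXException if document.xml is invalid.
            validator.validate(new StreamSource(new File("document.xml")));
            System.out.println("document.xml is valid");
        }
    }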

Please use the issue tracker to let me know what you would like.  Google Code has a system that allows you to vote for issues: if you are logged in (a regular Google account will do), each issue is displayed with a check box next to a star; checking this box "stars" the issue for you, which both adds a vote for the issue and gets you email notifications about changes to it.

I haven't started any project-specific mailing lists yet.  For developers, the issue tracker seems to be enough at the moment.  For users, Jing and Trang are within the scope of the existing RELAX NG Users mailing list on Yahoo Groups.

2008-10-17

XML 1.0 5th edition

Rather late in the day, I sent a comment in on the proposed XML 1.0 5th Edition. For some background, read Norman Walsh, John Cowan, David Carlisle and Henry Thompson.

There is a real problem here, and it's partly my fault. If we had had a bit more foresight ten years ago, we would have made the 1st edition of XML 1.0 say what is now being proposed for the 5th Edition. I know that the XML Core WG are trying to do the right thing, but I really don't think this is a good idea.

I think you've got to look at the impact of the change not just on XML 1.0 but on the whole universe of specs that are built on top of XML 1.0.  In an ideal world, all the specs that refer to XML 1.0 would have carefully chosen whether to make a dated or an undated reference to XML 1.0, and would have done so consistently and with a full consideration of the consequences of the choice.  In practice, I don't believe this has happened.  Indeed, before the 5th edition, I believe very few people would have considered that XML might make a fundamental change to its philosophy about which characters were allowed in names while still keeping the same version number.

Even W3C specs don't get this right. In particular, XML Namespaces 1.0 gets completely broken by this (as my comment explains).

Now you can argue that the breakage and chaos that the 5th edition would cause is due to bugs in the specs that reference XML 1.0. But that doesn't make the breakage any less real.

I also have the rather heretical view that the benefits of the change are small.  In terms of Unicode support, what's vitally important is that any Unicode character is allowed in attribute values and character data.  And XML 1.0 has always supported that.  This change is just about the Unicode characters allowed in element and attribute names (and entity names and processing instruction targets).

I see relatively little use of non-ASCII characters in element and attribute names.  A user who is technical enough to deal with raw XML markup can deal with ASCII element/attribute names.  For less technical users who want to see element/attribute names in their native language, using native language markup is not a good solution, because it only allows a document or schema to be localized for a single language. An XML editor can provide a much better solution by supporting schema annotations that allow an element or attribute to be given friendly names in multiple languages.  So a Thai user editing a document using the schema can work with Thai element/attribute names, and an English user working with the same document can see English names.

This is just following basic I18N principles of storing/exchanging information in a language neutral form, and then localizing it when you present it to a particular user. (This is the same reason why it's perfectly OK from an I18N perspective for XML Schema Datatypes just to support one specific non-localized format for dates/times.)

Perhaps this is part of the reason why there was so little enthusiasm for XML 1.1, and why there seems to be little interest in doing the 5th edition change as an XML 1.2.

One case where I can see real value in adopting more permissive rules for names is in XML Schema Datatypes, because this relates to character data and attribute values.  But it seems like you could easily fix this, without any of the problems that the 5th edition would cause, by introducing a couple of new datatypes into XML Schema.

2007-04-11

Validation not necessarily harmful

Several months ago, Mark Baker wrote an interesting post entitled Validation considered harmful. I agree with many of the points he makes but I would draw different conclusions. One important point is that when you take versioning into consideration, it will almost never be the case that a particular document will inherently have a single schema against which it should always be validated. A single document might be validated against:
  • Version n of a schema
  • Version n + 1 of a schema
  • Version n of a schema together with whatever the versioning policy of version n says future versions may add
  • What a particular implementation of version n generates
  • What a particular implementation of version n understands
  • The minimum constraints that a document needs in order to be processable by a particular implementation
Extensibility is also something that increases the range of possible schemas against which it may make sense to validate a document. The multiplicity of possible schemas is the strongest argument for a principle which I think is fundamental:
Validity should be treated not as a property of a document but as a relationship between a document and a schema.
The other important conclusion that I would draw from Mark's discussion is that schema languages need to provide rich, flexible functionality for describing loose/open schemas. It is obvious that DTDs are a non-starter when judged against these criteria. I think it's also obvious that Schematron does very well. I would claim that RELAX NG also does well here, and is better in this respect than other grammar-based schema languages, in particular XSD. First, it carefully avoids anything that ties a document to a single schema:
  • there's nothing like xsi:schemaLocation or DOCTYPE declarations
  • there's nothing that ties a particular namespace name to a particular schema; from RELAX NG's perspective, a namespace name is just a label
  • there's nothing in RELAX NG that changes a document's infoset
Second, it has powerful features for expressing loose/open schemas:
  • it supports full regular tree grammars, with no ambiguity restrictions
  • it provides namespace-based wildcards for names in element and attribute patterns
  • it provides name classes with a name class difference operator
Together these features are very expressive. As a simple example, this pattern

    attribute x { xsd:integer }?, attribute y { xsd:integer }?, attribute * - (x|y) { text }

allows you to have any attribute with any value, except that an x or y attribute must be an integer. A more complex example is the schema for RDF. See my paper on The Design of RELAX NG for more discussion of the thinking underlying the design of RELAX NG.

Finally, I have to disagree with the idea that you shouldn't validate what you receive. You should validate, but you need to carefully choose the schema against which you validate. When you design a language, you need to think about how future versions can evolve. When you specify a particular version of a language, you should precisely specify not just what is allowed by that version, but also what may be allowed by future versions, and how an implementation of this version should process things that may be added by future versions. An implementation should accept anything that the specification says is allowed by this version or may be allowed by a future version, and should reject anything else. The easiest and most reliable way to achieve this is by expressing the constraints of the specification in machine-readable form, as a schema in a suitable schema language, and then using a validator to enforce those constraints (there's a sketch of this at the end of this post). I believe that the right kind of validation can make interoperability over time more robust than the alternative, simpler approach of having an implementation just ignore anything that it doesn't need.
  • Validation enables mandatory extensions. Occasionally you want recipients to reject a document if they don't understand a particular extension, perhaps because the extension critically modifies the semantics of an existing feature. This is what the SOAP mustUnderstand attribute is all about.
  • Validation by servers reduces the problems caused by broken clients. When implementations accept random junk, other implementations inexorably end up generating random junk. If you have a server that ignores anything it doesn't need, then deploying a new version of the server that adds support for additional features can break existing clients. Of course, if a language has a very unconstrained evolution policy, then validation won't be able to detect many client errors. However, by making appropriate use of XML namespaces, I believe it's possible to design language evolution policies that are both loose enough not to unduly constrain future versions and strict enough that a useful proportion of client errors can be detected. I think Atom is a good example.
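
To make the recommendation concrete, here's a rough sketch of driving this kind of validation with Jing's ValidationDriver (the method names are from memory of Jing's API, and the file names are made up, so treat this as an illustration of the principle rather than exact code):

    import org.xml.sax.InputSource;
    import com.thaiopensource.validate.ValidationDriver;

    public class ValidateReceived {
        public static void main(String[] args) throws Exception {
            ValidationDriver driver = new ValidationDriver();
            // Validity is a relationship between a document and a schema:
            // here we choose a schema describing what version n allows plus
            // whatever its versioning policy says future versions may add.
            InputSource schema =
                    ValidationDriver.uriOrFileInputSource("version-n-open.rng");
            InputSource doc =
                    ValidationDriver.uriOrFileInputSource("received.xml");
            if (driver.loadSchema(schema) && driver.validate(doc)) {
                System.out.println("accept: valid against the schema we chose");
            } else {
                System.out.println("reject: invalid against the schema we chose");
            }
        }
    }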