2007-04-11

Validation not necessarily harmful

Several months ago, Mark Baker wrote an interesting post entitled Validation considered harmful. I agree with many of the points he makes but I would draw different conclusions. One important point is that when you take versioning into consideration, it will almost never be the case that a particular document will inherently have a single schema against which it should always be validated. A single document might be validated against:
  • Version n of a schema
  • Version n + 1 of a schema
  • Version n of a schema together with whatever the versioning policy of version n says future versions may add
  • What a particular implementation of version n generates
  • What a particular implementation of version n understands
  • The minimum constraints that a document needs in order to be processable by a particular implementation
Extensibility is also something that increases the range of possible schemas against which it may make sense to validate a document. The multiplicity of possible schemas is the strongest argument for a principle which I think is fundamental:
Validity should be treated not as a property of a document but as a relationship between a document and a schema.
The other important conclusion that I would draw from Mark's discussion is that schema languages need to provide rich, flexible functionality for describing loose/open schemas. It is obvious that DTDs are a non-starter when judged against these criteria. I think it's also obvious that Schematron does very well. I would claim that RELAX NG also does well here, and is better in this respect than other grammar-based schema languages, in particular XSD. First, it carefully avoids anything that ties a document to a single schema:
  • there's nothing like xsi:schemaLocation or DOCTYPE declarations
  • there's nothing that ties a particular namespace name to a particular schema; from RELAX NG's perspective, a namespace name is just a label
  • there's nothing in RELAX NG that changes a document's infoset
Second, it has powerful features for expressing loose/open schemas:
  • it supports full regular tree grammars, with no ambiguity restrictions
  • it provides namespace-based wildcards for names in element and attribute patterns
  • it provides name classes with a name class difference operator
Together these features are very expressive. As a simple example, this pattern
attribute x { xsd:integer }?, attribute y { xsd:integer }?, attribute * - (x|y) { text }
allows you to have any attribute with any value, except that an x or y attribute must be an integer. A more complex example is the schema for RDF. See my paper on The Design of RELAX NG for more discussion of the thinking underlying the design of RELAX NG.

Finally, I have to disagree with the idea that you shouldn't validate what you receive. You should validate, but you need to choose carefully the schema against which you validate. When you design a language, you need to think about how future versions can evolve. When you specify a particular version of a language, you should precisely specify not just what is allowed by that version, but also what may be allowed by future versions, and how an implementation of this version should process things that may be added by future versions. An implementation should accept anything that the specification says is allowed by this version or may be allowed by a future version, and should reject anything else. The easiest and most reliable way to achieve this is to express the constraints of the specification in machine-readable form, as a schema in a suitable schema language, and then use a validator to enforce those constraints. I believe that the right kind of validation can make interoperability over time more robust than the alternative, simpler approach of having an implementation just ignore anything that it doesn't need, for two reasons (a code sketch follows the list):
  • Validation enables mandatory extensions. Occasionally you want recipients to reject a document if they don't understand a particular extension, perhaps because the extension critically modifies the semantics of an existing feature. This is what the SOAP mustUnderstand attribute is all about.
  • Validation by servers reduces the problems caused by broken clients. Implementations that accept random junk lead inexorably to other implementations that generate random junk. If you have a server that ignores anything it doesn't need, then deploying a new version of the server that adds support for additional features can break existing clients. Of course, if a language has a very unconstrained evolution policy, then validation won't be able to detect many client errors. However, by making appropriate use of XML namespaces, I believe it's possible to design language evolution policies that are both loose enough not to unduly constrain future versions and strict enough that a useful proportion of client errors can be detected. I think Atom is a good example.
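
To make the relationship idea concrete, here is a minimal sketch in Python using lxml's RELAX NG support. The schema file names and the scenario are hypothetical; the point is only that the same document can be validated against more than one schema, with different results.

    from lxml import etree

    doc = etree.parse("message.xml")

    # Two schemas for the same language: what version 1 strictly allows,
    # and what version 1 plus its stated extensibility policy allows.
    strict_v1 = etree.RelaxNG(etree.parse("message-v1-strict.rng"))
    open_v1 = etree.RelaxNG(etree.parse("message-v1-open.rng"))

    # A document using a v2 extension might fail the strict schema but
    # pass the open one; validity is a relationship, not a property.
    print(strict_v1.validate(doc))
    print(open_v1.validate(doc))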

2007-04-09

XML and JSON

I've had some useful feedback on my previous post. I need to take a few days to get clearer in my own mind what exactly it is that I'm trying to achieve and find a crisper way to describe it. In the meantime, I would like to offer a few thoughts about XML and JSON. My previous post came off as much too dismissive of JSON. I actually think that JSON does have real value. Some people focus on the ability of browsers to serialize/deserialize JSON natively. This makes JSON an attractive choice for AJAX applications, and I think this has been an important factor in jump-starting JSON adoption. But in the longer term, I think there are two other aspects of JSON that are more valuable.
  • JSON is really, really simple, and yet it's expressive enough for many applications. When XML 1.0 came out it represented a major simplification relative to what it aspired to replace (SGML). But over the years, complexity has accumulated, and there's been very little attention given to simplifying and refactoring the XML stack. The result is frankly a mess. For example, it's nonsensical to have DTD defaulting of attributes based on prefix rather than namespace name, yet this is a feature that any conforming XML parser has to implement. It's not surprising that XML is unappealing to a generation of programmers who are coming to it fresh, without making allowances for how it got to be this way. When you look at the bang for the buck provided by XML and compare it with JSON, XML does not look good. The hard question is whether there's anything the XML community can do to improve things that can overcome the inertia of XML's huge deployed base. The XML 1.1 experience is not encouraging.
  • The data model underlying JSON (atomic datatypes, objects/maps, arrays/lists) is a much more natural model for data than an XML infoset. If you're working in a scripting language and read in some JSON, you directly get something that's quite pleasant to work with; if you read in XML, you typically get some DOM-like structure, which is painful to work with (although a bit of XPath can ease the pain), or you have to apply some complex data-binding machinery. The sketch after this list illustrates the contrast.
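
As a small illustration (mine, not from the post) of this contrast, compare reading the same data as JSON and as XML in Python:

    import json
    from xml.dom.minidom import parseString

    data = json.loads('{"title": "A fine picture", "width": 800}')
    print(data["width"] + 100)   # a plain dict and int, ready to use

    dom = parseString("<pic><title>A fine picture</title><width>800</width></pic>")
    width = int(dom.getElementsByTagName("width")[0].firstChild.data)
    print(width + 100)           # same result, noticeably more ceremony
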
However, I don't think JSON will or should relegate XML to a document-only technology.
  • You can't partition the world of information neatly into documents and data. There are many, many cases where information intended for machine-processing has parts which are intended for human consumption. GData is a great example. The GData APIs handle this in JSON by having strings with HTML/XML content. However, I think the ability of XML to handle documents and data in a uniform way is a big advantage for information of this type.
  • XML's massive installed base gives it an interoperability advantage over any competitive technology. (Unfortunately this also applies to any future cleaned up version of XML.) You don't get the level of adoption that XML has achieved without cost. It requires multiple communities with different objectives to come together and compromise; each community ends up accepting features that are unnecessary cruft from its point of view. This level of adoption also takes time and requires a technology to grow to support new requirements. Adding new features while preserving backwards compatibility often results in a less than elegant design.
  • A range of powerful supporting technologies have been developed for XML. Naturally I have a fondness for the ones that I had a role in developing: XPath, XSLT, RELAX NG. I also can see a lot of value in XPath2, XSLT2 and XQuery. On some days, if I'm in a particularly good mood and I try really hard, I can see value in XSD. Programming languages are more and more acquiring built-in support for XML. Collectively I think these technologies give XML a huge advantage.
  • JSON's primitive datatype support is weak. The semantics of non-integer numbers are unspecified. In XSD terms, are they float, double, decimal or precisionDecimal? Some important datatypes are missing. In particular, I think support for binary data (XSD base64Binary or hexBinary) is critical. Furthermore, the set of primitive datatypes is not extensible. The result is that JSON strings end up being used to encode data that is not logically a string (the sketch after this list shows the workaround). JSON solves the datatyping problem only to the extent that the only non-string datatypes you care about are booleans and integers.
  • JSON does not have anything like XML Namespaces. There are probably many people who see this as an advantage for JSON and certainly XML Namespaces come in for a lot of criticism. However, I'm convinced that the distributed extensibility provided by XML Namespaces is indispensable for a Web-scale data interchange technology. The JSON approach of just ignoring keys you don't understand can get you a long way, but I don't think it scales.
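
Here is a small sketch (my illustration) of the binary-data point above: with no binary primitive, JSON payloads smuggle bytes through base64 strings, and the fact that a field is base64 rather than ordinary text lives only in out-of-band documentation.

    import base64
    import json

    payload = {"name": "pic.jpg",
               "data": base64.b64encode(b"\x89PNG\r\n").decode("ascii")}
    wire = json.dumps(payload)

    received = json.loads(wire)
    raw = base64.b64decode(received["data"])  # type knowledge is out-of-band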

2007-04-06

Do we need a new kind of schema language?

What's the problem?

I see the real pain-point for distributed computing at the moment as not the messaging framework but the handling of the payload. A successful distributed computing platform needs

  • a payload format
  • a way to express a contract that a payload must meet
  • a way to process a payload that may conform to one or more contracts

that is

  • suitable for average, relatively low-skill programmers, and
  • allows for loose coupling (version evolution, extensibility, suitability for a wide variety of implementation technologies).

For the payload format, XML has to be the mainstay, not because it's technically wonderful, but because of the extraordinary breadth of adoption that it has succeeded in achieving. This is where the JSON (or YAML) folks are really missing the point by proudly pointing to the technical advantages of their format: any damn fool could produce a better data format than XML.

We also have to live in a world where XSD is currently dominant as the wire-format for the contract (thank you, W3C, Microsoft and IBM).

But I think it's fairly obvious that current XML/XSD databinding technologies have major weaknesses when considered as a solution to the problem of payload processing for a distributed computing platform. The two basic databinding techniques I see today are:

  • Generating XSD from an implementation in a statically typed language which includes optional annotations; this provides a great developer experience, but from a coupling perspective doesn't seem much of an improvement over CORBA or DCOM. The other problem is that it's tough to do this in a dynamically typed language (absent sophisticated type inference or mandatory annotations).
  • Generating programming language stubs from an XSD which includes optional annotations. This is problematic from the developer experience point of view: there's a mismatch between XML's fundamental structures, attributes and elements, which are optimized for imposing structure on text, and the terms in which developers naturally think of data structures. Beyond this inherent problem, it's hard to author schemas using XSD and even harder to author schemas that have the right loose-coupling properties. And the tooling often introduces additional coupling problems.

This pain is experienced most sharply at the moment in the SOAP world, because the big commercial players have made a serious investment in trying to produce tools that work for the average developer. But I believe the REST world has basically the same problem: it's not really feeling the pain at the moment because REST solutions are mostly created by relatively elite developers who are comfortable dealing with XML directly.

The REST world also takes a less XML-centric view of the world, but for non-XML payload formats (JSON, or property-value pairs) its only solution to the contract problem is a MIME type, which I think is totally insufficient as a contract mechanism for enterprise-quality distributed computing. For example, it's not enough to say "accessing this URI will give you JSON"; there needs to be a description of the structure of the JSON, and that description needs to be machine readable.

Some people propose solving the XML-processing problem by adopting an XML-centric processing model, for which the leading technologies are XQuery and XSLT2. The fundamental problem here is the XQuery/XPath data model. I'm not criticizing the WGs' efforts: they've done about as good a job as could be done given the constraints they were working under. But there is no way it can overcome the constraint that a data model based around XML and XSD is just not a very good data model for general-purpose computing. The structures of XML (attributes, elements and text) are those of SGML, and these come from the world of markup. Considered as general-purpose data structures, they suck pretty badly. There's a fundamental lack of composability. Why do we need both elements and attributes? Why can't attributes contain elements? Why is the type of thing that can occur as the content of an element not the same as the type of thing that can occur as a document? Why do we still have cruft like processing instructions and DTDs? XSD makes a (misguided in my view) attempt to add an OO/programming-language veneer on top. But it can't solve the basic problems, and, in my view, this veneer ends up making things worse not better.

I think there's some real progress being made in the programming language world. In particular I would single out Microsoft's LINQ work. My doubts about it concern its emphasis on static typing. While I think static typing is invaluable within a single, controlled system, I think for a distributed system the costs in terms of tight coupling often outweigh the benefits. I believe this is less the case if the typing is structural rather than nominal. But although LINQ (or at least newer versions of C#) has introduced some welcome structural typing features, nominal typing is still thoroughly dominant.

In the Java world, there's been a depressing lack of innovation at the language level from Sun; outside of Sun, I would single out Scala from EPFL (which can run on a JVM). This adds some nice functional features which are smoothly integrated with Java-ish OO features. XML is fundamentally not OO: XML is all about separating data from processing, whereas OO is all about combining data and processing. Functional programming is a much better fit for XML: the problem is making it usable by the average programmer, for whom the functional programming mindset is very foreign.

A possible solution?

This brings me to the main point I want to make in this post. There seems to me to be another approach for improving things in this area, which I haven't seen being proposed (maybe I just haven't looked in the right places). The basic idea is to have a schema language that operates at a different semantic level. In the following description I'll call this language TEDI (Type Expressions for Data Interchange, pronounced "Teddy"). This idea is very much at the half-baked stage at the moment. I don't claim to have fully thought it through yet.

If you look at the major scripting languages today, I think it's striking that at a very high level, their data structures are pretty similar and are composed from:

  • arrays
  • maps
  • scalars/primitives or whatever you want to call them

This goes for Perl, Python, Ruby, JavaScript and AWK. (PHP's array data structure is a little idiosyncratic.) The SOAP data model is also not dissimilar.

When you drill down into the details, there are of course lots of differences:

  • some languages have fixed-length tuples as well as variable-length arrays
  • most languages distinguish between a struct that has a fixed set of identifiers as keys and a map that can have an unlimited set of keys (though there are often restrictions on the types of keys, for example, to prohibit mutable types)
  • there's a wide variety of primitives: almost all languages have strings (though they differ in whether they are mutable) and numbers; beyond that, many languages have booleans, a null value, and some sort of date-time support

TEDI would be defined in terms of a generic data model that makes a tasteful, restricted choice from these programming languages' data structures: not limiting the choice to the lowest common denominator, but leaving out frills and focusing on the basics and on things that can be naturally mapped into each language. At least initially, I think I would restrict TEDI to trees rather than handle general graphs. Although graphs are important, I think the success of JSON shows that trees are good enough as a programmer-friendly data interchange mechanism.

I would envisage both an XML and a non-XML syntax for TEDI. The non-XML syntax might have a JSON flavour. For example, a schema might look like this:

   { url: String, width: Integer?, height: Integer?, title: String? }

This would specify a struct with four keys: the value of the "url" key is a string; the value of the "width" key is an integer or null, and similarly for "height"; the value of the "title" key is a string or null. You can thus think of the schema as being a type expression for a generic scripting-language data structure.
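
To make the "type expression" reading concrete, here is one value (my illustration) that would conform to the schema above, written as a plain Python structure:

    picture = {
        "url": "http://www.example.com/pic.jpg",
        "width": 800,         # optional: could be omitted
        "height": 600,        # optional: could be omitted
        "title": "A fine picture",
    }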

The key design goal for TEDI would be to make it easy and natural for a scripting-language programmer to work with.

There's one other big piece that's needed to make TEDI work: annotations. Each component of a TEDI schema can have multiple, independent annotations, which may be inline or externally attached in some way. Each annotation has a prefix that identifies a binding. A TEDI binding specification has to be developed for each programming language and each serialization that will be used with TEDI.

The most important TEDI binding specification would be the one for XML. This specifies, for a combination of

  • a TEDI schema,
  • XML binding annotations for the TEDI schema, and
  • an instance of the generic TEDI data model conforming to the schema

which XML infosets are considered correct representations of the instance, and also identifies one of these infosets as the canonical representation. The XML binding annotations should always be optional: there should be a default XML serialization of any TEDI instance.

For example, an instance of the example schema above might get serialized as

<root>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</root>

But with an annotation

  @xml.element(name="picture")
{ url: String, width: Integer?, height: Integer?, title: String? }

it might get serialized as

<picture>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</picture>

Let's try and make this more concrete by imagining what it would look like for a particular scripting language, say Python. First of all, people in the Python community would need to get together to create a TEDI binding for Python. This would work in an analogous way to the XML binding. It would specify, for a combination of

  • a TEDI schema,
  • Python binding annotations for the TEDI schema, and
  • an instance of the generic TEDI data model conforming to the schema

which Python data structures are considered correct representations of the instance, and also identifies one of these data structures as the canonical representation.

The API would be very simple. You would have a TEDI module that provided functions to create schema objects in various ways. The simplest way would be to create one from a string containing the non-XML representation of the TEDI schema, complete with any inline annotations. Any XML and Python annotations would be used; annotations from other bindings would be ignored. The schema object would provide two fundamental operations (a hypothetical usage sketch follows the list):

  • loadXML: this takes XML and returns a Python structure, throwing an exception if the XML is not valid according to the TEDI schema
  • saveXML: this takes a Python structure and returns/outputs XML, throwing an exception if the Python structure is not valid according to the schema
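
Usage might look something like this. This is entirely hypothetical, since TEDI is vaporware; the module and function names are invented for illustration.

    import tedi  # hypothetical module; does not exist

    schema = tedi.schema_from_string(
        '@xml.element(name="picture") '
        '{ url: String, width: Integer?, height: Integer?, title: String? }')

    picture = schema.loadXML(open("picture.xml").read())  # -> Python dict
    print(picture["url"])

    xml = schema.saveXML(picture)  # raises if the structure is invalid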

XML is not the only possible serialization. The JSON community could develop a JSON binding. If you implemented that, then your API would have loadJSON and saveJSON methods as well.

One complication that must be handled in order to make this industrial-strength is streaming. A good first step would be to be able to handle the pattern where the document element contains zero or more header elements, and then a possibly very large number of entry elements, each of which is not large; the streaming solution you would want in this case is for the API to deliver the entries as an iterator rather than an array.
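
As a rough sketch of the iterator shape I have in mind, here is how the header-plus-entries pattern can be streamed today with Python's ElementTree; a TEDI binding could expose the same shape (the element name, file name and process function are placeholders):

    import xml.etree.ElementTree as ET

    def entries(path):
        # Stream <entry> elements one at a time instead of building a list.
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "entry":
                yield elem
                elem.clear()  # release the entry's children to bound memory

    for entry in entries("feed.xml"):
        process(entry)  # hypothetical per-entry handler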

Another challenge in designing the TEDI XML binding is handling extensibility. I think the key here is for one of the TEDI primitives to be an XmlElement (or maybe XmlContent). (This might also be useful in dealing with XML mixed content.) With different TEDI schemas you should be able to get quite different representations out of the same XML document. For a SOAP message, you might have a very generic TEDI schema that represents it as an array of headers and a payload (all being XmlElements); or you might have a TEDI schema for a specific type of message that represented the payload as a particular kind of structure.

This shows how you could fit TEDI into a world where XML is the dominant wire format, but still leverage other more suitable wire formats when appropriate.

But how do you interop with a world that uses XSD as the wire format for contracts? The minimum is to create a tool that can take a TEDI schema with XML annotations and generate an XSD. There'll be limits because of the limited power of XSD (and these will need to be taken into consideration in designing the TEDI XML binding): some of the constraints of the TEDI schema might not be captured by the XSD. But that's a normal situation: there are often complex constraints on an XML document being interchanged that cannot be expressed in XSD.

A more difficult task is to take an XSD and generate a TEDI together with XML binding annotations. This would be one of the main things that would drive adding complexity to the TEDI XML binding annotations. I expect that the work of the XML Schema Patterns for Databinding WG would be valuable input on what was really needed.

In the future, there's still hope that the wire-format for the contract need not always be XSD: WSDL 2.0 makes a significant effort not to restrict itself to XSD; so you could potentially publish a WSDL with both the XSD and the TEDI for a web service.

The closest thing I've seen to TEDI is Paul Prescod's XBind language, but it has a rather different philosophy in that it separates validation from data binding, whereas TEDI integrates them. Another difference is that Paul has written some code, whereas TEDI is completely vaporware at this point.

I'm going to use subsequent posts to try to develop the design of TEDI to the point where it could be implemented; at the moment it's not developed enough to know whether it really holds water. If you find the idea interesting, please help with the design process by using comments to give feedback. I promise to try to keep future posts shorter, but I wanted my first real post to have a bit of meat to it.