2007-04-06

Do we need a new kind of schema language?

What's the problem?

I see the real pain-point for distributed computing at the moment as not the messaging framework but the handling of the payload. A successful distributed computing platform needs

  • a payload format
  • a way to express a contract that a payload must meet
  • a way to process a payload that may conform to one or more contracts

that

  • is suitable for average, relatively low-skill programmers, and
  • allows for loose coupling (version evolution, extensibility, suitability for a wide variety of implementation technologies).

For the payload format, XML has to be the mainstay, not because it's technically wonderful, but because of the extraordinary breadth of adoption that it has succeeded in achieving. This is where the JSON (or YAML) folks are really missing the point by proudly pointing to the technical advantages of their format: any damn fool could produce a better data format than XML.

We also have to live in a world where XSD is currently dominant as the wire-format for the contract (thank you, W3C, Microsoft and IBM).

But I think it's fairly obvious that current XML/XSD databinding technologies have major weaknesses when considered as a solution to the problem of payload processing for a distributed computing platform. The two basic databinding techniques I see today are:

  • Generating XSD from an implementation in a statically typed language which includes optional annotations; this provides a great developer experience, but from a coupling perspective doesn't seem much of an improvement beyond CORBA or DCOM. The other problem is that it's tough to do this in a dynamically typed language (absent sophisticated type inference or mandatory annotations).
  • Generating programming language stubs from an XSD which includes optional annotations. This is problematic from the developer experience point of view: there's a mismatch between XML's fundamental structures, attributes and elements, which are optimized for imposing structure on text, and the terms in which developers naturally think of data structures. Beyond this inherent problem, it's hard to author schemas using XSD and even harder to author schemas that have the right loose-coupling properties. And the tooling often introduces additional coupling problems.

This pain is experienced most sharply at the moment in the SOAP world, because the big commercial players have made a serious investment in trying to produce tools that work for the average developer. But I believe the REST world has basically the same problem: it's not really feeling the pain at the moment because REST solutions are mostly created by relatively elite developers who are comfortable dealing with XML directly.

The REST world also takes a less XML-centric view of the world, but for non-XML payload formats (JSON, or property-value pairs) its only solution to the contract problem is a MIME type, which I think is totally insufficient as a contract mechanism for enterprise-quality distributed computing. For example, it's not enough to say "accessing this URI will give you JSON"; there needs to be a description of the structure of the JSON, and that description needs to be machine readable.

Some people propose solving the XML-processing problem by adopting an XML-centric processing model, for which the leading technologies are XQuery and XSLT2. The fundamental problem here is the XQuery/XPath data model. I'm not criticizing the WGs' efforts: they've done about as good a job as could be done given the constraints they were working under. But there is no way it can overcome the constraint that a data model based around XML and XSD is just not a very good data model for general-purpose computing. The structures of XML (attributes, elements and text) are those of SGML and these come from the world of markup. Considered as general purpose data structures, they suck pretty badly. There's a fundamental lack of composability. Why do we need both elements and attributes? Why can't attributes contain elements? Why is the type of thing that can occur as the content of an element not the same as the type of thing that can occur as a document? Why do we still have cruft like processing instructions and DTDs? XSD makes a (misguided in my view) attempt to add an OO/programming language veneer on top. But it can't solve the basic problems, and, in my view, this veneer ends up making things worse not better.

I think there's some real progress being made in the programming language world. In particular I would single out Microsoft's LINQ work. My doubts on this are with its emphasis on static typing. While I think static typing is invaluable within a single, controlled system, I think for a distributed system the costs in terms of tight coupling often outweigh the benefits. I believe this is less the case if the typing is structural rather than nominal. But although LINQ (or at least newer versions of C#) has introduced some welcome structural typing features, nominal typing is still thoroughly dominant.

In the Java world, there's been a depressing lack of innovation at the language level from Sun; outside of Sun, I would single out Scala from EPFL (which can run on a JVM). This adds some nice functional features which are smoothly integrated with Java-ish OO features. XML is fundamentally not OO: XML is all about separating data from processing, whereas OO is all about combining data and processing. Functional programming is a much better fit for XML: the problem is making it usable by the average programmer, for whom the functional programming mindset is very foreign.

A possible solution?

This brings me to the main point I want to make in this post. There seems to me to be another approach for improving things in this area, which I haven't seen being proposed (maybe I just haven't looked in the right places). The basic idea is to have a schema language that operates at a different semantic level. In the following description I'll call this language TEDI (Type Expressions for Data Interchange, pronounced "Teddy"). This idea is very much at the half-baked stage at the moment. I don't claim to have fully thought it through yet.

If you look at the major scripting languages today, I think it's striking that at a very high level, their data structures are pretty similar and are composed from:

  • arrays
  • maps
  • scalars/primitives or whatever you want to call them

This goes for Perl, Python, Ruby, Javascript, AWK. (PHP's array datastructure is a little idiosyncratic.) The SOAP data model is also not dissimilar.

When you drill down into the details, there are of course lots of differences:

  • some languages have fixed-length tuples as well as variable-length arrays
  • most languages distinguish between a struct that has a fixed set of identifiers as keys and a map that can have an unlimited set of keys (though there are often restrictions on the types of keys, for example, to prohibit mutable types)
  • there's a wide variety of primitives: almost all languages have strings (though they differ in whether they are mutable) and numbers; beyond that, many languages have booleans, a null value, some sort of date-time support

TEDI would be defined in terms of a generic data model that makes a tasteful restricted choice from these programming languages' data structures: not limiting the choice to the lowest common denominator, but leaving out frills and focusing on the basics and on things that can be naturally mapped into each language. At least initially, I think I would restrict TEDI to trees rather than handle general graphs. Although graphs are important, I think the success of JSON shows that trees are good enough as a programmer-friendly data interchange mechanism.

I would envisage both an XML and a non-XML syntax for TEDI. The non-XML syntax might have a JSON flavour. For example, a schema might look like this:

   { url: String, width: Integer?, height: Integer?, title: String? }

This would specify a struct with 4 keys: the value of the "url" key is a string; the value of the "width" key is an integer or null. You can thus think of the schema as being a type expression for a generic scripting language data structure.
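To make the "type expression" reading a little more concrete, here is a minimal Python sketch. It is not part of any TEDI design: it assumes the example schema has already been turned into a plain dict, and the names PICTURE_SCHEMA and conforms are hypothetical. It also assumes that a missing key counts the same as an explicit null.

```python
# Hypothetical in-memory form of the example schema: key -> (Python type, optional?)
PICTURE_SCHEMA = {
    "url": (str, False),
    "width": (int, True),
    "height": (int, True),
    "title": (str, True),
}


def conforms(value, schema):
    """Return True if `value` is a dict that matches the struct `schema`."""
    if not isinstance(value, dict):
        return False
    # A struct has a fixed set of keys, so unknown keys are rejected.
    if any(key not in schema for key in value):
        return False
    for key, (expected_type, optional) in schema.items():
        item = value.get(key)
        if item is None:
            # Assumption: a missing key is treated the same as an explicit null.
            if not optional:
                return False
        elif isinstance(item, bool) or not isinstance(item, expected_type):
            # bool is a subclass of int in Python, so exclude it explicitly.
            return False
    return True
```

With this, conforms({"url": "http://www.example.com/pic.jpg"}, PICTURE_SCHEMA) holds, while a structure whose "width" is the string "640" is rejected.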

The key design goal for TEDI would be to make it easy and natural for a scripting-language programmer to work with.

There's one other big piece that's needed to make TEDI work: annotations. Each component of a TEDI schema can have multiple, independent annotations, which may be inline or externally attached in some way. Each annotation has a prefix that identifies a binding. A TEDI binding specification has to be developed for each programming language and each serialization that will be used with TEDI.

The most important TEDI binding specification would be the one for XML. This specifies for a combination of

  • a TEDI schema,
  • XML binding annotations for the TEDI schema, and
  • an instance of the generic TEDI data model conforming to the schema

which XML infosets are considered correct representations of the instance, and also identifies one of these infosets as the canonical representation. The XML binding annotations should always be optional: there should be a default XML serialization of any TEDI instance.

For example, an instance of the example schema above might get serialized as

<root>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</root>

But with an annotation

   @xml.element(name="picture")
   { url: String, width: Integer?, height: Integer?, title: String? }

it might get serialized as

<picture>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</picture>
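As a rough illustration of how such an annotation could affect serialization, here is a small Python sketch using the standard xml.etree.ElementTree module. The function to_xml and its element_name parameter are hypothetical stand-ins for whatever the XML binding would actually define; "root" as the default name and the omission of null values are assumptions taken from the examples above.

```python
import xml.etree.ElementTree as ET


def to_xml(instance, element_name="root"):
    """Serialize a flat struct (dict) as child elements of one wrapper element.

    `element_name` stands in for the value of a hypothetical
    @xml.element(name=...) annotation; "root" is the assumed default.
    """
    wrapper = ET.Element(element_name)
    for key, value in instance.items():
        if value is None:
            # Assumption: null values are simply omitted from the output.
            continue
        child = ET.SubElement(wrapper, key)
        child.text = str(value)
    return ET.tostring(wrapper, encoding="unicode")


picture = {"url": "http://www.example.com/pic.jpg", "width": None,
           "height": None, "title": "A fine picture"}
default_xml = to_xml(picture)
annotated_xml = to_xml(picture, element_name="picture")
```

The two results differ only in the name of the wrapper element, which is the whole effect of the annotation in this simple case.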

Let's try and make this more concrete by imagining what it would look like for a particular scripting language, say Python. First of all people in the Python community would need to get together to create a TEDI binding for Python. This would work in an analogous way to the XML binding. It would specify for a combination of

  • a TEDI schema,
  • Python binding annotations for the TEDI schema, and
  • an instance of the generic TEDI data model conforming to the schema

which Python data structures are considered representations of the instance, and also identify one of these data structures as the canonical representation.

The API would be very simple. You would have a TEDI module that provided functions to create schema objects in various ways. The simplest way would be to create it from a string containing the non-XML representation of the TEDI schema complete with any inline annotations. Any XML and Python annotations would be used; annotations from other bindings would be ignored. The schema object would provide two fundamental operations:

  • loadXML: this takes XML and returns a Python structure, throwing an exception if the XML is not valid according to the TEDI schema
  • saveXML: this takes a Python structure and returns/outputs XML, throwing an exception if the Python structure is not valid according to the schema
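Here is a rough sketch of what such a schema object might look like for the simplest case, a flat struct of strings and integers. Everything in it is hypothetical: the class names StructSchema and SchemaError, the constructor that takes an already-parsed field table instead of a schema string, and the choice to omit null values on output. Only loadXML and saveXML come from the description above.

```python
import xml.etree.ElementTree as ET


class SchemaError(ValueError):
    """Raised when XML or a Python structure does not match the schema."""


class StructSchema:
    """Hypothetical schema object for a flat struct of strings and integers."""

    def __init__(self, fields, element_name="root"):
        # fields: key -> (Python type, optional?); only str and int are handled.
        self.fields = fields
        self.element_name = element_name

    def loadXML(self, text):
        """Parse XML text into a dict, raising SchemaError if it is invalid."""
        try:
            root = ET.fromstring(text)
        except ET.ParseError as exc:
            raise SchemaError("not well-formed XML: %s" % exc)
        if root.tag != self.element_name:
            raise SchemaError("unexpected element <%s>" % root.tag)
        result = {key: None for key in self.fields}
        for child in root:
            if child.tag not in self.fields:
                raise SchemaError("unexpected element <%s>" % child.tag)
            expected_type, _ = self.fields[child.tag]
            try:
                result[child.tag] = expected_type(child.text or "")
            except ValueError:
                raise SchemaError("bad value for <%s>" % child.tag)
        self._check(result)
        return result

    def saveXML(self, value):
        """Serialize a dict as XML text, raising SchemaError if it is invalid."""
        self._check(value)
        root = ET.Element(self.element_name)
        for key in self.fields:
            if value.get(key) is not None:
                ET.SubElement(root, key).text = str(value[key])
        return ET.tostring(root, encoding="unicode")

    def _check(self, value):
        if not isinstance(value, dict) or set(value) - set(self.fields):
            raise SchemaError("not a struct with the expected keys")
        for key, (expected_type, optional) in self.fields.items():
            item = value.get(key)
            if item is None and not optional:
                raise SchemaError("missing required key %r" % key)
            if item is not None and type(item) is not expected_type:
                raise SchemaError("wrong type for key %r" % key)
```

The point of the sketch is that validation and binding happen in one step: the caller either gets back an ordinary Python dict with integers already converted, or gets an exception.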

XML is not the only possible serialization. The JSON community could develop a JSON binding. If you implemented that, then your API would have loadJSON and saveJSON methods as well.

One complication that must be handled in order to make this industrial-strength is streaming. A good first step would be to be able to handle the pattern where the document element contains zero or more header elements, and then a possibly very large number of entry elements, each of which is not large; the streaming solution you would want in this case is for the API to deliver the entries as an iterator rather than an array.
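The iterator pattern can be sketched in Python with the standard library's incremental parser, xml.etree.ElementTree.iterparse. This is only an illustration under assumptions: the function name iter_entries and the element name "entry" are made up, each entry is treated as a flat struct of strings, and header elements are simply skipped rather than delivered.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, entry_tag="entry"):
    """Yield each entry element's children as a dict, one entry at a time.

    `source` is a filename or a binary file object; `entry_tag` is a
    hypothetical element name chosen for this example.
    """
    # Ask for "start" events too, so the document element can be captured.
    events = ET.iterparse(source, events=("start", "end"))
    _, document_element = next(events)
    depth = 0  # nesting depth below the document element
    for event, element in events:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # Only direct children of the document element count as entries.
        if depth == 0 and element.tag == entry_tag:
            yield {child.tag: child.text for child in element}
            # Drop the processed entry so memory use stays roughly constant.
            document_element.remove(element)
```

A caller would then write something like "for entry in iter_entries(open(path, 'rb'))" and never hold more than one entry in memory at a time.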

Another challenge in designing the TEDI XML binding is handling extensibility. I think the key here is for one of the TEDI primitives to be an XmlElement (or maybe XmlContent). (This might also be useful in dealing with XML mixed content.) With different TEDI schemas you should be able to get quite different representations out of the same XML document. For a SOAP message, you might have a very generic TEDI schema that represents it as an array of headers and a payload (all being XmlElements); or you might have a TEDI schema for a specific type of message that represented the payload as a particular kind of structure.

This shows how you could fit TEDI into a world where XML is the dominant wire format, but still leverage other more suitable wire formats when appropriate.

But how do you interop with a world that uses XSD as the wire format for contracts? The minimum is to create a tool that can take a TEDI schema with XML annotations and generate an XSD. There'll be limits because of the limited power of XSD (and these will need to be taken into consideration in designing the TEDI XML binding): some of the constraints of the TEDI schema might not be captured by the XSD. But that's a normal situation: there are often complex constraints on an XML document being interchanged that cannot be expressed in XSD.
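For the flat struct used in the earlier examples, such a generator could be quite small. The following Python sketch is hypothetical throughout: the function struct_to_xsd, the mapping of TEDI's String and Integer to xs:string and xs:integer, and the use of xs:sequence (which fixes the order of the child elements) are all assumptions, since the XML binding has not been designed yet.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Assumed mapping from TEDI primitive names to XSD built-in types.
XSD_TYPES = {"String": "xs:string", "Integer": "xs:integer"}


def struct_to_xsd(fields, element_name="root"):
    """Build an XSD for a flat struct.

    fields: list of (key, TEDI type name, optional?) in schema order.
    """
    # Make the serializer use the conventional "xs" prefix, which the
    # type attribute values above also rely on.
    ET.register_namespace("xs", XS)
    schema = ET.Element("{%s}schema" % XS)
    element = ET.SubElement(schema, "{%s}element" % XS, name=element_name)
    complex_type = ET.SubElement(element, "{%s}complexType" % XS)
    sequence = ET.SubElement(complex_type, "{%s}sequence" % XS)
    for key, type_name, optional in fields:
        attributes = {"name": key, "type": XSD_TYPES[type_name]}
        if optional:
            # An optional key becomes an element that may be absent.
            attributes["minOccurs"] = "0"
        ET.SubElement(sequence, "{%s}element" % XS, attributes)
    return ET.tostring(schema, encoding="unicode")
```

Even this toy version shows where information gets lost or added: XSD has to commit to an element order that the struct itself may not care about.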

A more difficult task is to take an XSD and generate a TEDI schema together with XML binding annotations. This would be one of the main things that would drive adding complexity to the TEDI XML binding annotations. I expect that the work of the XML Schema Patterns for Databinding WG would be valuable input on what was really needed.

In the future, there's still hope that the wire-format for the contract need not always be XSD: WSDL 2.0 makes a significant effort not to restrict itself to XSD; so you could potentially publish a WSDL with both the XSD and the TEDI for a web service.

The closest thing I've seen to TEDI is Paul Prescod's XBind language, but it has a rather different philosophy in that it separates validation from data binding, whereas TEDI integrates them. Another difference is that Paul has written some code, whereas TEDI is completely vaporware at this point.

I'm going to use subsequent posts to try to develop the design of TEDI to the point where it could be implemented; at the moment it's not developed enough to know whether it really holds water. If you find the idea interesting, please help with the design process by using comments to give feedback. I promise to try to keep future posts shorter, but I wanted my first real post to have a bit of meat to it.

21 comments:

Scott Hudson said...

Welcome to the blogosphere, James! I find your TEDI proposal interesting. I assume this will be geared primarily towards data and web services, not documents, correct? I'm surprised that there is no mention of RelaxNG. How do you think RelaxNG fits into this picture if at all?

Having done a number of DocBook v5 customizations in RNC, I am completely sold on RelaxNG for its elegance and ease of use. It sounds like TEDI may be similar in that respect. My main issue right now, is the lack of tools that support RelaxNG. I wonder if TEDI will run into some of the same issues?

Anonymous said...

Babel.

tony525 said...

Interesting; wish this type of thinking was around a few years ago.

Though at this point, I have to be reminded why I need schemas...I seem to not need them...ok ok i know I do.

cheers, JF

Unknown said...

James,

I'm glad to see you're blogging - should prove enlightening for many of us.

The central problem that JSON faces is its inability to handle multiple identical keys that have an implicit order - i.e., there's no equivalent to

<div>
<a>foo</a>
<b>bar</b>
<a>bat</a>
</div>

that doesn't introduce semantics into the assertions, which in turn implies that JSON structures can be rendered as XML, but not vice versa in the general case.

It's one of the reasons that I see a proposal such as yours to be so interesting, because it may in fact also provide the impetus for establishing a JSONic representation of XML; I see that happening already with compact RNG notation as well as the compact formalism of XQuery.

Of course if this continues we'll all be programming in Haskell by 2020 ;-)

Good luck with TEDI. I'm going to follow it closely.

Anonymous said...

This looks very interesting indeed, but I would not say that this has not been proposed before. The TEDI approach looks to me what others have been working on while calling it "a conceptual model for XML". I would argue that it is the same idea: to come up with a model/schema that maps well to XML structures, but is not as overloaded with the exact markup structure and design as XSD.

Such a model not only would be much more pleasant to work with than the markup-centric XSD, it would also lend itself naturally to different markup mappings through annotations in the model. Furthermore, it could be used to perform model mapping in a much more sophisticated way than the current markup-centric hand-crafted XSLT transformations which tend to break easily when anything in the mapped schemas changes.

The idea of a conceptual model has a lot of appealing characteristics, but it has some cultural problems:

- For markup people (in particular people brought up on a strict diet of DTDs), the idea of some model at a semantic layer above DTDs or XSDs is not very natural.

- For software people, models are in UML or similar languages, and XSD is simply a language for describing one possible serialization of such a model. Having a model which better reflects XML-style structures is not very natural.

For the past years, I have been surprised by how little attention the idea of a conceptual model for XML has attracted, and I think it is because of the dilemma outlined above, which means that the idea is somehow stuck between a rock and a hard place.

But the more data is described in XSD and the more people have to deal with systems exposing their data as XML documents, the more it will become apparent that XSD is not a good modeling language, despite its efforts to have some modeling-like features. I think that at some point in time the pain of having to work with large-scale information integration based on XSDs will just become too bad, and then something will happen.

I think that none of the approaches for conceptual Models for XML so far have been convincing, but in this area there is a great opportunity to do something that eventually will become the ER model of the XML world. Or the UML. Whatever. Something that will guide the mental model of how people think about XML.

Here are two surveys of the field:

Martin Nečaský, Conceptual Modeling for XML: A Survey, In: Václav Snášel, Karel Richta, and Jaroslav Pokorný (Ed.), Proceedings of the DATESO 2006 Annual International Workshop on Databases, Texts, Specifications, and Objects, Desná — Černá Říčka, Czech Republic, April 2006, ISBN 80-248-1025-5.

Arijit Sengupta and Erik Wilde, The Case for Conceptual Modeling for XML, Computer Engineering and Networks Laboratory, ETH Zürich, Zürich, Switzerland, TIK Report No. 244, February 2006.

Anonymous said...

Try again.

This is good. The VOS-list people need to see this. It could help them as they are still in an early phase of development. I copied the Robin Cover synopsis to the VRML/X3D lists but it's a bit late there. The XSD for what was originally a fairly clean syntax and language got muddy in the XMLization.

len bullard

Anonymous said...

interesting post and makes me wonder why nobody ever tried (?!) to develop a schema-by-example language for XML? for example:

<root>

<url>xs:uri</url>

<width>xs:int</width>?

<height>xs:int</height>?

<title>xs:string</title>?

</root>

if it stayed as simple as that it may cover pretty nicely 80/20 sweet spot ..

DRRW said...

James, good to re-connect.

When we worked on RELAX NG together - I kept on about this all - but no one "got it". The good news is that the OASIS CAM work fits solidly into this arena - and is available with a good OSS implementation - see http://www.jcam.org.uk

We would love to have input for CAM v2.0 - the <Extension> mechanism we have accommodates all creativity!!

What we have on the "to do" already is XPath 2 aligning - and then auto-import from XSD / RELAX NG for base typing and cardinality rules.

What we have right now though I believe empowers people to do much of what you are wanting here.

Cheers, DRRW

DRRW said...

Mentioning alternative XSD syntax - and going VERY retro - this is the original eDTD stuff I produced way back when - searching for simpler xsd but better dtd functionality . Combining some of these ideas with JSON (sidebar - JSON just shorthand for fragments for use in AJAX so no heavy parser required - so not supposed to be all-purpose) may be worth revisiting:

http://xml.coverpages.org/bizcodes-edtd-xml.txt

???

Put this down to - the more things change - the more they stay the same...

Anonymous said...

Shameless plug:

http://laurentszyster.be/jsonr/

Regular JSON (or JSONR) is a simple protocol to specify practical patterns for network object models: null, true or false, integer, double and decimal, irregular and regular strings, numeric ranges around and from zero, collections, relations, dictionaries and namespaces.

bblfish said...

why not just use rdf?

Anonymous said...

I'm sorry but I don't really get this. Is this proposal about Yet Another Marshalling Format, layered on top of a restricted set of XML? Another way to attack the problem is by explicitly specifying universal primitives, such as BigInt, DateTime with defined ranges, etc. This would get around half the problems of SOAP interoperability. This will also make it easier to address issues of Nullability in languages where it is not explicitly supported. Another way is to align the primitives closer to RDBMS primitives, since practically, most interop effort is going to involve marshalling RDBMS data.

Unknown said...

James,

Excellent. I've had similar noodlings for a simple data-binding-director feature in Amara [1] 2.0, but my thoughts were even less baked than yours. As such, I'll be watching TEDI's development closely, as inspiration for my own Python/XML binding work.

Aleksander,

For schema-by-example, see Examplotron [2]. And yes, examplotron could pretty easily be a vehicle for something TEDI-like. Eric van der Vlist, are ya listening? :-)

Erik,

I think the basic idea is clearly not new, but I also think there has been little in the way of a coherent but non-academic explanation and demonstration. I also think that thinking in terms of XML *and* JSON provides a touch of needed discipline, and finally that the timing is right for something like TEDI whereas a few years ago, things just hadn't matured to the point where it could take hold. That comment goes for David Webber as well. I think it's a waste of time to have an "I'm so cool, I thought of it first" competition.

As for "Why not use RDF", that's a very incomplete idea. Sure you can express annotations in RDF, but RDF does not provide a mechanism for binding to markup, which is the *entire* point here. BTW, I've been hoping RDF would grow such a mechanism for ages, so I hope this provides a kick. James's sample annotations could easily be reduced to triples if you like, but that would just be a trifle baked with the easy bits, and it wouldn't help with any of the hard problems.

Unknown said...

Oops. Missing from comment #13:

[1] http://uche.ogbuji.net/tech/4suite/amara/
[2] http://www-128.ibm.com/developerworks/xml/library/x-xmptron/

Anonymous said...

Interesting; the idea of a conceptual model has a lot of problems, but it works

Anonymous said...

"Interesting; wish this type of thinking was around a few years ago"

Me too.
For all the work I've done, json / xmlrpc have been much closer to the sweet spot than any of the soap/xsd mess.

Sure there are quirks to both:
- xmlrpc misses NULL (can be fixed by extending the language)
- json misses distinction between int and float, which are separate types on most rdbms/languages (can be fixed by mandating floats to be serialized with a .0)
- no std way to express cyclic references
- both miss a standard definition language
- etc...

Aside from that, I have never seen all the other xml "goodies" used in xml-to-code mappings. The example kurt puts forward is in fact an anti-pattern to me: it would not map into a native struct in most current languages.

The secret for universal interop is, imho, to leave part of the job to the application layer, and not try to cram everything into the serialization format + communication pattern. The reason is: different apps have different business rules / requirements, so most of the high level facilities put in the framework/toolkit will be re-coded by developers anyway.


Maybe an adaptation of RelaxNG to JSON would be a valid alternative to TEDI...

tobe said...

James,

I think we do need a different kind of schema language. I think we also need to go back to basics because XML is moving in the wrong direction when people create schemas defining elements like "property" containing elements "name" and "value". I mean, the tag is already a name and the tag has a value.

I think namespaces should be hierarchical, i.e. if an element is defined as only existing inside another element, that should be reflected as belonging to a sub-namespace.

Also I think the schema needs to show relationships rather than hierarchies. I mean if you submit a "player" element with a "team" element inside or a "team" element with a "player" element inside, you can submit exactly the same information if you wish, but one would make more sense than the other for that particular exchange.

I have made an attempt at this in http://tobe.homelinux.net/xis

Anonymous said...

Interesting idea. But the name TEDI probably should be changed because the acronym is already used in the world of EDI. -FYI

Anonymous said...

My halfly-cooked work called ooRelaxNG may be relevant. However, I have never thought about introducing arrays or other data structures.

Rob said...

I don't get it.
I'm sure I'm betraying my innocence in asking this, but what problem are you guys trying to solve?

Anonymous said...

different thoughts:

1. Just get behind the DSDL work, datatyping seems to me to be adequately solved by DTLL http://dsdl.org/dsdl-5.pdf

2. I thought it was unclear if you were suggesting that less skilled developers should be able to define their schemas? Sure I guess, but I think it should involve the same amount of pain of doing a database schema, which means that it should be handled by someone who understands that area if you're doing something important that must be reused in the future. Which means maybe slightly less pain than with XSD.

3. Are you breaking your own dictum that a markup format should do one thing? You're combining data type description and structural validation, it seems to me. As for the needs of REST schemas for data interchange, I think a lot of the pain there might involve describing the relations between different URIs in the REST application and the meaning of the data returned. Someone said why not just use RDF; anti-RDF as I am, there's nothing right now that seems to address this area.