2007-04-09

XML and JSON

I've had some useful feedback on my previous post. I need to take a few days to get clearer in my own mind what exactly it is that I'm trying to achieve and find a crisper way to describe it. In the meantime, I would like to offer a few thoughts about XML and JSON. My previous post came off much too dismissive of JSON. I actually think that JSON does have real value. Some people focus on the ability of browsers to serialize/deserialize JSON natively. This makes JSON an attractive choice for AJAX applications, and I think this has been an important factor in jump-starting JSON adoption. But in the longer term, I think there are two other aspects of JSON that are more valuable.
  • JSON is really, really simple, and yet it's expressive enough for many applications. When XML 1.0 came out it represented a major simplification relative to what it aspired to replace (SGML). But over the years, complexity has accumulated, and there's been very little attention given to simplifying and refactoring the XML stack. The result is frankly a mess. For example, it's nonsensical to have DTD defaulting of attributes based on prefix rather than namespace name, yet this is a feature that any conforming XML parser has to implement. It's not surprising that XML is unappealing to a generation of programmers who are coming to it fresh, without making allowances for how it got to be this way. When you look at the bang for the buck provided by XML and compare it with JSON, XML does not look good. The hard question is whether there's anything the XML community can do to improve things that can overcome the inertia of XML's huge deployed base. The XML 1.1 experience is not encouraging.
  • The data model underlying JSON (atomic datatypes, objects/maps, arrays/lists) is a much more natural model for data than an XML infoset. If you're working in a scripting language and read in some JSON, you directly get something that's quite pleasant to work with; if you read in XML, you typically get some DOM-like structure, which is painful to work with (although a bit of XPath can ease the pain), or you have to apply some complex data-binding machinery. The sketch below illustrates the contrast.
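As a rough illustration of this contrast, here is a minimal sketch in Python (the "person" record and its field names are invented for the example, not taken from any real API):

    import json
    import xml.etree.ElementTree as ET

    # The same record expressed in JSON and in XML.
    json_text = '{"name": "Alice", "age": 42, "emails": ["a@example.com", "a@example.org"]}'
    xml_text = ('<person><name>Alice</name><age>42</age>'
                '<emails><email>a@example.com</email>'
                '<email>a@example.org</email></emails></person>')

    # JSON maps directly onto native dicts, lists, strings and numbers.
    person = json.loads(json_text)
    print(person["age"] + 1)       # 43 -- already an int
    print(person["emails"][0])     # plain list indexing

    # XML comes back as a tree of element nodes: everything is text until
    # you convert it, and access goes through tree navigation (or XPath).
    root = ET.fromstring(xml_text)
    print(int(root.findtext("age")) + 1)   # must convert explicitly
    print(root.find("emails")[0].text)     # navigate to the node, then take .text

The point is not that the XML version is unworkable, just that the JSON version lands directly in the language's own data structures.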
However, I don't think JSON will or should relegate XML to a document-only technology.
  • You can't partition the world of information neatly into documents and data. There are many, many cases where information intended for machine-processing has parts which are intended for human consumption. GData is a great example. The GData APIs handle this in JSON by having strings with HTML/XML content. However, I think the ability of XML to handle documents and data in a uniform way is a big advantage for information of this type.
  • XML's massive installed base gives it an interoperability advantage over any competitive technology. (Unfortunately this also applies to any future cleaned up version of XML.) You don't get the level of adoption that XML has achieved without cost. It requires multiple communities with different objectives to come together and compromise; each community ends up accepting features that are unnecessary cruft from its point of view. This level of adoption also takes time and requires a technology to grow to support new requirements. Adding new features while preserving backwards compatibility often results in a less than elegant design.
  • A range of powerful supporting technologies have been developed for XML. Naturally I have a fondness for the ones that I had a role in developing: XPath, XSLT, RELAX NG. I also can see a lot of value in XPath2, XSLT2 and XQuery. On some days, if I'm in a particularly good mood and I try really hard, I can see value in XSD. More and more, programming languages are acquiring built-in support for XML. Collectively I think these technologies give XML a huge advantage.
  • JSON's primitive datatype support is weak. The semantics of non-integer numbers are unspecified. In XSD terms, are they float, double, decimal or precisionDecimal? Some important datatypes are missing. In particular, I think support for binary data (XSD base64Binary or hexBinary) is critical. Furthermore, the set of primitive datatypes is not extensible. The result is that JSON strings end up being used to encode data that is not logically a string. JSON solves the datatyping problem only to the extent that the only non-string datatypes you care about are booleans and integers (see the sketch after this list).
  • JSON does not have anything like XML Namespaces. There are probably many people who see this as an advantage for JSON and certainly XML Namespaces come in for a lot of criticism. However, I'm convinced that the distributed extensibility provided by XML Namespaces is indispensable for a Web-scale data interchange technology. The JSON approach of just ignoring keys you don't understand can get you a long way, but I don't think it scales.
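To make the datatype point above concrete, here is a hedged sketch in Python (the record and its field names are invented): binary data has to be smuggled through as a base64 string, and nothing in the JSON itself records that the string is base64 rather than ordinary text, or whether a number is meant as a float, a double or a decimal.

    import base64
    import json

    # Some binary payload (invented example bytes).
    thumbnail = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

    # JSON has no binary type, so the bytes are base64-encoded into a string.
    record = {
        "id": 17,
        "price": 9.95,   # float? double? decimal? JSON doesn't say
        "thumbnail": base64.b64encode(thumbnail).decode("ascii"),
    }
    wire = json.dumps(record)

    # On the receiving side nothing marks "thumbnail" as binary or says how
    # "price" should be interpreted; the application has to know out-of-band
    # and decode by hand.
    received = json.loads(wire)
    assert base64.b64decode(received["thumbnail"]) == thumbnail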

20 comments:

Anonymous said...

JSON can come with a namespace: the URI that you fetched the JSON document from.

Unknown said...

And of course JSON can just be considered a subset of YAML; have you taken a look at YAML?

http://yaml.org/

Kohsuke Kawaguchi said...

> On some days, if I'm in a
> particularly good mood and I try
> really hard, I can see value in
> XSD

:-)

I hate XSD as much as you do, but I think there are pieces where you can see the value more easily. Datatypes, for example.

Anonymous said...

Just to expand on what alan mentioned, YAML, in addition to being a superset of JSON, has been built with language portability in mind. It has well-defined semantics separate from its serialization and parsing. There are two aspects of the YAML semantics that make it particularly well suited for data exchange.

1. YAML documents are a directed graph, which lowers the impedance of mapping YAML data to object-oriented languages. Although the graph must contain a root node, this poses few constraints that will be new to programmers working with OO-based languages (Java object serialization works the same way).

2. YAML nodes may be tagged, enabling data-typing beyond the basic types (sketched below). A TEDI-like binding system could quite easily be defined on top of this, but it's almost unnecessary. YAML applications can interpret a typed struct they don't understand as a Map if they choose. There's also been some work on a schema language and validator for YAML.
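A small sketch of both points, assuming the third-party PyYAML library for Python (the document below is invented): an anchored node referenced by an alias comes back as one shared object, which is the "directed graph, not a tree" behaviour described in point 1, and an explicit tag overrides the default typing as in point 2. Application-defined tags would need a constructor registered with the loader.

    import yaml  # PyYAML, a third-party library

    # An invented YAML document: "&addr" anchors a node, "*addr" aliases it,
    # and "!!str" is an explicit tag overriding default type resolution.
    doc = """
    home: &addr
      street: 10 Main St
      city: Springfield
    billing: *addr    # alias: the same node, not a copy
    zip: !!str 02134  # tagged, so it stays a string instead of a number
    """

    data = yaml.safe_load(doc)

    # The two keys refer to the *same* object, i.e. the document is a graph.
    assert data["home"] is data["billing"]
    assert data["zip"] == "02134"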

As you mentioned previously, XML is here for a while simply due to the breadth of adoption. However, YAML is a pragmatic example of how to engineer a data exchange system that maps simply onto different languages' native types. It's also a great option if you're looking for a JSON++.

Oh, YAML also supports stream-parsing and has less draconian error handling.

Anonymous said...

James -

I wrote a paper on "Refactoring XML"
(http://idealliance.org/papers/dx_xmle04/papers/04-03-02/04-03-02.html - yikes, the formatting here has lost the newlines), which proposed de-cluttering the XML data model to achieve something of what I think you're after (and this paper quoted you a lot!). When I gave it at XML Europe 2004, I got a resounding raspberry from some of the XML luminaries in attendance ("worst idea of the entire conference").

To have any chance of widespread adoption, the kind of effort you outline would probably need to be serialisable as XML and (so) work with XML APIs. It'd also need to handle mixed content, IMO.

XML needs to have done to it what SGML got done to it by XML :-)

This is something I'd like to see on a standards track ...

M. David Peterson said...

@roberthahn,

re: "JSON can come with a namespace: the URI that you fetched the JSON document from."

I can understand this line of thinking: if what I am after is the ability to discern where a dataset came from, then from a mashup mentality this does present a nice way to distinguish what is what.

That said, namespaces in XML provide quite a bit more functionality than just the ability to "locate" its origin or apply a GUID to a dataset. For example, take a look at the base transformation file for transforming Atom feed files from various source locations into XHTML.

With namespaces in XML I have the ability to take an Atom feed (or any XML infoset, for that matter) from any source on the planet, regardless of its originating location, and know that if the feed is bound to the namespace allocated for an Atom feed, and that feed conforms to the Atom Syndication Format RNG schema, then I can confidently transform that data into XHTML, knowing that in doing so the chances are pretty high that the result will be what I expect it to be.

This is where JSON falls short: the "contract first" capabilities of XML + Namespaces + pick-your-preferred schema language provide an extensive amount of capability, something JSON doesn't have an answer for.
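As a rough sketch of what that namespace binding buys in practice, here's a minimal example using Python's ElementTree (the feed fragment is invented, but the namespace URI is the one the Atom spec defines): elements are matched by their full namespace name, so the same code recognises Atom elements in any feed, whatever prefix it happens to use, and elements from other namespaces are never confused with Atom's.

    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"   # Atom's namespace name

    # An invented feed fragment; any prefix (or a default namespace) would
    # work identically, because matching is on the namespace name.
    feed_xml = """
    <a:feed xmlns:a="http://www.w3.org/2005/Atom">
      <a:title>Example feed</a:title>
      <a:entry><a:title>First post</a:title></a:entry>
      <a:entry><a:title>Second post</a:title></a:entry>
    </a:feed>
    """

    root = ET.fromstring(feed_xml)

    # Elements are addressed as {namespace}localname, regardless of prefix.
    for entry in root.findall(ATOM + "entry"):
        print(entry.findtext(ATOM + "title"))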

Anonymous said...

@M. David Peterson

100% in agreement. As I'm sure you know, the issue of transformation integrity doesn't really matter in JSON because it lacks a key characteristic: the ability to specify different languages (dare I use DSL?) over the same syntax. If you need a non-JSON representation of the data, then you can choose your favourite templating engine in your favourite language to make the transformation.

Transformation integrity may be somewhat sacrificed (at least, it's a lot more work to preserve it), but there's no denying that it's easier to process the data once your JSON library translates the payload into your language's native data structure.

It doesn't make sense to apply XML thinking to JSON. If you do, you'll find all kinds of reasons why JSON doesn't live up to expectations.

Brantley Harris said...

More and more I've grown to loathe XML. Most of the time I just want to transfer data from one language to another. For that, JSON can't be beat. XML basically embeds contextual information in the data; that's absolutely antithetical to how I think: context should be squarely in the realm of the code.

Anonymous said...

I would agree that there would be a huge barrier to overcome to 'revise' XML, probably to the point that creating a brand new language with a different name and next to none of the baggage would be appropriate.

George A. Maney said...

JSON is OK, but punctuation is hardly the problem. Note that the Internet document web is information-unsafe. Even so, it works great. The Internet data web is different. Any workable, worthwhile Internet data web must be intrinsically information-safe.

All Internet data web architectures so far start with metaphysical information modeling architectures. Today all mainstream information modeling is metaphysical. Entity-relationship and object-oriented are the predominant forms. RDF and all alternatives are just alternative flavors of metaphysical pattern modeling.

Metaphysical modeling is intrinsically low quality and thus intrinsically unsafe. This is readily demonstrable. So metaphysical information modeling mashup interoperability, insurability, and immortality cannot be modeled or managed. This is a killer. It eliminates nearly all customer value potential in data web model mashups.

Today's best institutional data processing operations are a sanity check. These are severely limited in scope and scale by workable information safety and quality limits. Model mashups within and among software packages require ruinously expensive recurrent reverse engineering. Most high-value mashups are impractical or infeasible. Those mashups that are done often suffer from reliability problems.

Any workable data web build-out will, in effect, be a huge worldwide data center. This will be millions of times larger than the largest data center operations today. It will involve many thousands of independent modeling contexts and many millions of models. This simply isn't going to fly with any metaphysical information modeling approach.

Today alternative mechanistic subject modeling methods are limited to the applied-science automation software world. These scale without limit and provide fully manageable information safety and quality. Any workable Internet data web must and will ultimately use these alternative methods.

Today these mature methods are unknown in the mainstream software world. There is no commodity infrastructure support for this sort of modeling. Moreover, this sort of modeling is incompatible with the huge legacy of SQL RDB data maintained today.

So for the foreseeable future the Internet data web will be limited to a relatively small range of tactical applications that can tolerate information unsafety. These will provide some trivial value. The mother lode of Internet data web innovation value, amounting to at least a trillion dollars in financial market capitalization, will remain far out of reach.

bblfish said...

If you are going to speak about JSON and YAML, then one may as well mention N3, the readable RDF syntax, which also comes with a very nice rules notation.

Concerning JSON, I should mention that most SPARQL endpoints can return JSON representations of the result set.

As far as security of information goes, the way to do that in the Semantic Web is to keep track of where you get your information from. This is often referred to as named graphs. You don't need to trust all the data you read. It's really up to you. That's how we deal with information every day in this world. In fact, it is quite easy to imagine how to build a simple yet practical address book with such potentially unreliable information.

jasonwatkinspdx said...

@ George A. Maney

Do you have any references to the information modeling methods you're talking about?

Anonymous said...

I was a fan of markup from way back. I pleaded for markup in the VRML 1.0 days. I pleaded for markup when VRML97 was created.

Other than getting to reuse the millions of lines of code available for processing raw XML, I was dead wrong.

1. Some object models are not document object models in any meaningful way past pushing syntax-conformant glops of strings across a network, extracting strings from one glop and pushing it into another glop.

2. Once you stop looking at the glop schlepping, you realize that a more compact glop syntax is easier on the eyes and the fingers. As you said, anyone can design a better syntax. We kept telling ourselves against our better angels that no one uses PFE to do this, and yet we still do it every day. With VRML this is particularly true. Once the models are built and the big long honkin' strings of vector values are frozen, 85% of the work is done in the ASCII editor cutting, pasting, and writing Javascript. With the VRML97 and Classic VRML curly syntax, this is easy. With the XML syntax, this is grotesque.

3. An object model for a real-time 3D graphics system isn't really a tree. It is a graph. The root in X3D is there just to make XML happy. VRML is a set of objects that are type-constrained because the whole idea is based on type-compatible event cascades through access types (eventIns and outs) and the getting and setting of exposed field values. Rendering and behavioral fidelity are equal requirements in these systems. Notions like CSS don't exist and shouldn't.

The result of hammering VRML97 into XML was an XSD that only a mother could love. Its only use IS validation. Binding is nearly impossible because the need to preserve the original clean legacy VRML97 led to an XSD that exposes the original curly syntax constraints in the XSD. Also, because VRML97 was an object language to begin with and fields CAN contain nodes, the infamous impedance mismatch is up front and ugly in the XML.

One wonders if your speculations will result in a unified, cleaner way for the next generation of real-time graphics apps to advance. We are stuck for a while, but I do wonder if a language that put the types front and center might not be better.

len bullard

Anonymous said...

json is fine if all the datatypes you ever want to express are those allowed in javascript.

json is javascript. you cannot do anything with json that is not allowed under the ecmascript spec, because that is where json as we know it is extracted from.

json cannot define recursive data structures, for example. you may create an object literal with members that look alike, but they are not typed as such (this is not in dispute by json devotees).

there have also been some json exploits published recently. xml isn't being executed, which is another advantage.

Anonymous said...

> "xml isn't being executed",
yet. See XSL Transformations
in the April 2007 issue of Dr. Dobb's.

Florent Georges said...

Hi

FYI, speaking of JSON and XML, it is now possible to handle JSON in standard XSLT 2.0 with the help of FXSL (http://fxsl.sf.net), which now contains a JSON parser written in XSLT.

BTW, I'm glad to hear from you again, James!

Regards,

--drkm

Anonymous said...

Apart from the fact that X4j is not widely supported yet, why not extend JSON with XML literals? You could then send pure XML, pure data, or a judicious mixture, the idea being that XML is reserved for sending structured text.

The proposed Javascript 4 has the datatypes you mention, so perhaps a future JSON could do the job.

BTW, the captcha does not appear in Firefox, only IE.

Rob said...

If the VRML guy is still listening, do you think you could re-explain your epiphany? I'm unfamiliar with "PFE" and I don't know what the curly syntax looked like or why it was good.

But I always wondered about VRML. 3D graphics have always lived or died on how well the application performed... abstracting a text-based syntax that no one is ever going to be able to hand write or read, and then giving it an atrocious UI just never sounded very promising. But the X3D people are still using it, so maybe they know something I don't.

Or maybe MS will kill them w/ Silverlight.

Unknown said...

I would like to invite you to my page about xml alternatives for data:
http://www.geocities.com/charles.debon

Anonymous said...

It seems to me, looking at JSON, that some programmer who likes terse languages looked at XML and said "this is far too easy to read, let's get something that looks more archaic" - and invented JSON.

There seems to be this idea that "verbose is bad" - the same type of developer who rejects an easy-to-read language like Visual Basic in favour of something where you have to carefully scrutinise every line, like C#.