Twitter and Foursquare recently removed XML support from their Web APIs, and now support only JSON. This prompted Norman Walsh to write an interesting post, in which he summarised his reaction as "Meh". I won't try to summarise his post; it's short and well worth reading.
From one perspective, it's hard to disagree. If you're an XML wizard with a decade or two of experience with XML and SGML before that, if you're an expert user of the entire XML stack (eg XQuery, XSLT2, schemas), if most of your data involves mixed content, then JSON isn't going to be supplanting XML any time soon in your toolbox.
Personally, I got into XML not to make my life as a developer easier, nor because I had a particular enthusiasm for angle brackets, but because I wanted to promote some of the things that XML facilitates, including:
- textual (non-binary) data formats;
- open standard data formats;
- data longevity;
- data reuse;
- separation of presentation from content.
If other formats start to supplant XML, and they support these goals better than XML, I will be happy rather than worried.
From this perspective, my reaction to JSON is a combination of "Yay" and "Sigh".
It's "Yay", because for important use cases JSON is dramatically better than XML. In particular, JSON shines as a programming language-independent representation of typical programming language data structures. This is an incredibly important use case, and it would be hard to overstate how appallingly bad XML is for it. The fundamental problem is the mismatch between programming language data structures and XML's element/attribute data model. This leaves the developer with three choices, all unappetising:
- live with an inconvenient element/attribute representation of the data;
- descend into XML Schema hell in the company of your favourite data binding tool;
- write reams of code to convert the XML into a convenient data structure.
By contrast, with JSON, especially in a dynamic programming language, you can get a reasonable in-memory representation just by calling a library function.
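In JavaScript, for example, that library call is JSON.parse, standard since ECMAScript 5:

```javascript
// One library call turns the wire format into ordinary objects and arrays.
const payload = '{"user": {"name": "alice", "tags": ["admin", "staff"]}}';
const data = JSON.parse(payload);

// The result is a plain data structure; no tree-walking API required.
console.log(data.user.name);    // "alice"
console.log(data.user.tags[1]); // "staff"
```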
Norman argues that XML wasn't designed for this sort of thing. I don't think the history is quite as simple as that. There were many different individuals and organisations involved with XML 1.0, and they didn't all have the same vision for XML. The organisation that was perhaps most influential in terms of getting initial mainstream acceptance of XML was Microsoft, and Microsoft was certainly pushing XML as a representation for exactly this kind of data. Consider SOAP and XML Schema; a lot of the hype about XML and a lot of the specs built on top of XML for many years were focused on using XML for exactly this sort of thing.
Then there are the specs. For JSON, you have a 10-page RFC, with the meat being a mere 4 pages. For XML, you have XML 1.0, XML Namespaces, XML Infoset, XML Base, xml:id, XML Schema Part 1 and XML Schema Part 2. Now you could actually quite easily take XML 1.0, ditch DTDs, add XML Namespaces, xml:id, xml:base and XML Infoset and end up with a reasonably short (although more than 10 pages), coherent spec. (I think Tim Bray even did a draft of something like this once.) But in 10 years the W3C and its membership have not cared enough about simplicity and coherence to take any action on this.
Norman raises the issue of mixed content. This is an important issue, but I think the response of the average Web developer can be summed up in a single word: HTML. The Web already has a perfectly good format for representing mixed content. Why would you want to use JSON for that? If you want to embed HTML in JSON, you just put it in a string. What could be simpler? If you want to embed JSON in HTML, just use <script> (or use an alternative HTML-friendly data representation such as microformats). I'm sure Norman doesn't find this a satisfying response (nor do I really), but my point is that appealing to mixed content is not going to convince the average Web developer of the value of XML.
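The "put it in a string" route really is that simple; a minimal sketch:

```javascript
// Mixed content travels as an opaque string inside an ordinary JSON value.
const article = {
  title: "XML vs the Web",
  body: "<p>JSON carries the <em>data</em>; HTML handles the mixed content.</p>"
};
const wire = JSON.stringify(article);

// On the receiving end the markup comes back intact, ready for innerHTML.
const parsed = JSON.parse(wire);
console.log(parsed.body.startsWith("<p>")); // true
```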
There's a bigger point that I want to make here, and it's about the relationship between XML and the Web. When we started out doing XML, a big part of the vision was about bridging the gap from the SGML world (complex, sophisticated, partly academic, partly big enterprise) to the Web, about making the value that we saw in SGML accessible to a broader audience by cutting out all the cruft. In the beginning XML did succeed in this respect. But this vision seems to have been lost over time, to the point where there's a gulf between the XML community and the broader Web developer community; all the stuff that's been piled on top of XML, together with the huge advances in the Web world in HTML5, JSON and JavaScript, have combined to make XML perceived as an overly complex, enterprisey technology that doesn't bring any value to the average Web developer.
This is not a good thing for either community (and it's why part of my reaction to JSON is "Sigh"). XML misses out by not having the innovation, enthusiasm and traction that the Web developer community brings with it, and the Web developer community misses out by not being able to take advantage of the powerful and convenient technologies that have been built on top of XML over the last decade.
So what's the way forward? I think the Web community has spoken, and it's clear that what it wants is HTML5, JavaScript and JSON. XML isn't going away but I see it being less and less a Web technology; it won't be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.
In the short-term, I think the challenge is how to make HTML5 play more nicely with XML. In the longer term, I think the challenge is how to use our collective experience from building the XML stack to create technologies that work natively with HTML, JSON and JavaScript, and that bring to the broader Web developer community some of the good aspects of the modern XML development experience.
42 comments:
Thanks for your thoughtful post. XML certainly could be simpler, and I'd like to see some of those changes happen.
However I'm not sure I agree with what seems to be your conclusion that we should be reinventing schema languages, XSLT, XQuery, XProc etc. for JSON.
The main advantage of JSON that you point out is that it easily serializes and deserializes the native data structures of Javascript (and to a certain extent, other dynamic languages). In that case, doesn't that point to the need to ditch the DOM, replacing it with language native APIs that are easier to use?
The real impetus for JSON was that browser treatment of XML data structures was clunky and awkward because of the assumption that XML was only for content. If it weren't for that, JSON could have been a very simple and fixed XML schema such as:
start = list | object
value = item | list | object
list = element list { value* }
object = element object { (key, value)*}
key = element key {text}
item = element item {type?, text}
type = attribute type {"number" | "boolean" | "null"}
It would have been no trouble to have a library function that accepted that particular schema, or some close variant of it, and delivered a language-specific in-memory representation.
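To make the mapping concrete, here is a sketch in JavaScript of a serializer targeting that hypothetical fixed vocabulary (the element and attribute names come from the schema above; the escaping and function names are my own, and minimal):

```javascript
// Serialize a JavaScript value into the fixed <list>/<object>/<item> vocabulary
// sketched by the schema above. Escaping is minimal; this is illustrative only.
function esc(s) {
  return String(s).replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

function toXml(value) {
  if (Array.isArray(value)) {
    return "<list>" + value.map(toXml).join("") + "</list>";
  }
  if (value !== null && typeof value === "object") {
    return "<object>" + Object.entries(value)
      .map(([k, v]) => "<key>" + esc(k) + "</key>" + toXml(v))
      .join("") + "</object>";
  }
  if (typeof value === "string") return "<item>" + esc(value) + "</item>";
  if (value === null) return '<item type="null"/>';
  return '<item type="' + typeof value + '">' + value + "</item>";
}

console.log(toXml("hi")); // <item>hi</item>
console.log(toXml({ a: [1, true, "x"] }));
```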
Speaking as a member of (but not officially for) the XML Core WG: the downside of delivering a unified document is primarily that it's a lot of work for little gain. DTDs might well go, but parser writers have already paid the price of them. XML 2.0 would have to deliver some really new and compelling features in order to make it worth the cost of adoption, and no one has proposed any such features.
This, I think, is the heart of the matter:
a big part of the vision was about bridging the gap from [SGML to the Web]… In the beginning XML did succeed in this respect. But this vision seems to have been lost sight of over time to the point where there's a gulf between the XML community and the broader Web developer community; all the stuff that's been piled on top of XML [makes] XML be perceived as an overly complex, enterprisey technology
I've been saying for some time now that the same vendors and remora-like consultants, who turned SGML into an impenetrable morass for small low-budget shops, were doing the same to XML. It appears that they have succeeded.
Vendors want lock-in, and the only way to accomplish that when you're using a simple non-proprietary file format is to add layers of complexity: huge DTDs, style sheets that require a 500-page manual, databases, etc. etc. Consultants go right along with it, because all that complexity means it's not going to make sense to anyone who isn't an IT wizard.
Maybe what's required is a periodic reboot of XML. Strip all the cruft and start over with simple (and replaceable) over-layers that support the Web with minimal overhead. Then reboot again five years later after the vendors manage to screw that up.
It will swing full circle, it has to swing full circle, because when you start with JSON, you eventually find you need attributes on certain values (such as language, datatype and namespace), then you build schemas to support that for different domains, then you standardize, and you end up with XML or something else like Turtle or N3.
Most of this is determined by how people can access the data once it's deserialized, and JSON of course maps right to javascript objects (or native objects in most languages), so it simply wins hands down in the interim.
If however one can map XML (or any other serialization) directly to a friendly javascript structure with nice properties "name" rather than full URIs or namespaced properties, then it's a win all round for all communities.
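A minimal sketch of what such a mapping might look like, assuming the XML has already been parsed into a generic {name, children, text} node shape (all names here are invented for illustration):

```javascript
// Hypothetical: collapse a generic parsed-XML node into a plain object keyed
// by local element name, so data.person.name "just works" like JSON access.
function toFriendly(node) {
  if (node.children === undefined || node.children.length === 0) {
    return node.text ?? "";
  }
  const out = {};
  for (const child of node.children) {
    const value = toFriendly(child);
    if (child.name in out) {
      // Repeated elements become arrays.
      if (!Array.isArray(out[child.name])) out[child.name] = [out[child.name]];
      out[child.name].push(value);
    } else {
      out[child.name] = value;
    }
  }
  return out;
}

const parsed = {
  name: "person",
  children: [
    { name: "name", text: "Nathan" },
    { name: "lang", text: "en" },
    { name: "lang", text: "fr" }
  ]
};
console.log(toFriendly(parsed).name);    // "Nathan"
console.log(toFriendly(parsed).lang[1]); // "fr"
```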
Most "web developers" don't really care what format the data is transferred or published in, they care about how they can access it, how many hoops they have to jump through, and how fast the serialization/deserialization process is.
A telling sign is that node.js doesn't even have XML support of its own, and the current extensions don't even work. So it's JSON or custom formats all the way.
As the silos break back down, and interoperability comes back to the fore, so will these issues - thus, I'd suggest that it's probably time to get working on the object-view abstractions of universal data formats, and the plan for when the *** hits the fan in the near future. There's a general "we're content in our ignorance" going on at the minute around the HTML 5 + JS sectors, which will explode soon.
Great post,
Best,
Nathan
ps: what happened to EXI?
XML-SW: http://www.textuality.com/xml/xmlSW.html
There is additional discussion between Norm Walsh and myself about his "Deprecating XML" post and why JSON is "winning" here: Web Services: JSON vs. XML
In the longer term, I think the challenge is how to use our collective experience from building the XML stack to create technologies that work natively with HTML, JSON and JavaScript, and that bring to the broader Web developer community some of the good aspects of the modern XML development experience.
I believe E4X tried to bridge that gap and hasn't received enough traction from the web dev community. No, it's not perfect, but it is nicer than working with the DOM.
I think the evolution of XML can be held up as a poster child for the 90s/00s penchant for creating standards that are unwieldy, overly complex, and only serve a small audience yet are foisted off on a larger one. During the height of XML madness, everything under the sun was relegated to some form of XML DTD or XML Schema. While this is most likely not a big issue for documents managed by tools (other than the processing time and document sizes), the fallacy of XML being "human-readable" was perpetuated. Also, as near as I can tell, most attempts to move to a compact, binary format were beaten to death.
Data that is only managed by machines doesn't *have* to be textual-only, but it seems that most people can't cope with multiple ways of representing data and are seeking after a non-existent silver bullet data format.
Not using XML for web pages is the developer's loss. I recently created an online ordering system using XML/XHTML throughout, with XSL and, yes, some JSON thrown in. It was pure joy to work with and not have to bother with HTML for the most part. ALL web pages should be created this way, at least by the professional developer.
I remember the days before the PC when the best microprocessor was the 68000 from Motorola. IBM selected the Intel x86 and we've suffered for it since. Such discussion as this of XML vs JSON and HTML reminds me of those days. It makes me sad.
Great analysis James. Thanks for writing it up.
Good post. I think with eyes on the prize, I still want what you say:
*textual (non-binary) data formats;
*open standard data formats;
*data longevity;
*data reuse;
*separation of presentation from content.
XML has failed these needs exactly as you've described things, indirectly, by starting in the right direction, and then piling on so many layers of complexity that the level of interest and implementation was bound to atrophy, stranding some of those who put their trust in XML to meet those needs.
As I mentioned on XML-DEV earlier this month, we can blame the spec-making organization for e.g. the enterprisey XPath 2.0 data model, the AI-zation of RDF and the CORBA-ization of Web services.
Browser developers and the lesscode crowd have rightly turned their back on all this nonsense and continued to focus on a myriad of simple, scrappy specs that live or die based on luck and merit.
I have no problem with JSON, and have used it more and more, but it still lacks 2 key bits: mixed content and attributes. I still think there is room for XML, if there were a way to start afresh with the first round of XML specs, most of which are broadly implemented, even in browsers, such as XML, Namespaces (a fix for xmlns would be nice, but we can live without it if need be), XPath 1.0, XSLT 1.0+EXSLT 1.0 and RELAX NG (OK that's not so well implemented, but I suspect it's because WXS had already come along to put people off schemata).
If there were any such initiative to do this, I'd be happy to participate, help evangelize and implement, and I bet it could get traction, because it builds on what's simple, and mostly already implemented.
Hi James - great post. Thanks for breathing some fresh air into this topic.
Some topics we have been discussing in the TAG quite a bit are the issue of "round-tripping" HTML5 through XML-based workflow tools and using XML namespaces to extend HTML5. These have been some of the drivers for our support of the so-called polyglot spec (XML serialization of HTML5). Do you think these are important use-cases, and what do you think of the HTML5 polyglot work?
Dan
@Dan Appelquist
I think the Polyglot spec is useful but I'm not convinced that it's asking the right question. I want to use XML tools to produce HTML5, but I don't see any benefit in serving them as text/xml rather than text/html, so I think the constraint that a polyglot document produce identical trees whether parsed as XHTML or HTML is unnecessary. It is sufficient for me if the document is well-formed XML and valid HTML5. Tantek Çelik suggested something similar.
James,
I agree that serving the same content as both html and xml is pretty much a non goal, but the danger of just aiming for "well formed xml that is valid html5" is that the parse trees can be so different and lead to nasty surprises, which the polyglot spec is trying to help people avoid.
The worst (which killed the attempt a couple of years back to reformat w3c specs as xhtml served as text/html) is the interaction of the /> syntax with the html parser.
Something innocuous like
<p><a id="a"/> zzz </p>
<p><a id="b"/> yyy</p>
(a style used often by xmlspec for inline definitions) gets parsed as
<p><a id="a"> zzz </a></p><a id="a">
</a><p><a id="a"></a><a id="b"> yyy</a></p>
with the element being repeatedly re-opened.
Personally I'd fix this by making /> mean empty element in html rather than making a polyglot spec that tells people how to avoid the issue, but that's the current situation.
@David
As far as I can tell from the HTML5 spec,
<p><a id="a"/> zzz </p>
isn't valid in the HTML syntax (though it does have a defined parse behavior).
But you're right that my formulation isn't enough.
<p><a id="a"></a> zzz </p>
is XML well-formed and HTML5 valid, but it will be a problem for an XML toolchain because XML tools don't typically treat use of empty tag syntax as semantically significant and so may well turn that into your example, which is not HTML5 valid. The Polyglot spec's formulation also has this problem.
Yes thanks for the correction, that's more or less what I meant (but not what I said:-) Of course with XSLT you don't get to choose and systems typically use /> so this is almost always going to go wrong if served as text/html.
If you use xhtml namespaced elements and xslt2's xhtml output serialization it will get it right more often, but since the use of namespace declarations in html syntax is not exactly encouraged in html5, many people will generate no-namespace html and then xslt2's xhtml method does nothing special.
Of course what is safe and still useful is to use an xml toolchain to generate the document but finally serialise as html. That's OK so long as you know it's the end of the chain and not going to be processed further (or only processed by people with an html parser)
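One hedged sketch of that final serialization step: a text-level fixup that expands empty-tag syntax for non-void elements before the document is served as text/html. A real toolchain would do this on the parsed tree rather than with a regex; the void-element list is HTML5's.

```javascript
// Expand XML empty-tag syntax (<a id="x"/>) into open/close pairs for elements
// that are not HTML void elements, so an HTML parser won't "re-open" them.
// Regex-based and illustrative only; a real serializer works on the tree.
const VOID = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr"
]);

function expandEmptyTags(xml) {
  return xml.replace(/<([a-zA-Z][\w-]*)([^<>]*?)\/>/g, (match, name, attrs) =>
    VOID.has(name.toLowerCase())
      ? "<" + name + attrs + ">"
      : "<" + name + attrs + "></" + name + ">");
}

console.log(expandEmptyTags('<p><a id="a"/> zzz </p>'));
// <p><a id="a"></a> zzz </p>
console.log(expandEmptyTags('line<br/>break'));
// line<br>break
```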
JSON also has another huge advantage over XML: It reduces data traffic significantly. I think this aspect is of extreme relevance when applications have to process thousands and thousands of records.
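A toy comparison makes the point (the XML shape here is one typical hand-rolled encoding, chosen purely for illustration):

```javascript
// Compare the wire size of the same record in JSON and a typical XML encoding.
const json = '{"id":42,"name":"widget","price":9.99}';
const xml  = '<record><id>42</id><name>widget</name><price>9.99</price></record>';

console.log(json.length); // 38
console.log(xml.length);  // 66
```

Multiplied across thousands of records, the per-record overhead of the end tags adds up.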
If the "new" Web is all HTML5/Javascript/JSON, where does ATOM/RSS fit in to this model, will there be a new format with curlies instead of angled brackets?
The essential issues with XML and the Web are, as far I can tell, historical and largely to do with poor process at the W3C.
For example: why can't I use CSS with XML to do everything HTML can do? Why is an "a" tag still the only way to do a link in a browser with pure markup? And the answer is...because the CSS people won't go for it. Or so I've been told.
If XML had been usable on the web a decade ago, in a simple fashion that easily transitioned from CSS, we wouldn't be seeing the HTML5/JSON approach James discusses. The XML technologies in the browser would've matured appropriately.
I worry about the HTML5/JSON approach, frankly. I've talked to a lot of people on both sides of the fence, and the XML people's answer is, "Well, guess we lost that one."
The HTML5 people, on the other hand, appear largely ignorant -- not just of XML's benefits, but of real programming basics. e.g. I heard, not so long ago, a rep for HTML5 claim that it's just as easy to write an HTML parser as it is an XML parser...which is ludicrous hooey, but it's the kind of garbage people are swallowing. Given the mess the HTML5 standards process has been, I suppose I shouldn't be surprised...
@Ben Trafford, I can't disagree with most of what you said, but the CSS + XML statement appears pretty off base: http://www.w3.org/Style/styling-XML.html
I haven't followed the standards process back then, but I would imagine you can't do this because of the browsers, not anything prohibiting it in the standards.
This thread has also been picked up on the xml-dev list, and is spawning some interesting commentary there.
@shoebappa: Look a little more closely. Yes, you can associate a CSS stylesheet with XML, but there is a host of HTML behaviours that remain embedded in HTML. And really, they should have been torn out and put in CSS a long time ago.
An ideal world, in my view:
A browser reads a document. If it sees an HTML-style tag, it defaults to the base HTML behaviour. It then overlays the stylesheet, and applies behaviour based on the stylesheet.
This would allow both for display of pure XML in browsers with all the capabilities of HTML; it would allow for straight HTML; and it would allow for mixed content, where someone wants to introduce an XML document fragment into an otherwise HTML document.
I still can't make a link via CSS. This is a rather silly situation.
As a developer who has used both technologies from their inception...
XML has become too complicated. While DTDs, XSLT, XPath, XQuery, etc. all had good intentions, they have complicated XML until it is no longer practical to use, except for application configuration and passing data between computers ON THE BACKEND.
JSON has the advantage on the frontend of tight integration with Javascript (requiring no additional controls or parsing) and compact data. The same data in JSON is easily half the size of using XML.
And the catalyst for all this is not browsers or app design, but phone web applications! HTML5/CSS3 have been advanced by the browsers that come with iPhone and Android. On the mobile platform, data transmission is expensive... and is why JSON (which has been around for a few years) is now exploding.
As a developer, I will use XML for app config and backend data passing, but JSON now rules for passing data to a (mobile) browser client.
When I look at JSON vs XML, I think of the two as having the same difference as dynamic vs statically typed languages. JSON is gaining traction largely because dynamic languages are gaining traction. When you need something with better guarantees, then you go to XML.
As a web dev, my reaction to seeing anything in XML is usually dismay. I just know that I'm going to have to deal with something horribly overcomplicated, overdesigned, verbose and intractable. Even the most trivial XML config file (e.g. OS X plists) is utterly unsuited to human consumption and usually requires a ridiculously overweight software stack to process properly. SOAP interfaces are a classic example: I've never seen one that I'd class as usable or performant - salesforce.com is just a joke - or that could be used without thousands of lines of library code.
Overall, XML: great idea, awful reality.
I believe E4X tried to bridge that gap and hasn't received enough traction from the web dev community.
E4X is fantastic. Unfortunately, it's only supported in Firefox (and other browsers using the same JS engine), and even there you need to use a special "e4x=1" flag around your scripts (but only sometimes?).
So I would agree with you, if by "web dev community" you mean "people who *write* web browsers". Those of us who write *for* web browsers love E4X, and would use it 1000 times a day, if only it was supported.
It seems all the browser projects got sidetracked for a couple years in the JS Speed Race, where the only thing that matters is how big the bargraph next to your name is in some arbitrary JS benchmark. Which is great for some things, and I can't fault them for it, but it was kind of lousy for making feature progress on JS-the-language.
XML had my support and enthusiasm. I honestly gave it a chance, even served my XSLT-generated website as application/xhtml+xml.
but…
DTD is idiocy.
I facepalmed when I discovered that to write an XML parser (it's strict, so it's easy to parse, right? hello?), one has to write an SGML parser first, and not make it susceptible to a billion-laughs attack.
And draconian handling started to bite me left and right. I had my web site broken by mobile operators' proxies (sure, wouldn't happen in an alternate universe where everyone used XML).
I had my site broken by my own tools (my bad. But my users saw the error, and I didn't!)
I had to parse data feeds from a 3rd party that couldn't wrap their heads around the whole escaping concept. Totally their fault, and they should fix it, says XML. But my boss and my deadlines said otherwise.
Hi James,
I certainly think that it's possible for XML and JSON to co-exist. We've taken advantage of both at skechers.com (check the source and the Net tab of Firebug). XML still seems like a viable technology, and I don't think anyone is necessarily moving away from it.
FWIW, I got around the <a id="a"/> problem by coding my XSL to output <a id="a"><!-- a --></a>.
Sticking that comment in there kept the XML tool from closing it up to an empty element.
JSON is so compact and convenient, but I still find myself sending arrays, as they can be equally compact and much easier to traverse client-side.
hi guys. dumb question: it seems that there's a nice plus to using XHTML if your data has a link structure. while debugging your app, you can navigate around your data by simply pointing your browser to the root. (analogy: file: pages.). the served XHTML is both human and computer consumable.
but if we ditch XML, I think we lose this dual consumable property. james, is this what you were referring to by mixed content? would appreciate some illumination. thx.
It would not be difficult to develop XML support inside browsers that gives JSON-like access: root.tag1.@attr
That is already done in scripting languages like Python and Scala.
The problem with XML in code is that accessing nodes from compiled languages is hard:
- DOM is incredibly poor (getElementsByTagName is simply useless)
- XPath is good but not performant (an evaluation takes several milliseconds).
- deserialization into POJOs is a nightmare; you spend your life as a developer modifying getters and setters
I have seen a very good improvement with the ElementTraversal specification, but it is missing a performant method getChild(String childName) backed by a hash map of the children.
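The wished-for getChild could be sketched like this; the names and node shape here are hypothetical, but the point is the one-time index that makes repeated lookups hash-map hits rather than list scans:

```javascript
// Hypothetical helper: index an element's child elements by name once,
// so repeated getChild() lookups hit a Map instead of scanning a list.
function makeChildIndex(node) {
  const index = new Map();
  for (const child of node.children) {
    if (!index.has(child.name)) index.set(child.name, []);
    index.get(child.name).push(child);
  }
  return {
    getChild: (name) => (index.get(name) ?? [])[0],
    getChildren: (name) => index.get(name) ?? []
  };
}

// Works against any DOM-like shape with { name, children }.
const order = {
  name: "order",
  children: [
    { name: "item", children: [] },
    { name: "item", children: [] },
    { name: "total", children: [] }
  ]
};
const idx = makeChildIndex(order);
console.log(idx.getChildren("item").length); // 2
console.log(idx.getChild("total").name);     // "total"
```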
And all of this stems from HTML not self-closing its tags? That is sad.
There is something of a natural cycle here. As a technology (let's call it that for simplicity) like XML becomes widely used over time, more and more standards appear to extend and refine it. More and more tooling appears to extend its use into additional scenarios. The result is the complexity you lament in the case of XML.
Eventually another technology comes along (witness JSON in this case) to reduce the clutter and focus on some specific scenarios of key concern.
JSON will certainly follow this same path, until another technology appears to reduce the complexity that will by then surround JSON. It's a natural evolution, and the key as developers is to evolve with it.
You can discuss issues like XML vs. JSON on thousands of pages, and it still boils down to the epic decision: safe validated types, or duck typing, which you're also solving when picking a programming language. JSON is just the better choice for a Javascript/Python/Ruby programmer; XML does a better job with Java and C++.
While JSON may be gaining traction, I believe that XML and JSON serve different purposes by design. JSON is a data-structure containment based system. XML is an extension based data representation scheme. In large part I agree that JSON is not well suited for some tasks; however, when it comes to programming, JSON has simple rules compared to the complexities of tag attributes and mixed content. From the many times I have used them, XML has these attributes but does not enforce them nor give a good reason to use them compared to wrapping things in tags. Constantly checking whether I am in a mixed content node (lord knows if someone inserts a tag where there shouldn't be one) or whether an attribute node exists (did it get defaulted by a namespace?), instead of having a consistent means of accessing data, is confusing when deadlines are important. Schemas are amazing, and I wish it were easier to have them supported, but vendors largely will not give correctly formed XML at all times; with JSON the risk is greatly reduced due to lower complexity. Having mixed content and attributes is a strength for developing content in XML, but I largely feel that this added complexity on the programming side is hard to deal with compared to the simple set/map/value structures presented in JSON.
Just thought I'd chime in with a thought on this oft debated subject as I've been developing with XML for the better part of the last 10 years and have also dabbled a bit with JSON over the last 3 years or so.
The attraction to JSON for most web developers is that it was designed with no-frills JavaScript serialization right from the start. It is the right tool for the job because it was designed for a very specific purpose.
XML on the other hand was supposed to be a simplified yet compatible subset of SGML, that vastly complex world of descriptive markup. XML was to solve the complexities of SGML while preserving the benefits. It is flexible enough to describe data between client and server, but that's just one of its capabilities not its specialization.
The way I see it XML and JSON don't necessarily compete with each other, they are equally adept at what they do. Their scope is what is different.
My hope is that JSON doesn't become as arcane and frustrating as XML seems to be for many newcomers, time will tell on that one though.
Speaking as one of many in the Web development community: with SVG getting support in all major browsers (including IE9), I expect my interest in XML to go back up, but that's about it.
EPUB2 uses XHTML 1.1. EPUB3 will use XHTML5 and will not allow non-XML HTML5. BTW, EPUB uses RELAX NG and NVDL.
I've taken a look at the current JSON schema proposals, and it appears to me that they miss the mark. Specifically, it appears to me that they don't allow context-free-grammar specifications as Relax NG does. Perhaps I'm being naïve, but it appears to me that there's a golden opportunity to just strip down the Relax NG syntax to match JSON: get rid of element tags and attributes, formulate a simple syntax for name->value mappings.
The result would appear to be a usable and efficient schema language for JSON.
No?
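To make the idea concrete, here is one invented sketch of Relax-NG-flavoured patterns for JSON, expressed as a tiny validator; every name here is hypothetical, not drawn from any real proposal:

```javascript
// A toy, Relax-NG-flavoured validator for JSON: patterns are predicates on a
// value, combined the way RNG combines patterns (choice, list, etc.).
const str = () => (v) => typeof v === "string";
const num = () => (v) => typeof v === "number";
const choice = (...ps) => (v) => ps.some((p) => p(v));
const list = (p) => (v) => Array.isArray(v) && v.every(p);
const obj = (shape) => (v) =>
  v !== null && typeof v === "object" && !Array.isArray(v) &&
  Object.entries(shape).every(([k, p]) => p(v[k]));

// A "grammar" for a person record, in the spirit of an RNG compact schema.
const person = obj({
  name: str(),
  age: num(),
  emails: list(str())
});

const id = choice(str(), num());

console.log(person({ name: "a", age: 3, emails: ["x@y"] })); // true
console.log(person({ name: "a", age: "3", emails: [] }));    // false
console.log(id(42));                                         // true
```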
"In particular, JSON shines as a programming language-independent representation of typical programming language data structures."
What about references? Isn't this a typical programming language data structure that is missing in JSON?
Hi James,
"In the short-term, I think the challenge is how to make HTML5 play more nicely with XML."
By reference, clearly. On the Web, that means links, link annotations, etc.: http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven.