2007-04-11

Validation not necessarily harmful

Several months ago, Mark Baker wrote an interesting post entitled Validation considered harmful. I agree with many of the points he makes but I would draw different conclusions. One important point is that when you take versioning into consideration, it will almost never be the case that a particular document will inherently have a single schema against which it should always be validated. A single document might be validated against:
  • Version n of a schema
  • Version n + 1 of a schema
  • Version n of a schema together with whatever the versioning policy of version n says future versions may add
  • What a particular implementation of version n generates
  • What a particular implementation of version n understands
  • The minimum constraints that a document needs in order to be processable by a particular implementation
Extensibility is also something that increases the range of possible schemas against which it may make sense to validate a document. The multiplicity of possible schemas is the strongest argument for a principle which I think is fundamental:
Validity should be treated not as a property of a document but as a relationship between a document and a schema.
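To see what that relationship view buys you, consider a small hypothetical sketch in RELAX NG compact syntax (the contact vocabulary, its element names and its namespace URI are all invented for illustration). The same document can be validated against the full schema for version 1 of the vocabulary:
  # full schema for version 1 of an invented contact vocabulary
  default namespace = "http://example.com/contact/1"
  start = element contact {
    element name { text },
    element email { text }?,
    element phone { text }*
  }
and also against a much looser schema capturing only the minimum constraints that one particular consumer cares about:
  # the minimum a consumer that only reads names needs to check
  default namespace = "http://example.com/contact/1"
  start = element contact {
    element name { text },
    anyElement*
  }
  anyElement = element * { (attribute * { text } | text | anyElement)* }
Every document that is valid against the first schema is also valid against the second; which of these relationships matters depends on who is asking, and why.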
The other important conclusion that I would draw from Mark's discussion is that schema languages need to provide rich, flexible functionality for describing loose/open schemas. It is obvious that DTDs are a non-starter when judged against these criteria. I think it's also obvious that Schematron does very well. I would claim that RELAX NG also does well here, and is better in this respect than other grammar-based schema languages, in particular XSD. First, it carefully avoids anything that ties a document to a single schema:
  • there's nothing like xsi:schemaLocation or DOCTYPE declarations
  • there's nothing that ties a particular namespace name to a particular schema; from RELAX NG's perspective, a namespace name is just a label
  • there's nothing in RELAX NG that changes a document's infoset
Second, it has powerful features for expressing loose/open schemas:
  • it supports full regular tree grammars, with no ambiguity restrictions
  • it provides namespace-based wildcards for names in element and attribute patterns
  • it provides name classes with a name class difference operator
Together these features are very expressive. As a simple example, this pattern
attribute x { xsd:integer }?, attribute y { xsd:integer }?, attribute * - (x|y) { text }*
allows you to have any attribute with any value, except that an x or y attribute must be an integer. A more complex example is the schema for RDF. See my paper on The Design of RELAX NG for more discussion of the thinking underlying its design.
Finally, I have to disagree with the idea that you shouldn't validate what you receive. You should validate, but you need to carefully choose the schema against which you validate. When you design a language, you need to think about how future versions can evolve. When you specify a particular version of a language, you should precisely specify not just what is allowed by that version, but also what may be allowed by future versions, and how an implementation of this version should process things that may be added by future versions. An implementation should accept anything that the specification says is allowed by this version or may be allowed by a future version, and should reject anything else. The easiest and most reliable way to achieve this is by expressing the constraints of the specification in machine-readable form, as a schema in a suitable schema language, and then using a validator to enforce those constraints; a sketch of what such a schema might look like follows the list below. I believe that the right kind of validation can make interoperability over time more robust than the alternative, simpler approach of having an implementation just ignore anything that it doesn't need.
  • Validation enables mandatory extensions. Occasionally you want recipients to reject a document if they don't understand a particular extension, perhaps because the extension critically modifies the semantics of an existing feature. This is what the SOAP mustUnderstand attribute is all about.
  • Validation by servers reduces the problems caused by broken clients. Implementations accepting random junk leads inexorably to other implementations generating random junk. If you have a server that ignores anything it doesn't need, then deploying a new version of the server that adds support for additional features can break existing clients. Of course, if a language has a very unconstrained evolution policy, then validation won't be able to detect many client errors. However, by making appropriate use of XML namespaces, I believe it's possible to design language evolution policies that are both loose enough not to unduly constrain future versions and strict enough that a useful proportion of client errors can be detected. I think Atom is a good example.
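Here is the kind of sketch I have in mind for a version 1 schema that specifies, in machine-readable form, what future versions may add. Everything in it is invented for illustration: the vocabulary, its names, its namespace URI, and the particular evolution policy it encodes, which is only one possibility among many.
  # version 1 of an invented vocabulary, written so that it also
  # says what future versions are permitted to add
  default namespace v = "http://example.com/vocab/1"
  start = element v:item {
    # version 1 defines only the id attribute ...
    attribute id { xsd:NCName }?,
    # ... but a future version may add other attributes, with any value
    attribute * - id { text }*,
    element v:title { text },
    element v:summary { text }?,
    # a future version may append new elements in this namespace
    element v:* { anyContent }*
  }
  # arbitrary content: any attributes, text and nested elements
  anyContent = (attribute * { text } | text | element * { anyContent })*
A validator driven by this schema accepts anything that version 1, or a future version conforming to this policy, is allowed to produce, while still rejecting documents that break the constraints version 1 actually imposes, such as a missing v:title.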
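And for the namespace-based approach mentioned in the last bullet, here is a rough sketch of an Atom-style extension point. It is loosely modelled on the extension patterns in the Atom RELAX NG schema but simplified and not quoted from it; entryLike is a made-up name, and the real atom:entry requires more children than shown here.
  namespace atom = "http://www.w3.org/2005/Atom"

  # an extension element is any element NOT in the Atom namespace,
  # so a misspelled Atom-namespace element remains an error
  extensionElement = element * - atom:* { anyContent }
  anyContent = (attribute * { text } | text | element * { anyContent })*

  # a container element then interleaves extensions with the
  # children it requires
  start = entryLike
  entryLike = element atom:entry {
    element atom:id { text }
    & element atom:title { text }
    & extensionElement*
  }
With this style of schema, whether something unexpected is a tolerated extension or an error is decided by the namespace it lives in, which is the kind of evolution policy that lets a useful proportion of client errors be detected without freezing the format.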

7 comments:

halindrome said...

I appreciate your perspective that multiple schemas might apply to any document... but as a document creator, don't I have the right to declare what schema I want you to use? It is, after all, my document.

John Cowan said...

Shane: You have the right to declare it, and I have the right to ignore your declaration and apply another.

Perhaps I'm not even interested in the content of your document: maybe all I want to do is count it so that I can report how many documents there are. In that case, all questions of validity or invalidity go by the board. Less radically, I may be interested only in certain parts of your document, in which case I may apply a much looser schema to it than the one you used to validate it when you published it. Contrariwise, I may want to accept only a certain subset of your stream of documents, those which meet the requirements of a tighter schema than yours.

Marc de Graauw said...

Good points: validation is not a relation between a document and a single schema.

Coincidentally, yesterday an article (disclaimer: by me) was published on XML.com where I argue that one should list in an XML instance all the versions the sender knows of that may be used to validate it; this loosely couples the instance to a set of schemas. The receiver then uses this information to decide which schema to validate against (not necessarily one in the sender's list).

Josh Peters said...

There are at least three separate occasions when validation should be applied: when a document is first received, after it has been transformed, and when it is output for others to use.

It is necessary that an XML document validate as an XML document (the rules of which we hopefully all know by now). After validating that a given document is indeed XML I agree with James' belief that strict validation does not add much value.

The maxim the web follows is a good one: "Be liberal in what you accept, and conservative in what you send". Requiring XML documents to be valid according to a strict definition defeats this maxim. Therefore the strict validation of input should only happen after a round of transformation/filtering, not before. After I apply an XSLT document to copy the XHTML out of a given datasource, I should make sure that it is still a conforming XHTML document.

Finally, if I am to hand that document (or any other document) off to another service, it benefits me to validate it so that I can be sure that I am being conservative in what I am sending.

Thus we have three levels of validation in an XML workflow. Validating before transforming/filtering adds an unnecessary and often detrimental period of nagging which can easily be avoided.

And this is why (among other reasons) I believe that XHTML user agents should silently ignore any unknown namespaces they encounter. If my document contains RDF, SVG, MathML, or my own namespace for an internal process, why should you care?

tobe said...

You make some very valid points, all denouncing the current fallacy du jour that there is one correct schema that is tightly bound to a document.

However, I think you go wrong when you see every schema as defining a language, unless "Me Tarzan, you Jane, him Cheetah" can be considered a language. In EDI you typically send name-value pairs. Letting the name correspond to the tag name and the value correspond to the value, with some tags having compound values, is the natural way of doing it. Defining that as a language seems overly complicated.

Also, in EDI exchanges, it is difficult to orchestrate everyone converting at the same time. The data still needs to flow around the clock.

Finally, how often do you really know what you will need tomorrow?

When a modification to a "schema" results in semantic differences I think the best solution is to change the namespace. The SOAP mustUnderstand attribute is a nice idea in some circumstances, I suppose.

So I'm with Mark on this.

tobe said...

Further thoughts:

Well-formedness is a very important criterion and needs to be strictly enforced. As long as the document is well-formed, no process will "croak" on it. Beyond that, I don't generally see the need for great stringency. What real value does a SOAP envelope add (apart from complexity leading to greater salability of tools)? What value is added by saying that the servlet-name element has to be the first child of a servlet-mapping element (it certainly makes things more difficult for users, though)? If a browser vendor has a special interpretation of a proprietary-magic element inside an XHTML document, does it do any harm (except that people will learn to be careful about using it)?

Where schemas go wrong is in mixing an information-model description with current business rules and overspecifying the format. Well-formedness takes care of the parsability issue; let's handle the rest at the business level. Using schemas as model descriptions is unnecessarily complex. Examplotron was a step in the right direction, and http://tobe.homelinux.net/xis is another attempt (although it would benefit from more work).

Anonymous said...

Mark is a self-taught 'scientist', oops, that tells a lot. So I would not worry about anything he writes.

He also scripts in his high level rambling and frankly I would like to see his DCE, DCOM, RMI, and other skills tested.