James Clark's Random Thoughts: 2008-11

2008-11-23

MGrammar

MGrammar (Mg) is another key part of Oslo. My first reaction to Mg was: Yet another lex/yacc clone. Yawn. But now that I've looked at it a bit more closely, there are some features I find quite interesting.

I have always found parser generators to be a bit of a pain. I think one of the reasons is that the input to the parser generator typically mixes together a declarative specification of the grammar with procedural code that does something with the parse. There's not a clean separation between my code and the generated code.

Mg works in a rather different way. The specification in Mg is purely declarative. So how does it actually do anything useful? It constructs a labeled tree (actually a DAG) that represents the result of the parse. Mg has language constructs that allow you to control what tree gets constructed, but there's a reasonable default.

Another big difference is that it works much more dynamically than a typical parser generator. The generated parse tree is not strongly typed: it's just nodes with textual labels. You don't have to compile that parser ahead of time (although you can if you want). You can just give the library your grammar and it will compile it into an efficient form; you can them apply that compiled form to an input stream and get a parse tree.

The overall programming experience seems to be much more like using a regex library: at runtime, the regex gets compiled into an executable form; executing the regex tests whether the input matches the regex; and if it does match, you get structured data out (typically an array of strings, one for each captured group). I think this sort of programming model is much more convenient; it's particularly nice for a dynamic language which can potentially deal very conveniently with the untyped parse tree that you get from Mg.

Another interesting feature of Mg is that it has modules. The module system together with the fact that the grammar doesn't include any procedural code opens up the possibility of reusing grammars for languages and fragments of languages. It's hard to say how useful this will actually be in practice.

There's one more feature that's worth mentioning. You can attach annotations (called attributes) to production rules. These annotations can have structure (the same kind of structure as the parse tree). For example, the annotations might tell a text editor how to provide syntax aware editing features: in the Microsoft implementation, there's an annotation that the editor uses to highlight keywords.

The obvious missing feature is that there's no way to automatically go from the parse tree back into the textual form. I assume Microsoft will fix that.

I noticed only one thing that was really broken: Mg supports Unicode, but in cases where you need to specify a single character, it requires a 16-bit code unit representing part of the UTF-16 encoding of a character, rather than a code point (in the range 0 to 0x10FFFF). This is just wrong. It's slightly more work to do it properly, but you really can't avoid it. For example, you need it to properly support Unicode blocks/categories, since these blocks/categories are blocks/categories of code points not code units.

I hope we'll see some open source implementations of Mg: perhaps one in C/C++ hooked up to SpiderMonkey/V8/Python/Ruby/GNU Emacs and one in Java hooked up to Rhino/JRuby/Groovy/NetBeans/Eclipse.

There's a bigger issue lurking here. I think Microsoft see Mg as more than just a nifty library. It's part of their vision for a next generation application development platform, where developers become more productive by using custom DSLs rather than XML. I have mixed feelings about this. The syntax for M itself is defined using Mg, and Microsoft seems to be designing things so that much of the tooling that they build for M can easily be applied to anything with a Mg-defined grammar. The tooling seems to have quite an introspective feel to it, like a sophisticated Lisp or Smalltalk environment. The hacker side of me finds this quite cool.

On the other hand, gratuitous syntactic diversity is not a feature. I remember in the early days of XML, Tim Bray used to start his pitch for XML by showing a whole bunch of widely different Linux config file formats. It was quite compelling: the lack of consistency was obviously confusing and pointless. Now I don't think anybody would suggest that XML is the right format for everything. I wouldn't want to write programs in XML (except sometimes for XSLT :-), and after writing schemas in RELAX NG compact syntax for a while, I wouldn't want to have to go back to writing them in XML. How do you make your platform encourage developers to use a DSL where it makes sense, and discourage them when it doesn't? Up to now, part of the answer was that libraries made it a bit easier to use XML (or some other standard format) rather than some completely custom syntax; so unless there was a substantial benefit from a custom syntax, developers wouldn't bother. But if your platform provides tools that make it really easy to design new syntaxes, how do you avoid ending up in situation where every application has it's own private DSL? It doesn't help users if they have to learn a new syntax for every application. Certainly when I think about interchanging data on the Web, the fewer formats the better; I definitely don't want every application to be using its own completely custom syntax.

Some thoughts on the Oslo Modeling Language

Microsoft recently introduced Oslo. Microsoft seems to have designed Oslo to replace some of things it now uses XML for. Since Microsoft have been one of the biggest supporters of XML, I think it's worth looking at what they've come up with.

Overview of M

The key part of Oslo is the "M" language, which Microsoft calls a "modeling language". This integrates quite a broad range of functionality:

an abstract data model (analogous to the XML Infoset); M uses the term "value" to describe an instance of this abstract data model
a syntax for writing values (analogous to XML 1.0)
a type system which provides language constructs for describing constraints on values (analogous to XML Schema)
language constructs for querying values (analogous to XQuery)

I guess this is a similar range of functionality to SQL (although there are no constructs for doing updates yet).

I'm going to give a brief overview of M as I understand it. This is based purely on the spec. Please comment if I've misunderstood anything.

Let's start with the abstract data model. There are three kinds of value

simple values
collections
entities

Simple values are what you would expect (Unicode strings, numbers, booleans, dates etc). An important simple value is null, which is distinct from any other simple value.

A collection is a bag of values: it is unordered, it can have duplicates and it can contain arbitrary values.

An entity is a map from Unicode strings to values; each string/value pair is called a field.

A key feature of the abstract data model is that it's a graph rather than a tree. When the value of a field is an entity, the field conceptually holds a reference to that entity. I believe the same goes for collections, but I haven't completely grokked how identity works in M.

Simple values have the sort of syntax you would expect:

"foo" is a Unicode string (M calls string values Text)
123 is an integer
true is the boolean true value
2008-11-22 is a date
null is the null value

A collection is written in curly brackets with items separated by commas: { 1, 2, 3 } is a collection of three integers.

An entity uses a {field-name = field-value} syntax: { x = 2, y = 3 } is an entity with two fields. If the field name isn't an identifier, you have to surround it in square brackets: { [x coordinate] = 2, [y coordinate] = 3}.

You can label an entity or a collection and then use that label to specify a reference to that entity or collection. You label a collection just by putting an identifier before the opening curly brace. So

{ jack { name = "Jack", spouse = jill }, jill { name = "Jill", spouse = jack } }

would be a collection of two entities, where the spouse field of each entity is a reference to the other entity, and

loop { loop }

would denote a collection with a single member, which is a reference to itself.

A type in M can be thought of as being the collection of the values that conform to the type, so one way to specify a type is just to explicitly specify that collection. So, we could define a Boolean type like this:

type Logical { true, false }

In fact, M has predefined types corresponding to each kind of simple value. For entities, you specify a type by specifying the fields. So if we define Point like this:

type Point { x; y; }

then any entity that has both an x field and a y field is a Point. Note that entity types are open: an entity that has a z field as well as an x and a y field would conform to the Point type. You can also specify types for the fields:

type Point {
x : Number;
y : Number;
}

Types can be recursive:

type Node {
left : Node;
right : Node;
}

Collections can be specified using the * and + operators: Integer* is a collection of zero or more integers. There's also a ? operator, but it doesn't have anything to do with collections: T? is equivalent to the union of T and { null }.

So far, nothing very exciting. What makes M interesting is that it has a rich functional expression language that can be used for describing instances, types and queries. The analogy in the XML world would be the way XPath is used in instances via XPointer, in schema languages like Schematron and XSD 1.1 and in XQuery.

As well as the obvious kinds of expressions on simple values, you can have expressions on collections:

C | D is the union of C and D
C & D is the intersection of C and D
v in C tests whether v is a member of C
C <= D tests where C is a subset of D

Most importantly,

C where E

returns a collection containing those members of C for which E evaluates to true; the keyword value is bound to the member of C for which E is being evaluated. So

{ 1, 2, 3, 4 } where value % 2 == 0

returns { 2, 4 }.

What's really powerful is that expressions can be used on types in a similar way to how they are used on collections. If we want a type corresponding to even numbers, we can just do:

type Even : Integer where value % 2 == 0;

You can apply a where expression to an entity type definition to constrain the fields:

type ZeroSum {
x: Number;
y: Number;
} where x + y == 0;

You can also do something like:

type Name {
firstName: Text;
lastName: Text;
}
type Person {
dateOfBirth: Date;
} where value in Name;

Fields can have default values, for example:

type Person {
dateOrBirth : Date? = null;
}

In fact, when the type of a field is nullable (has null as one of its possible values), it automatically gets a default value of null. Similarly, when the type of a field is a collection that may be empty, it automatically gets a default value of an empty collection. This is quite clever: it makes fields declared as ? or * work nicely for both the provider and the consumer; the provider can leave the fields out, as would be natural for the provider, but the consumer always gets a sensible value for the field.

So we've seen how entity types can specify a default value. But how does an entity get connected with an entity type so that the default value can be created? Typing in M is structural. An entity in M doesn't inherently belong to any type other than Entity. Like a pattern in RELAX NG, a type in M describes a possible shape for a value, which any given value may or may not have, but a value isn't created with any particular type (other than one of the fundamental built in types).

M solves this by having a notion of type "ascription". An expression "v : T" ascribes the type T to the value v. M is functional so this doesn't mutate v, rather it returns a new value that has an augmented view of v as specified by T. So whereas

{ firstName: "James", lastName: "Clark" }.dateOfBirth

will throw an error,

({ firstName: "James", lastName: "Clark" } : Person).dateOfBirth

will return null (assuming Person declares dateOfBirth as nullable). (I would guess type ascription also does coercions on simple values.) Note that this implies that types in M aren't simply collections of values. M also allows entity types to have computed fields, which are virtual fields whose values are computed from the value of other fields.

The last area of M I want to look at is identity. You can specify what determines the identity for a entity:

type Person {
firstName: Text;
lastName: Text;
dateOfBirth: Date?
} where identity(firstName, lastName);

This means you can't have two Persons with the same firstName and lastName. The obvious question is: within what scope?

To answer, we have to look at the top-level layer that M provides. Values don't exist in isolation. Everything in M has to exist within a module. A module is a sort of top-level static entity. The fields of a module are called "extents". The scope of an identity constraint is an extent. So, if you do:

module Employee {
type Person {
    firstName: Text;
    lastName: Text;
    dateOfBirth: Date?
} where identity(firstName, lastName);
Persons: Person+
}

then it would be an error if within the Persons extent, there were two Person entities with the same firstName and lastName. There's also a way to automatically give a field a unique id:

type Name {
id : Integer32 = AutoNumber();
firstName: Text;
lastName: Text;
} where identity id;

I don't really have anything to say about the query part: it's quite close to LINQ.

My thoughts about M

There's quite a lot about M that I like. Mostly it seems pretty clean. The type system is powerful. Structural typing is clearly the right approach for something like M. I like the way that constraints that can be checked statically are seamlessly blended with constraints that will need to be checked dynamically. Static type checking is good, but there's no need to rub your users faces in the limitations of your static type checker. (It's interesting to see C# 4.0 moving in a similar direction.)

It's obviously very early days for M, and there's still lots of scope for improvement. Microsoft's initial implementation of M targets SQL and works a bit like a database. But clearly there's potential for using something like M in quite different contexts, e.g. for exchanging data on the Web (like what I talked about some time ago with TEDI). But several aspects of the current design seem to reflect the initial database focus. For example, the current top level wrapper of modules/entities makes sense for a database application of M, but wouldn't work so well if you were using M for exchanging data on the Web.

The spec needs some fleshing out. Microsoft's current implementation is a long way from implementing the full language. I am skeptical whether the language as currently specified can be fully implemented. For example, can you implement the test for whether one type is a subtype of another type so that it works in non-exponential time for two arbitrary types?

I see several major things missing in M, whose absence might be acceptable for a database application of M, but which would be a significant barrier for other applications of M. Most fundamental is order. M has two types of compound value, collections and entities, and they are both unordered. In XML, unordered is the poor relation of ordered. Attributes are unordered, but attributes cannot have structured values. Elements have structure but there's no way in the instance to say that the order of child elements is not significant. The lack of support for unordered data is clearly a weakness of XML for many applications. On the other hand, order is equally crucial for other applications. Obviously, you can fake order in M by having index fields in entities and such like. But it's still faking it. A good modeling language needs to support both ordered and unordered data in a first class way. This issue is perhaps the most fundamental because it affects the data model.

Another area where M seems weak is identity. In the abstract data model, entities have identity independently of the values of their fields. But the type system forces me to talk about identity in an SQL-like way by creating artificial fields that duplicate the inherent identity of the entity. Worse, scopes for identity are extents, which are flat tables. Related to this is support for hierarchy. A graph is a more general data model than a tree, so I am happy to have graphs rather than trees. But when I am dealing with trees, I want to be able to say that the graph is a tree (which amounts to specifying constraints on the identity of nodes in the graph), and I want to be able to operate on it as a tree, in particular I want hierarchical paths.

One of the strengths of XML is that it handles both documents and data. This is important because the world doesn't neatly divide into documents and data. You have data that contains documents and document that contain data. The key thing you need to model documents cleanly is mixed text. How are you going to support documents in M? The lack of support for order is a major problem here, because ordered is the norm for documents.

A related issue is how M and XML fit together. I believe there's a canonical way to represent an M value as an XML document. But if you have data that's in XML how do you express it in M? In many cases, you will want to translate your XML structure into an M structure that cleanly models your data. But you might not always want to take the time to do that, and if your XML has document-like content, it is going to get ugly. You might be better off representing chunks of XML as simple values in M (just as in the JSON world, you often get strings containing chunks of HTML). M should make this easy. You could solve this elegantly with RELAX NG (I know this isn't going to happen given Microsoft's commitment to XSD, but it's an interesting thought experiment): provide a function that allows you to constrain a simple value to match a RELAX NG pattern expressed in the compact syntax (with the compact syntax perhaps tweaked to harmonize with the rest of M's syntax) and use M's repertoire of simple types as a RELAX NG datatype library.

Finally, there's the issue of standardization. The achievement of XML in my mind isn't primarily a technical one. It's a social one: getting a huge range of communities to agree to use a common format. Standardization was the critical factor in getting that agreement. XML would not have gone anywhere as a single vendor format. It was striking that the talks about Oslo at the PDC made several mentions of open source, and how Microsoft was putting the spec under its Open Specification Promise so as to enable open source implementations, but no mentions of standardization. I can understand this: if I was Microsoft, I certainly wouldn't be keen to repeat the XSD or OOXML experience. But open source is not a substitute for standardization.

2008-11-17

What's allowed in a URI?

Java 1.4 introduced the java.net.URI which provides RFC 2936-compliant URI handling. I thought I should try to fix Jing and Trang to use this. So I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.

It's convenient to begin with XLink. Section 5.4 requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as

all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732

If we look at 2.4.3 of RFC 2396 (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:

control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Section 3 of RFC 2732 (which modifies RFC 2396 to handle IPv6 addresses) does indeed allow square brackets by removing them from the 'unwise' set.

Putting these all together, we can distinguish the following categories of characters that are allowed by XLink but not allowed by RFC 2396/RFC 2732

C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
space (#x20)
disallowed ASCII graphic characters, specifically: <>"{}|\^`
delete (#x7F)
non-ASCII Unicode characters, excluding surrogates #x80-#xD7FF, #xE000-#x10FFFF (XML does not allow #xFFFE and #xFFFF)

Looking at the various XML-related specs, things seem to be nicely aligned:

XML 1.0 First Edition required escaping just for category 5, but XML 1.0 Second Edition got fixed to use the same wording as XLink
XML Base uses the same wording as XLink
XML Schema Part 2 references XLink (in specifying xs:anyURI)
RELAX NG references XLink

XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). That seems like a bug to me. Erratum E39 adds the following to the first paragraph of the spec:

For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.

This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how it's "Thus, ". In any case, XSLT 2.0 gets it right: it references xs:anyURI.

RFC 2396 has been updated by RFC 3986. This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.

Next we have the IRI spec, RFC 3987. This defines:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

It adds ucschar to the set of unreserved characters and adds iprivate to what's allowed in the query of a URI. The characters in my category 5 that are in neither ucschar nor iprivate are as follows:

C1 controls: #x80 - #x9F
the 66 Unicode noncharacters: #xFDD0 - #xFDEF, and any code point whose bottom 16 bits are FFFE or FFFF
Specials: #xFFF0 - #xFFFD; these fall into three groups, unassigned specials (#xFFF0 - #xFFF8), annotation characters (#xFFF9 - #xFFFB) and replacement characters (#xFFFC - #xFFFD)
Language tags: #xE0000 - #xE0FFF

I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters. You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.

RFC 3987 also has the following in Section 3.2:

Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`"

Those characters correspond to my categories 2 and 3. Overall there are a lot of subtle differences between IRIs and the thing that is currently allowed by XML specs.

Fortunately there is a draft of a new version of the IRI spec. This introduces Legacy Extended IRI (LEIRI) references, which defines ucschar as:

   ucschar        = " " / "<" / ">" / '"' / "{" / "}" / "|"
                     / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                     / %xE000-FFFD / %x10000-10FFFF

which exactly corresponds to my categories 1 to 5.

LEIRIs seem like a very useful innovation. XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.

Finally we are ready to look at java.net.URI. This allows URIs to contain an additional set of "other" characters which consist of non-ASCII characters with the exception of:

C1 controls (#x80 - #x9F)
Characters with a category of Zs, Zl or Zp

This means that if you want to give an LEIRI such as an XML system identifier to java.net.URI you first need to percent encode any of the following:

the following ASCII graphic characters: <>"{}|\^`
C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
space (#x20)
delete (#x7F)
C1 controls (#x80 - #x9F)
Characters with a category of Zs, Zl or Zp

All except the first can be tested with Character.isISOControl(c) || Character.isSpace(c).

Note that you don't want to blindly percent encode all non-ASCII characters because that will unnecessarily make IRIs containing non-ASCII characters unintelligible to humans.

2008-11-09

Working on Jing and Trang

I've been back to working on Jing and Trang for about a month now. It would be something of an understatement to say that they were badly in need of some maintenance love.

I started a jing-trang project on Google Code to host future development. There are new releases of both Jing and Trang in the downloads section of the project site. These have been out for about 10 days, and there have been a reasonable number of downloads, and no reports of any major bugs, so I think these should be fairly solid. (Interestingly, the number of downloads of Trang have been running at about twice those of Jing.)

It's been 5 years since the last release, so what new and exciting features are there? Well, actually, in the current release, none. My work for that release was focused on two areas:

getting things to work properly with current versions of Java and other dependencies;
getting the source code structure and build system into reasonable shape.

The second was a lot of work. The code base for Jing and Trang had evolved over a number of years, incorporating various bits of functionality that were independent of each other to various degrees; its structure only made any sense from a historical perspective. The current structure is now nicely modular. I converted my CVS repository to subversion before I started moving things around, so the complete history is available in the project repository. For people who want to stay on the bleeding edge, it's now really easy to check out and build from subversion.

My natural tendencies are much more to the cathedral than to the bazaar, but I'm trying to be more open. I'm pleased to say that are already two committers in addition to myself. There's a commercial XML editor called <oXygen/>, which uses Jing and Trang to support RELAX NG. The main guy behind that, George Bina, had made a number of useful improvements. In particular, he upgraded Jing's support for the Namespace Routing Language to its ISO-standardized version, which is called NVDL (you might want to start with this NVDL tutorial rather than the spec). This is now on the trunk. The other committer is Henri Sivonen, who has been using Jing in his Validator.nu service.

My goals for the next release are:

complete support for NVDL (I think the only missing feature is inline schemas)
support for the ISO-standardized version of Schematron
customizable resource resolution support (so that, for example, you can use XML catalogs)
support standard JAXP XML validation API (javax.xml.validation)
more code cleanup

Please use the issue tracker to let me know what you would like. Google Code has a system that allow you to vote for issues: if you are logged in, which you can do with a regular Google account, each issue will be displayed with a check box next to a star; checking this box "stars" the issue for you, which both adds a vote for the issue and gets you email notifications about changes to it.

I haven't started any project-specific mailing lists yet. For developers, the issue tracker seems to be enough at the moment. For users, Jing and Trang are within the scope of the existing RELAX NG Users mailing list on Yahoo Groups.

James Clark's Random Thoughts

2008-11-23

MGrammar

Some thoughts on the Oslo Modeling Language

2008-11-17

What's allowed in a URI?

2008-11-09

Working on Jing and Trang

Labels

Blog Archive

James Clark's Random Thoughts

2008-11-23

MGrammar

Some thoughts on the Oslo Modeling Language

2008-11-17

What's allowed in a URI?

2008-11-09

Working on Jing and Trang

Subscribe To

Labels

Blog Archive