2008-11-23

MGrammar

MGrammar (Mg) is another key part of Oslo.  My first reaction to Mg was: Yet another lex/yacc clone. Yawn. But now that I've looked at it a bit more closely, there are some features I find quite interesting.

I have always found parser generators to be a bit of a pain. I think one of the reasons is that the input to the parser generator typically mixes together a declarative specification of the grammar with procedural code that does something with the parse.  There's not a clean separation between my code and the generated code.

Mg works in a rather different way.  The specification in Mg is purely declarative.  So how does it actually do anything useful?  It constructs a labeled tree (actually a DAG) that represents the result of the parse.  Mg has language constructs that allow you to control what tree gets constructed, but there's a reasonable default.
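To give a flavor of this, here's a rough sketch of a small Mg grammar, reconstructed from my memory of the CTP documentation (the grammar and all the names are mine, and the constructor syntax may differ in detail from the shipping bits).  The => clause on the Contact rule overrides the default tree that would otherwise be built:

    module Contacts {
      language ContactList {
        // By default the output tree gets a node for each syntax rule,
        // labeled with the rule name.
        syntax Main = Contact*;

        // A constructor (the part after =>) controls the shape of the
        // output tree instead of accepting the default.
        syntax Contact = n:Name "," p:Phone
          => Contact { Name { n }, Phone { p } };

        token Name = ("A".."Z" | "a".."z")+;
        token Phone = ("0".."9" | "-")+;

        // Interleave rules match text (here whitespace) that can occur
        // anywhere between tokens and is left out of the output tree.
        interleave Whitespace = " " | "\t" | "\r" | "\n";
      }
    }

Feeding this an input like "Kim , 555-0123" would give you a Contact node with Name and Phone children; at no point does any procedural code of yours run during the parse.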

Another big difference is that it works much more dynamically than a typical parser generator.  The generated parse tree is not strongly typed: it's just nodes with textual labels.  You don't have to compile the parser ahead of time (although you can if you want).  You can just give the library your grammar and it will compile it into an efficient form; you can then apply that compiled form to an input stream and get a parse tree.

The overall programming experience seems to be much more like using a regex library: at runtime, the regex gets compiled into an executable form; executing the regex tests whether the input matches the regex; and if it does match, you get structured data out (typically an array of strings, one for each captured group).  I think this sort of programming model is much more convenient; it's particularly nice for a dynamic language which can potentially deal very conveniently with the untyped parse tree that you get from Mg.

Another interesting feature of Mg is that it has modules.  The module system together with the fact that the grammar doesn't include any procedural code opens up the possibility of reusing grammars for languages and fragments of languages. It's hard to say how useful this will actually be in practice.
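I haven't experimented with this much, but the idea would be something like the following sketch.  This is hypothetical: I'm assuming Mg grammar modules support import/export the way M modules do, and that rules can be referenced across modules by qualified name; all the names here are made up.

    module Lexical {
      export Base;
      language Base {
        token Identifier = ("A".."Z" | "a".."z" | "_")+;
        token Number = ("0".."9")+;
      }
    }

    module Config {
      import Lexical;
      language Settings {
        // Reuse the token rules from the Lexical module rather than
        // restating them in every language that needs them.
        syntax Main = Setting*;
        syntax Setting = Lexical.Base.Identifier "=" Lexical.Base.Number ";";
        interleave Whitespace = " " | "\r" | "\n";
      }
    }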

There's one more feature that's worth mentioning.  You can attach annotations (called attributes) to production rules.  These annotations can have structure (the same kind of structure as the parse tree).  For example, the annotations might tell a text editor how to provide syntax aware editing features: in the Microsoft implementation, there's an annotation that the editor uses to highlight keywords.
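As far as I can tell from the CTP, a keyword-highlighting annotation looks something like this (the example is mine; Classification is, I believe, the attribute the Intellipad editor understands):

    module Demo {
      language Mini {
        // The editor reads this annotation and highlights anything
        // matching the Keyword token as a keyword.
        @{Classification["Keyword"]}
        token Keyword = "if" | "then" | "else";

        token Word = ("a".."z")+;
        syntax Main = (Keyword | Word)*;
        interleave Whitespace = " ";
      }
    }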

The obvious missing feature is that there's no way to automatically go from the parse tree back into the textual form.  I assume Microsoft will fix that.

I noticed only one thing that was really broken: Mg supports Unicode, but in cases where you need to specify a single character, it requires a 16-bit code unit representing part of the UTF-16 encoding of a character, rather than a code point (in the range 0 to 0x10FFFF).  This is just wrong.  For example, U+1D11E (MUSICAL SYMBOL G CLEF) is encoded in UTF-16 as the surrogate pair D834 DD1E, so no single code unit can represent it.  It's slightly more work to do it properly, but you really can't avoid it: Unicode blocks and categories, for instance, are blocks and categories of code points, not code units, so you can't support them properly in terms of code units.

I hope we'll see some open source implementations of Mg: perhaps one in C/C++ hooked up to SpiderMonkey/V8/Python/Ruby/GNU Emacs and one in Java hooked up to Rhino/JRuby/Groovy/NetBeans/Eclipse.

There's a bigger issue lurking here.  I think Microsoft sees Mg as more than just a nifty library.  It's part of their vision for a next-generation application development platform, where developers become more productive by using custom DSLs rather than XML. I have mixed feelings about this. The syntax for M itself is defined using Mg, and Microsoft seems to be designing things so that much of the tooling they build for M can easily be applied to anything with an Mg-defined grammar. The tooling seems to have quite an introspective feel to it, like a sophisticated Lisp or Smalltalk environment.  The hacker side of me finds this quite cool.

On the other hand, gratuitous syntactic diversity is not a feature.  I remember in the early days of XML, Tim Bray used to start his pitch for XML by showing a whole bunch of widely different Linux config file formats.  It was quite compelling: the lack of consistency was obviously confusing and pointless. Now I don't think anybody would suggest that XML is the right format for everything.  I wouldn't want to write programs in XML (except sometimes for XSLT :-), and after writing schemas in RELAX NG compact syntax for a while, I wouldn't want to have to go back to writing them in XML.

How do you make your platform encourage developers to use a DSL where it makes sense, and discourage them when it doesn't? Up to now, part of the answer was that libraries made it a bit easier to use XML (or some other standard format) rather than some completely custom syntax; so unless there was a substantial benefit from a custom syntax, developers wouldn't bother.  But if your platform provides tools that make it really easy to design new syntaxes, how do you avoid ending up in a situation where every application has its own private DSL?  It doesn't help users if they have to learn a new syntax for every application. Certainly when I think about interchanging data on the Web, the fewer formats the better; I definitely don't want every application to be using its own completely custom syntax.

5 comments:

Ben Lings said...

"I definitely don't want every application to be using its own completely custom syntax"

I don't think you'll need to worry about having to deal with lots of DSLs. My impression is that Microsoft is targeting in-house business applications and developers. I think there's a lot of scope for using DSLs for describing business rules in a form that is understandable by non-programmers in the business.

Unknown said...

DAG = Directed Acyclic Graph
DSL = Domain-Specific Language

Rob Jellinghaus said...

MGrammar in some ways is the most immediately useful part of Oslo. There's no reason you need to write a new DSL to use it. In fact, writing MGrammar descriptions of formats you already use can be a very big win.

I see using Oslo DSLs more for business rules and other algorithmic/declarative information that doesn't really have a lot of interoperability implications. It's not targeted at data communication per se. (Though it's no slouch at expressing protocols, if you want to use it for that.)

Anonymous said...

"I definitely don't want every application to be using its own completely custom syntax"

You already have to do this to some degree. When you start working for a new company, or a new department in your current company, you have to wade through its terminology and how all of that is cooked into the code. The same goes when you bring in a third-party API, or even when you create an API for other developers to use.

Hopefully MGrammar would reduce the noise and make it easier for us to learn and use those APIs.

const said...

You might be interested to check out my ETL framework. I'm still working on the web site, but documentation is available as part of the distribution. I have just released version 0.2.1.

Just download http://downloads.sourceforge.net/etl/etl-java-0_2_1-xmlout.zip
and open etl-java-0_2_1\doc\readme.html inside it. There is a tutorial that should help with picking up the language basics.

This language definition mechanism is designed to allow grammar reuse, but at the cost of some universality. It also defines passive grammars. It is designed to be orthogonal to the metamodel issue, so it is easy to establish a mapping to existing metamodels. Mappings have actually been implemented for JavaBeans, EMF, and a simple model that uses reflection over classes and fields.