<h2>Why Ballerina is a language</h2>
<p><em>James Clark, 2022-05-04</em></p>
<p>
A new programming language is a lot of work, and the chances of any new programming language getting traction are small. Many new languages are created. Very few make it.
</p>
<p>
Perhaps even more work than the language is the platform - all the other things that are needed to make users productive with the language: standard library, package manager, IDE, debugger, documentation system, testing tools, etc. One way to reduce the cost of a new language is to leverage an existing platform, such as Java or .NET, and rely on that platform for some of the needed functionality.
</p>
<p>
Ballerina is a new programming language, and is also a platform. Although it's implemented on top of the JVM, it does not embrace the JVM. It is designed with the goal that we can do <a href="https://github.com/ballerina-platform/nballerina">another implementation</a> that does not use the JVM, and user code will run unchanged. (We do provide JVM interop features, but that is specifically for when you want to interop with existing JVM code.)
</p>
<p>
This raises an obvious question. Why? In this post, I want to address this question by explaining what there is in the Ballerina language and platform that could not be done except with a new language and platform. These can be grouped into three areas:
</p>
<ul>
<li>networking: networking abstractions, which are part of the language, and implementations of those abstractions provided by the standard library;
<li>data and types: the kinds of values that the language operates on and the ways that the type system provides to describe these;
<li>concurrency: how the language enables the program to describe concurrent execution of code and control concurrent access to mutable state.
</li>
</ul>
<p>
These three areas are fundamental: they could not be grafted onto another language. They are also deeply interconnected. In addition, there are some supporting features that are not so fundamental, but which together provide significant value.
</p>
<p>
This blog is not a complete answer to the "Why?" question. Ballerina's development is funded by WSO2 and WSO2's ultimate goal is to create a product that is useful to its customers. But Ballerina is not itself the product: both Ballerina the language and Ballerina the platform are free and open source. The <a href="https://wso2.com/choreo/">product</a> is separate: it's a cloud service that takes advantage of Ballerina's capabilities.
</p>
<h3 id="network-abstractions">Network abstractions</h3>
<p>
The Ballerina language provides abstractions for both network services and network clients, but knows nothing about specific protocols. Protocol-specific library code is needed to make these abstractions available for a specific protocol. The standard library includes this for the following protocols:
</p>
<ul>
<li><a href="https://lib.ballerina.io/ballerina/http/latest">HTTP</a>
<li><a href="https://lib.ballerina.io/ballerina/graphql/latest">GraphQL</a>
<li><a href="https://lib.ballerina.io/ballerina/grpc/latest">gRPC</a>
<li><a href="https://lib.ballerina.io/ballerina/websocket/latest">WebSocket</a>
<li><a href="https://lib.ballerina.io/ballerina/websub/latest">WebSub</a>
<li><a href="https://lib.ballerina.io/ballerina/ftp/latest">FTP</a>
<li>Sockets (<a href="https://lib.ballerina.io/ballerina/tcp/latest">TCP</a>/<a href="https://lib.ballerina.io/ballerina/udp/latest">UDP</a>)
<li>Messaging (<a href="https://central.ballerina.io/ballerinax/rabbitmq">AMQP</a>, <a href="https://central.ballerina.io/ballerinax/kafka">Kafka</a>, <a href="https://central.ballerina.io/ballerinax/nats">NATS</a>, <a href="https://central.ballerina.io/ballerina/email">IMAP4</a>, <a href="https://central.ballerina.io/ballerina/email">POP3</a>)
</li>
</ul>
<p>
The network abstractions for clients are more straightforward than for services. For clients, the network abstraction consists of a distinct kind of object, called a client object, which has a distinct kind of method, called a remote method, which represents outbound network messages. The standard library supports a protocol by providing an implementation of a client object for that protocol. The language provides a distinctive syntax for remote method calls on client objects and syntactically restricts where such calls can appear. This enables the Ballerina <a href="https://marketplace.visualstudio.com/items?itemName=WSO2.Ballerina">VS Code extension</a> to provide a <a href="https://ballerina.io/why-ballerina/graphical/#sequence-diagrams-in-ballerina">graphical view of a function</a> or program, which uses a sequence diagram to show the interactions between client objects and remote services. This graphical view always remains in sync with the textual view, and both views are editable.
</p>
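<p>
As a minimal sketch of what this looks like (the host and path here are placeholders, not a real service): a client object is created with <code>new</code>, and its remote methods are invoked with the distinctive <code>-&gt;</code> syntax that the sequence-diagram view keys on.
</p>

```ballerina
import ballerina/http;
import ballerina/io;

public function main() returns error? {
    // A client object representing an outbound HTTP connection.
    // (The URL is a placeholder for this example.)
    http:Client apiClient = check new ("https://api.example.com");
    // Remote method calls use the distinctive `->` syntax;
    // each such call appears as an arrow in the graphical view.
    json status = check apiClient->get("/status");
    io:println(status);
}
```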
<p>
For services, Ballerina provides a distinct kind of object, called a service object. A remote method on a service object represents a network-callable method. Incoming network messages are dispatched to service objects by using objects implementing the language-defined Listener type. The standard library supports a protocol for services by providing an implementation of the Listener type for the protocol. The language also provides convenient syntax for a module to construct a Listener object, and to define a service object and attach it to a Listener object.
</p>
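<p>
A sketch of the service-side syntax (the port and path are arbitrary): a single <code>service</code> declaration constructs a service object and attaches it to a newly constructed Listener.
</p>

```ballerina
import ballerina/http;

// Constructs an http:Listener and attaches a service object to it.
// After module initialization, the program enters the listening
// phase and this service becomes a network entry point.
service /hello on new http:Listener(8080) {
    resource function get greeting(string name) returns string {
        return string `Hello, ${name}!`;
    }
}
```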
<p>
For many languages, the execution model of a program is simply to call a function, which represents the entry point of the program. In Ballerina, the services defined by a program's modules are the network entry points of the program, and this is incorporated into the execution model of a Ballerina program. When a program is executed, every module will first be initialized; this will construct and connect up the module's Listener and service objects. After all modules have been initialized, the program enters a listening phase, which makes the Listeners start accepting network input. The execution model also deals with shutting down services.
</p>
<p>
The language-provided network abstractions make a program's interaction with the network explicit. This is used to provide <a href="https://ballerina.io/learn/observing-ballerina-programs/observing-your-application-with-prometheus-grafana-jaeger-and-the-elastic-stack/">network observability</a>. It is also the basis for the <a href="https://ballerina.io/why-ballerina/cloud-native/#code-to-cloud">code-to-cloud</a> support, which uses compiler extensions to generate artifacts needed for deployment to different cloud platforms (<a href="https://ballerina.io/learn/deploying-ballerina-on-kubernetes/">K8s</a>, <a href="https://ballerina.io/learn/running-ballerina-programs-in-the-cloud/function-as-a-service-with-ballerina/azure-functions/">Azure</a>, <a href="https://ballerina.io/learn/running-ballerina-programs-in-the-cloud/function-as-a-service-with-ballerina/aws-lambda/">AWS</a>).
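 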
</p>
<p>
Service objects support the concept of a <em>resource</em> method, which enables a more data-oriented view of services. This can be thought of as a network-oriented generalization of OO getter/setter methods, where get/set is generalized to the protocol-defined method (e.g. the HTTP method name get/put/post) and the property name is generalized to a path. The standard library provides implementations of this for <a href="https://ballerina.io/learn/writing-a-restful-api-with-ballerina/">HTTP</a> and <a href="https://ballerina.io/learn/writing-a-graphql-api-with-ballerina/">GraphQL</a> services. This avoids the pain that comes from having to artificially combine the HTTP method name and the resource path into a single identifier (what OpenAPI calls the operationId). We are working on extending the resource method concept to client objects.
</p>
<p>
Normally, when a service and client use a request-response message exchange pattern, the remote method on the service can use its return value to provide its response to a request. But this is not always sufficient: in some cases, the service may want to control what happens if there is an error in sending the response; in other cases, it may be using a more complex message exchange pattern. Ballerina models this by passing a client object as an argument to the service's remote method; the service's remote method calls remote methods on this client object to send messages back to the client.
</p>
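<p>
For HTTP, this pattern looks roughly like the following sketch, where the resource method receives an <code>http:Caller</code> and replies through it instead of through its return value (the path and payload are invented for illustration).
</p>

```ballerina
import ballerina/http;
import ballerina/log;

service /events on new http:Listener(8081) {
    // The caller argument is a client object pointing back at the
    // requester; the response is sent by a remote call on it.
    resource function post submit(http:Caller caller, @http:Payload json event) returns error? {
        error? result = caller->respond("accepted");
        if result is error {
            // The service itself decides what happens when
            // sending the response fails.
            log:printError("failed to send response", 'error = result);
        }
    }
}
```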
<h3 id="data-and-types">Data and types</h3>
<h4 id="plain-data">Plain data</h4>
<p>
One of the most fundamental aspects of Ballerina is its focus on plain data. This is called <code><a href="https://ballerina.io/spec/lang/2022R1/#anydata">anydata</a></code> in Ballerina and is analogous to the POD (Plain Old Data) concept in C++. It is pure data, independent of processing that might be applied to the data.
</p>
<p>
Messages exchanged by network protocols are represented by plain data; the implementations of network protocols can automatically <a href="https://github.com/ballerina-platform/module-ballerina-http/blob/master/docs/spec/spec.md#2341-httpcaller">serialize plain data</a> in a format appropriate to the protocol. In particular, plain data can be directly serialized to and from JSON in a simple, natural way.
</p>
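<p>
For instance (a sketch with an invented record type): a record value is <code>anydata</code>, and converts to JSON with the built-in <code>toJsonString</code> function.
</p>

```ballerina
import ballerina/io;

type Person record {|
    string name;
    int age;
|};

public function main() {
    // A record value is plain data: no behavior attached, just data.
    Person p = {name: "Ann", age: 32};
    // Plain data serializes to JSON in a simple, natural way.
    io:println(p.toJsonString());
}
```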
<p>
The whole Ballerina platform is designed to maximize use of plain data. Objects, which bundle methods with data, are not plain data, and the platform uses plain data rather than objects, unless the specific functionality provided by objects is needed. Services and clients are represented as objects; the parameters and return values of remote methods are plain data.
</p>
<p>
Structured data throughout the platform is represented using the built-in map and array types, which are plain data, rather than using library-defined collection types. In addition to maps and arrays, Ballerina provides a built-in <a href="https://ballerina.io/spec/lang/2022R1/#tables">table</a> type, which allows for collections with arbitrary plain data keys (maps have only string keys as in JSON); tables are automatically transformed into arrays of objects when serializing to JSON. The table type provides enough power that even sophisticated, complex programs can be written using only the language-provided collection types.
</p>
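<p>
A small illustration of the table type (the field names are invented for the example): the collection is keyed by plain data, with no library-defined collection class involved.
</p>

```ballerina
type Employee record {|
    readonly int id;
    string name;
|};

public function main() {
    // A built-in collection with an arbitrary plain-data key
    // (here an int), unlike JSON-style string-keyed maps.
    table<Employee> key(id) employees = table [
        {id: 1, name: "Ann"},
        {id: 2, name: "Bob"}
    ];
    Employee? e = employees[2]; // key-based lookup
}
```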
<p>
Ballerina has a structural type system, which has several features typically found in schema languages, such as unions and open records. The overall result is that Ballerina types for plain data work well as schemas for network messages. Subtyping is simple and flexible because it is semantic: types are thought of as sets of values, with subtyping corresponding to the subset relationship between the corresponding sets. For example, every value of a user-defined record type is also a map. This allows the platform to easily convert between user-defined types and generic types (like <code>anydata</code>). Converting to the generic type is a no-op, because of the subtype relationship. In the other direction, the platform uses a <a href="https://ballerina.io/learn/by-example/converting-to-user-defined-type.html">language capability</a> (similar to a type cast) to validate and convert the value to a user-defined type.
</p>
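<p>
In code, the two directions look something like this sketch (the record type is invented): widening to <code>json</code> needs no conversion, while narrowing uses <code>cloneWithType</code>, which validates the value against the target type at runtime.
</p>

```ballerina
type Coord record {|
    float x;
    float y;
|};

public function main() returns error? {
    json j = {x: 1.0, y: 2.5};
    // Narrowing: validate-and-convert generic data to a user type;
    // fails with an error if the shape does not match.
    Coord c = check j.cloneWithType();
    // Widening: a Coord value is already plain data, so this
    // direction is a no-op.
    json back = c.toJson();
}
```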
<h4 id="types-for-services">Types for services</h4>
<p>
The user defines a service by writing resource methods or remote methods. For the HTTP and <a href="http://ballerina.io/#graphql-api">GraphQL</a> protocols, the user can define types (most often record types) and use them for the <a href="https://ballerina.io/learn/writing-a-restful-api-with-ballerina/#writing-a-resource-to-get-all-the-covid-19-data">parameters</a> and <a href="https://ballerina.io/learn/writing-a-restful-api-with-ballerina/#writing-a-resource-to-get-all-the-covid-19-data">return values</a>. The platform's Listener implementation for the protocol makes this just work: the incoming messages will be validated and converted using the parameter types. Annotations can be used to fine-tune this, for example to control whether a method parameter should come from a query parameter or the <a href="https://ballerina.io/learn/by-example/http-resource-returns.html">payload</a>.
</p>
<p>
The platform can also use the service definitions to generate an IDL. For HTTP, this would be <a href="https://ballerina.io/learn/ballerina-openapi-support/">OpenAPI</a>. The types specified for the parameters and return value are converted to a JSON schema.
</p>
<p>
This works for GraphQL in a similar way: the GraphQL Listener exposes a GraphQL service; it constructs the GraphQL schema for this service from the types in the resource methods; GraphQL introspection is used to make the schema available to clients at runtime.
</p>
<p>
For <a href="https://ballerina.io/#grpc-api">gRPC</a>, the platform uses an <a href="https://ballerina.io/learn/writing-a-grpc-service-with-ballerina/">IDL-first approach</a> (the gRPC community's preferred approach). The platform allows a Ballerina service definition stub to be generated from the gRPC service definitions.
</p>
<h4 id="types-for-clients">Types for clients</h4>
<p>
The platform uses two approaches to allow clients to work with typed data. The remote method on the generic client class can at runtime convert the response to a user-specified type passed as an argument.
</p>
<p>
Alternatively, the platform can <a href="https://ballerina.io/learn/ballerina-openapi-support/#generating-a-ballerina-client-from-an-openapi-definition">generate</a> an application-specific client class from the service's IDL. Note that this supports the same graphical view as the generic client. For GraphQL, the platform can generate an application-specific typed client using a user-specified set of GraphQL queries.
</p>
<h3 id="concurrency">Concurrency</h3>
<p>
The graphical view of a function as a sequence diagram provided by the VS Code extension shows not only how the function accesses network services, but also the concurrent logic of the function. The language's <a href="https://ballerina.io/spec/lang/2022R1/#section_7.3.2">worker</a>-based concurrency primitives are designed for this graphical view. In the sequence diagram, each worker is represented by a vertical lifeline, and a message passed between workers is represented by a horizontal arrow between the corresponding lifelines. The textual representation is more complex, and requires the compiler to <a href="https://ballerina.io/spec/lang/2022R1/#section_7.8.4">pair up</a> <a href="https://ballerina.io/spec/lang/2022R1/#section_7.8.1">sends</a> and <a href="https://ballerina.io/spec/lang/2022R1/#section_7.8.2">receives</a>. The compiler can also detect potential deadlocks. These primitives have limited expressiveness compared to the concurrency primitives offered by most languages, but are much easier and safer to use in cases where this expressiveness is sufficient.
</p>
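<p>
A sketch of the worker primitives: each named worker is a lifeline in the diagram, <code>-&gt;</code> sends a message, <code>&lt;-</code> receives one, and the compiler statically pairs them up.
</p>

```ballerina
function gather() returns int {
    worker A {
        // Runs concurrently with worker B and the default worker.
        int part = 21;
        part -> function; // send to the function's default worker
    }
    worker B {
        int part = 21;
        part -> function;
    }
    // The compiler pairs each receive with its send, which is how
    // it can detect potential deadlocks at compile time.
    int a = <- A;
    int b = <- B;
    return a + b;
}
```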
<p>
Ballerina allows programmers to make use of shared mutable state in a familiar way, yet the platform also allows user-defined services to be executed in parallel, with a compile-time safety guarantee that this will not cause data races. This leverages a combination of language features: a simple <a href="https://ballerina.io/spec/lang/2022R1/#isolated_functions">locking</a> primitive, <a href="https://ballerina.io/spec/lang/2022R1/#section_5.1.2">read-only</a> <a href="https://ballerina.io/spec/lang/2022R1/#section_5.6.5">types</a>, and a concept of <a href="https://ballerina.io/spec/lang/2022R1/#section_5.1.3">isolation</a>. The last of these is a complex, <a href="https://ballerina.io/spec/lang/2022R1/#isolated_functions">multi</a>-<a href="https://ballerina.io/spec/lang/2022R1/#section_6.2.6">faceted</a> <a href="https://ballerina.io/learn/by-example/isolated-objects.html">feature</a>, but the compiler can <a href="https://ballerina.io/spec/lang/2022R1/#section_8.3.3">infer</a> it within a single module. The overall effect is that the compiler can check whether a service's access to mutable state is always properly locked; if it is, then the Listener implementation allows parallel execution of that service; if not, then the compiler can tell the user that they need to add locks.
</p>
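<p>
The pieces fit together roughly as in this sketch: an <code>isolated</code> module-level variable may only be touched inside <code>lock</code> statements, and an <code>isolated</code> function is one the runtime may safely run on parallel strands.
</p>

```ballerina
// The compiler only allows access to this variable inside lock
// statements, and only from isolated functions or methods.
isolated int hits = 0;

// Safe to execute in parallel: all shared mutable state it
// touches is protected by lock.
isolated function recordHit() returns int {
    lock {
        hits += 1;
        return hits;
    }
}
```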
<p>
The platform uses asynchronous IO throughout, but this <a href="https://ballerina.io/why-ballerina/cloud-native/#async-network-protocol">is not exposed to the programmer</a>. Async functions are not distinguished as a separate kind of function. The programmer can instead think in terms of logical threads of control, which Ballerina calls <em>strands</em>; these are similar to virtual threads <a href="https://openjdk.java.net/jeps/425">proposed for Java</a>, or goroutines in Go.
</p>
<h3 id="supporting-features">Supporting features</h3>
<h4 id="transactions">Transactions</h4>
<p>
The language provides <a href="https://ballerina.io/learn/distinctive-language-features/concurrency/#transactions">transaction-related features</a>, which make it easier to code robust transaction logic and enable some logic errors to be caught at compile-time. (Note that this is not transactional memory.) These rely on there being a transaction manager provided by the runtime and standard library.
</p>
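<p>
The shape of the feature is sketched below; <code>debit</code> and <code>credit</code> are hypothetical stand-ins for real transactional work. The compiler checks that every path through the block ends in a commit or a rollback.
</p>

```ballerina
function transfer() returns error? {
    transaction {
        check debit();
        check credit();
        // Every control-flow path must reach commit or rollback;
        // forgetting this is a compile-time error, not a runtime bug.
        check commit;
    }
}

// Hypothetical participants in the transaction.
function debit() returns error? => ();
function credit() returns error? => ();
```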
<p>
The language accommodates distributed transactions by allowing service and client remote and resource methods to be transaction-aware. It also supports a form of compensation by allowing participants in a distributed transaction to register code to be run when a distributed transaction completes. What makes distributed transactions work is not so much the language as the runtime and standard libraries: these provide a distributed transaction manager and support for transactions in the HTTP listener and client implementations.
</p>
<h4 id="databases">Databases</h4>
<p>
The standard library provides support for <a href="https://ballerina.io/#working-with-databases">accessing SQL databases</a>. A SQL database is accessed using a client object. Database transactions integrate with the language's transaction features by making the remote methods on the SQL client object be transactional.
</p>
<p>
Data whose types are defined by the SQL schema is transformed into Ballerina values with user-defined record types, using the same language features that other network clients use to transform data received from a server.
</p>
<p>
SQL queries are represented using a template feature similar to <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals">JavaScript template literals</a>. This allows Ballerina values representing query parameters to be automatically converted to SQL values.
</p>
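<p>
A sketch, assuming a MySQL client and an <code>employees</code> table (both hypothetical): the interpolation becomes a bound query parameter, not string concatenation, so SQL injection is avoided by construction.
</p>

```ballerina
import ballerina/sql;
import ballerinax/mysql;

type Employee record {|
    int id;
    string name;
|};

function findEmployee(mysql:Client db, int id) returns Employee|error {
    // ${id} is passed to the database as a parameter,
    // not spliced into the SQL text.
    sql:ParameterizedQuery q = `SELECT id, name FROM employees WHERE id = ${id}`;
    return db->queryRow(q);
}
```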
<p>
The language-provided <a href="https://ballerina.io/spec/lang/2022R1/#section_5.5.7">stream</a> type is used to return the results of a query. The language-integrated <a href="https://ballerina.io/spec/lang/2022R1/#section_6.35">query feature</a> can be applied to streams directly to allow for further program code to further refine the query results or combine them with the results of queries from other databases, without having to keep the full result in memory.
</p>
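<p>
Continuing the sketch (the table and client are still hypothetical), a query expression can consume the result stream row by row.
</p>

```ballerina
import ballerina/sql;
import ballerinax/mysql;

type Employee record {|
    int id;
    string name;
|};

function namesAbove(mysql:Client db, int threshold) returns string[]|error {
    stream<Employee, sql:Error?> rows = db->query(`SELECT id, name FROM employees`);
    // The stream is consumed lazily; the full result set is never
    // held in memory at once.
    return check from Employee e in rows
        where e.id > threshold
        select e.name;
}
```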
<h4 id="configuration-data">Configuration data</h4>
<p>
Most real-life programs need access to configuration data at runtime. Ballerina has language support for this. The language support consists just of allowing specific module-level variables to be declared as <a href="https://ballerina.io/learn/configure-ballerina-programs/configure-a-sample-ballerina-service/">configurable</a>; there can be a default for the value or it can be required to be specified in the configuration. The runtime uses a TOML file to initialize configurable variables.
</p>
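<p>
For example (variable names are illustrative): a <code>configurable</code> declaration with a default is optional in the configuration, while <code>?</code> marks it as required.
</p>

```ballerina
// Optional: defaults to 8080 if the configuration omits it.
configurable int port = 8080;
// Required: program startup fails unless the configuration
// supplies a value.
configurable string dbUser = ?;
```

<p>
At runtime these would be initialized from a TOML file, e.g. lines such as <code>port = 9090</code> and <code>dbUser = "admin"</code> in Config.toml.
</p>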
<p>
Although this is a very simple language feature, it combines with other Ballerina language features (types, plain data, read-only) to provide a powerful capability: the structure and type of all configuration input to a program is known at compile time, which greatly facilitates the management of the data by higher-level layers.
</p>
<h4 id="language-integrated-query">Language-integrated query</h4>
<p>
Ballerina provides a <a href="https://ballerina.io/learn/work-with-data-in-ballerina/">language-integrated query</a> feature, which is a generalization of the list comprehensions found in many programming languages. The syntax is similar to C# LINQ declarative query syntax. But whereas the semantics of the C# LINQ syntax are defined in terms of a desugaring into method calls, the semantics of the Ballerina query syntax (which are inspired by XQuery FLWOR expressions) are defined directly in terms of operations on Ballerina's built-in collection types.
</p>
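<p>
A small example of a query expression over a built-in list (the data is invented for illustration):
</p>

```ballerina
type Student record {|
    string name;
    int score;
|};

public function main() {
    Student[] students = [
        {name: "Ann", score: 82},
        {name: "Bob", score: 55},
        {name: "Cam", score: 94}
    ];
    // Clauses operate directly on the built-in collection types;
    // there is no desugaring into method calls as in C# LINQ.
    string[] honors = from var s in students
        where s.score >= 80
        order by s.score descending
        select s.name;
    // honors is ["Cam", "Ann"]
}
```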
<p>
The table collection type and query are designed to work nicely together. Tables are similar to lists of records with a primary key, and query syntax extends to them more smoothly than to maps, where the key is held separately from the value. Queries have a join clause that turns into a hash join when used with tables.
</p>
<p>
Query allows many data transformations to be written in a declarative way, using expressions rather than statements, which enables a graphical user interface based on data flow.
</p>
<h4 id="xml">XML</h4>
<p>
Ballerina has a separate <a href="https://ballerina.io/spec/lang/2022R1/#XML"><code>xml</code> data type</a>, modeled after XQuery, which also counts as plain data.
</p>
<p>
The platform supports two ways of serializing xml values. When the entire network message is XML, then the xml value is serialized as an XML document. When an xml value is included within a structure serialized as JSON, the xml value is serialized as a JSON string. This is convenient when the <code>xml</code> value is being used to represent HTML.
</p>
<p>
The language-integrated query feature also works with <code>xml</code>: XML structures can be used as input and/or output to a query. This combines with a specialized XPath-like <a href="https://ballerina.io/spec/lang/2022R1/#section_6.36">XML-navigation syntax</a>.
</p>
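<p>
A brief sketch of the combination (the element names are invented): an <code>xml</code> template constructs the value, and the XPath-like navigation steps feed a query.
</p>

```ballerina
public function main() {
    // An xml template literal constructs an xml value.
    xml catalog = xml `<books>
        <book><title>Tutorial</title></book>
        <book><title>Reference</title></book>
    </books>`;
    // XPath-like navigation (/<book>/<title>) combined with query.
    xml titles = from xml t in catalog/<book>/<title>
        select t;
}
```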
<h3 id="future">Future</h3>
<p>
The long-term vision for Ballerina includes a number of important features which are not yet implemented, but which have a foundation in existing language features.
</p>
<ul>
<li>Event streams (unbounded streams of records with timestamps). Be able to both generate them and query them (using various kinds of windows). Related to this is more first-class support for a client subscribing to a stream of events from the server.
<li>Network security. Language support to help the user avoid network security problems (we have experimented with a feature similar to tainting in Perl); this can leverage the explicitness of network interactions in Ballerina.
<li>Service choreography. Be able to write a single description that describes how multiple services interact and use that to derive the types of individual services. This could handle services implemented in other programming languages by using Ballerina service types as an IDL.
<li>Workflow. Support long-running process execution. Be able to suspend a program and later resume it as a result of an incoming network message. This also requires that transactions get better support for compensation.
</li>
</ul>
<h2>Ballerina Programming Language - Part 1: Concept</h2>
<p><em>James Clark, 2019-09-12</em></p>
<p>In the <a href="https://blog.jclark.com/2019/09/ballerina-programming-language-part-0.html">previous post</a>, I talked about the context for Ballerina. In this post, I want to explain what kind of programming language it is. We can summarize this as a number of design goals:<br />
</p><ol><li>Provide abstractions for networking</li>
<li>Use sequence diagrams as the visual model</li>
<li>Minimize cognitive load</li>
<li>Leverage familiarity</li>
<li>Enable a semantically-rich static program model</li>
<li>Provide a complete platform, not just a language</li>
<li>Allow multiple implementations, based on different runtime environments</li>
</ol><p>This is not exactly the list we started out with: it has evolved in the light of experience.<br />
</p><p>I will talk about each of these points in turn. The first two of these are a bit different. They correspond to the two fundamental features that make Ballerina unique.<br />
</p><h3>Networking</h3><p>The primary function of an ESB is to send and receive network messages. So language-level support for this is central to the Ballerina project.<br />
</p><p>On the sending side, the key abstraction is a remote method. A remote method is part of a client object; there is a distinct syntax for calling remote methods. A program sends a message by calling a remote method; the return value of the remote method can describe the response to that message. Implementation of remote methods is typically provided by library code or is auto-generated; each protocol will have its own client object implementation. Application code calls remote methods on client objects.<br />
</p><p>On the receiving side, the key abstraction is a resource method; a resource method is part of a service. Application code provides services by implementing resource methods. This works in conjunction with listener objects. Implementation of listener objects is typically provided by library code; each protocol will have its own listener object implementation. Listener objects call resource methods on services provided by application code.<br />
</p><p>There's a final twist that ties together the sending and receiving side: resource methods are typically passed a client object to allow them to send messages back to the client.<br />
</p><p>So what is gained by providing language-level support for networking?<br />
</p><p>Most importantly, it enables a visual model that shows the program's behaviour in terms of these abstractions: the visual model can show how the program interacts using network messages. This links up with the second unique feature of Ballerina - the use of sequence diagrams as the visual model.<br />
</p><p>It also provides a purpose-designed syntax, which does not require the developer to jump through a series of hoops. You use the language-provided syntax and it just works. It's as easy as writing a function. This is not all that important for a large program. But many programs that perform integration tasks are small, and with small programs reducing the ceremony matters. You could compare this with how AWK takes care of opening files and iterating over each line of the file: it's not hard to do, but for a small program the fact that AWK takes care of this for you is a significant convenience.<br />
</p><p>Related to this is that Ballerina's model of program execution incorporates the concept of running as a service. You don't have to write an explicit loop waiting for network requests until you get a signal. The language runtime deals with all that for you. Again, not revolutionary, but it makes a difference.<br />
</p><p>The final advantage relates to typing. At the moment the type of a resource method is just a function type, and the type of a service is just a collection of these function types. But we want to do better than this. The type of a resource method should capture not just the type of its parameters and return value, but the type of the messages that it expects to receive, and the type of the message that it will send in response. (The former is at the moment partially captured by annotations, which can be generated from, for example, Swagger/OpenAPI descriptions.) Furthermore, the type of a service should capture not just the type of each message exchange separately, but also the relationship between the exchanges. This is usually called session typing and is an active area of <a href="https://groups.inf.ed.ac.uk/abcd/">research</a>.<br />
</p><h3>Sequence diagrams</h3><p>WSO2's experience from working with customers over many years has been that drawing a sequence diagram is typically the best way to describe visually how services interact. An ESB's visual model is typically based on dataflow model, which works well for simple cases but is not as expressive. So one big idea underlying Ballerina is that you should be able to visualize a function or program as a sequence diagram.<br />
</p><p>It is important to understand that the visualization of Ballerina code as a sequence diagram is not simply a matter of tooling that is layered on top of the Ballerina language. It took me a long time to really grok Sanjiva's concept for how the language relates to sequence diagrams. My initial reaction was that it seemed to me like a category error. Sequence diagrams are just a kind of picture. What's that got to do with the syntax and semantics of a programming language?<br />
</p><p>The concept is to design the syntax and semantics of the language's abstractions for sending network messages, for in-process message passing and for concurrency so that they have a close correspondence to sequence diagrams. This enables a bidirectional mapping between the textual representation of a function in Ballerina syntax and the visual representation of the function as a sequence diagram. The sequence diagram representation fully shows the behaviour of the function as it relates to concurrency and network interaction.<br />
</p><p>The closest analogy I can think of is Visual Basic. The visual model of a UI as a form is integrated with the language semantics to make writing a Windows GUI application much easier than before. Ballerina is trying to do something similar but for a different domain. You could think of it as Visual Basic for the Cloud.<br />
</p><h3>Cognitive load</h3><p>Programming languages differ in the demands they make of a programmer. One way to look at this is in terms of different developer personas, such as Microsoft's <a href="https://blog.codinghorror.com/mort-elvis-einstein-and-you/">Einstein, Elvis and Mort</a> personas. But it's hard to do that without implying that one kind of developer is inherently superior to another, and I don't think that's a helpful way to look at things. I prefer to think of it like this: a programming language both gives and takes. It gives abstractions to make it convenient to express solutions, and it gives the ability to detect classes of errors at compile time. But it takes intellectual effort to understand the abstractions that are provided and to fit the solution into those abstractions. In other words, it relieves the programmer of some of the cognitive load required to write and maintain a program, but it also imposes its own cognitive load. Every programming language needs to strike a balance between what it gives and what it takes that is appropriate for the kind of program for which it is intended to be suitable.<br />
</p><p>For Ballerina, the goal has been for it to make only modest demands of the programmer. Integration tasks are often quite mundane; people just want to get things working and move on. But these integrations, although mundane, can be critically important to a business: so they need to be reliable and they need to be maintainable. So the language tries to nudge programmers in the direction of doing things in a reliable and maintainable way.<br />
</p><h3>Familiarity</h3><p>One way to reduce cognitive load is to take advantage of people's familiarity with programming languages. Specifically, Ballerina tries to take advantage of familiarity with programming languages in the C family, such as C, C++, Java, JavaScript and C#. This applies to both syntax and semantics. It is not a hard and fast rule, but a guideline: don't be different from C without a good reason, and elegance does not by itself count as a good reason. A good example would be the rules for operator precedence: the C rules are quite a bit different from what I would design if I was starting from scratch, but the benefits from better rules just aren't enough to make it worth being different from all the other languages in the C family.<br />
</p><h3>Semantically-rich static program model</h3><p>I have struggled to find the right phrase to describe this. It is a generalization of static typing. The idea is that the language should enable programs to describe their semantic properties in a machine-readable way. The objective is to enable tools to construct a model of the program that incorporates these properties, and then use that model to help the developer write correct programs. This ties in with the cognitive load point. A semantically-rich model enables more powerful tools, which help reduce the effective cognitive load on the programmer.<br />
</p><p>"Static" means that a tool can build a model of the program just by analysing the source code, without needing to execute the program. Often this is called "compile-time", but that doesn't seem appropriate for the way an IDE will use this model. Visual Basic used to call it "design time", but that seems a bit narrow too: continued maintenance is just as important as initial design.<br />
</p><p>For types, it means we want a static type system. But our approach to static typing is pragmatic. The static type system is there to help the programmer. We don't want the static type system to be so sophisticated or so inflexible that it becomes an obstacle to writing programs. The goal is not to statically type as much as possible, but to statically type to the extent that it is likely to be helpful to the programmer writing the kinds of program for which we intend Ballerina to be used.<br />
</p><p>Types are just one kind of semantic richness. There are many others.<br />
</p><ul><li>Sequence diagrams depend on building a model of the program where sends and receives are matched up.</li>
<li>Documentation that is structured, not simply free-form comments, can be checked for consistency with the program, and can be made available through the IDE.</li>
<li>Properties of services and listeners can be used to automate deployment of Ballerina programs to the cloud.</li>
</ul><h3>Platform</h3><p>There's a distinction between the core of a programming language, which defines the syntax of the language and the semantics of that syntax, and the surrounding ecosystem. Often the core comes first, and the ecosystem develops organically as the core gains popularity. A lot of the utility of any language comes from the surrounding ecosystem.<br />
</p><p>In Ballerina, we refer to the core as "the language" - it's the part that's defined in the language specification. The language has been designed in conjunction with key components of the surrounding ecosystem, which we call the "platform".<br />
</p><p>The platform includes:<br />
</p><ul><li>a standard library</li>
<li>a centralized module repository, and the tooling needed to support that</li>
<li>a documentation system (based on Markdown)</li>
<li>a testing framework</li>
<li>extensions/plug-ins for popular IDEs (notably Visual Studio Code).</li>
</ul><p>This all takes a lot of work, and is a big factor in why Ballerina has required such a large investment of resources from WSO2.<br />
</p><h3>Multiple implementations</h3><p>Although the current implementation of Ballerina compiles to JVM bytecode, Ballerina is emphatically not a JVM language. We are planning to do an implementation that compiles directly to native code and we've started to look at using LLVM for this. I suspect that an implementation targeting WebAssembly will also be important long-term.<br />
</p><p>We have been careful to ensure that the language semantics, particularly as regards concurrency, are not tied to the JVM. This was part of the motivation for the initial proof-of-concept implementation approach, which compiled into bytecode for its own virtual machine (BVM), which was then interpreted by a runtime written in Java. Although the 1.0 implementation compiles directly to the JVM, it is not an entirely straightforward mapping; it takes some tricks to implement the Ballerina concurrency semantics (similar to how Kotlin implements coroutines). <br />
</p><p>There are languages that are defined by an implementation and there are languages defined by a specification. For a language with multiple implementations, it is much better if the language is defined by a specification, rather than by the idiosyncrasies of a particular implementation. The Ballerina language is defined by its specification. This specification does not in any way depend on Java.<br />
</p><p>Initially, the specification was a partial description of the implementation. But now we have evolved to a situation where the implementation is done based on the specification. From a language design point of view, we are ready for multiple implementations. It is "just" a matter of finding resources to do the implementation. One of my hopes in writing this sequence of blog posts is that somebody outside WSO2 will feel inspired to do their own implementation. We would be more than happy to work with anybody who wants to take this on.<br />
</p><h3>Conclusion</h3><p>There has long been a distinction, originally due to John <a href="https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy">Ousterhout</a>, between systems programming languages and scripting or glue languages. Systems programming languages are statically typed, high performance and designed for programming in the large. Scripting/glue languages are dynamically typed, low performance and designed for programming in the small.<br />
</p><p>There are elements of truth in this distinction, but I see it more as a spectrum than as a dichotomy, and I see Ballerina as being somewhere in the middle of that spectrum. It has static typing, but it's much less rigid than the kind of static typing that systems programming languages have. It is capable of decent performance: it should be possible to make it quite a bit faster than Python, but it will never rival Rust or C++. It's not designed for programs with hundreds of thousands of lines of code, but it's also not designed for one-liners. Here's how I would place Ballerina on the spectrum relative to some other languages:<br />
</p><ul><li>Assembly</li>
<li>Rust, C</li>
<li>C++</li>
<li>Go, Java, C#</li>
<li><em>Ballerina</em></li>
<li>TypeScript</li>
<li>Python, JavaScript</li>
<li>PowerShell, Bourne shell, TCL, AWK</li>
</ul><p>Go is a bit hard to place relative to Java/C#. In some ways, it's more on the systems side (no VM); in some ways, it's more on the scripting side (typing). I would put Ballerina between Go and TypeScript.<br />
</p><p>In future posts, I will get into the concrete language features that these design goals have led us to. The details of the core language are in the <a href="https://ballerina.io/spec/lang/2019R3/">language specification</a>. The rest of the platform does not yet have proper specifications, but there is lots of <a href="https://ballerina.io/learn">documentation</a> on the web site.<br />
</p>James Clarkhttp://www.blogger.com/profile/08445951113700394609noreply@blogger.com7tag:blogger.com,1999:blog-3944976411672994427.post-28285978754061353672019-09-11T08:40:00.000+07:002019-09-13T08:22:29.444+07:00Ballerina Programming Language - Part 0: Context<p>Well, it's been 9 years since my last blog post. It's been an eventful period in real life: I got married, we have two children, I became a Thai citizen, built a house and had major back surgery.<br />
</p><p>For the last 18 months, I have been working on the design of a new programming language called <a href="https://ballerina.io/">Ballerina</a>. Version 1.0 of Ballerina has just been released, so now is a good time to start explaining what it's all about. In subsequent posts, I will delve into the technical details, but in this post I want to provide some context: the "who" and the "why".<br />
</p><p>TL;DR Ballerina was designed to be the core of a language-centric, cloud-native approach to enterprise integration. It comes from WSO2, which is a major open source enterprise integration vendor. I have been working on the language design and <a href="https://ballerina.io/spec/lang/2019R3/">specification</a>. I think it has potential beyond the world of enterprise integration.<br />
</p><p>The main person behind Ballerina is Sanjiva Weerawarana. I've known Sanjiva since around 1999 (20 years!), when we were both on the W3C XSL WG doing XPath and XSLT 1.0. Sanjiva at that time was working for IBM Research (where his boss at one point was Sharon Adler, who I had worked with on the ISO DSSSL committee).<br />
</p><p>This was the era of peak XML, before JSON was invented, and people were using XML for all sorts of things for which it was not very well-suited, including SOAP and the whole Web Services stack built on top of that. Sanjiva worked on several important parts of that including WSDL and BPEL.<br />
</p><p>Around 2005, Sanjiva decided he wanted to leave IBM and start a company with some fellow IBMers. He is from Sri Lanka, and wanted to go back. At that time, I was working for the Thai government. I had persuaded them to start an open source promotion activity, and I was running that for them (one day I should write a blog about that).<br />
</p><p>On Boxing Day 2004, there was a huge tsunami in the Indian Ocean, which was a disaster for several countries including Thailand and Sri Lanka. As part of the recovery process, the Thai government had organized an international IT conference in Phuket at the beginning of 2005. Sanjiva came to talk about <a href="https://sahanafoundation.org/">Sahana</a>, which was an effort started in Sri Lanka to use open source to help with recovery from the tsunami.<br />
</p><p>On the sidelines of the conference Sanjiva pitched me the idea for the company, at that time called Serendib Systems (the word serendipity comes from Serendip, which is an old name for Sri Lanka). The idea was to do open source related to web services, based in Sri Lanka. It was at the intersection of a number of my main interests at the time (XML and open source in developing countries), and I had confidence in Sanjiva, so it wasn't a hard decision to invest.<br />
</p><p>The name was changed to WSO2 (WS as in web services, O2 as in oxygen), Sanjiva took the role of CEO and I joined the board. WSO2 has grown steadily in the 14 years since it was founded, and now has about 600 employees. It has remained an open source company and it has developed a comprehensive open source enterprise integration platform. You may well never have heard of WSO2; we have always been rather better at the technical side of things than the marketing side. But we are actually a major vendor in the open source enterprise integration space, with lots of global Fortune 500 customers. In fact, there’s some Gartner report that says we are the world’s #1 open source integration vendor, although I’m not quite sure on what metric.<br />
</p><p>For quite some time, the workhorse of enterprise integration has been the Enterprise Service Bus (ESB). An ESB sends and receives network messages over a variety of transports, and there is a configuration language, typically in XML, that describes the flow of these messages. The configuration language can be seen as a domain-specific language (DSL) for integration. It supports abstractions like mediators, endpoints, proxy services and scheduled tasks, which allow a given message flow to be described at a higher level than would be possible if the equivalent code were written in a programming language such as Java or Go. ESB products (including WSO2's) typically include a GUI for editing the configuration language. The ESB's higher-level abstractions allow for a much more useful graphical view than would be possible with a solution that was written in a programming language.<br />
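</p><p>To make this concrete, here is a sketch of the kind of XML configuration such a DSL uses. This is a hypothetical example, loosely in the style of typical mediation configs; it is not any particular product's actual syntax:</p>

```xml
<!-- Hypothetical ESB mediation flow (illustrative only): expose a
     proxy service, log the incoming message, transform it with XSLT,
     and forward it to a backend endpoint. -->
<proxy name="StockQuoteProxy">
  <inSequence>
    <log level="full"/>
    <xslt key="transform-request.xslt"/>
    <send>
      <endpoint>
        <address uri="http://backend.example.com/stockquote"/>
      </endpoint>
    </send>
  </inSequence>
</proxy>
```

<p>Everything in this flow stays within the DSL's abstractions; the trouble starts as soon as the required logic goes beyond what elements like these can express.<br />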
</p><p>The fact that an ESB is not a full programming language has important consequences. It means that at a certain point you fall off a cliff: there are things you simply cannot express in the XML configuration language. ESBs typically deal with this by allowing you to write extensions in Java. In practice, this means that complex solutions are written as a combination of XML configuration and Java extensions. This creates a number of problems. First, the ESB is tied to Java. Ten years ago that wasn't really a problem, but increasingly Java is the new COBOL. The cool kids are interested in Go, TypeScript or Rust and would not even consider Java. Oracle's stewardship of Java does not help. Second, the Java extensions are a black box as far as the graphical interface is concerned. Third, using multiple languages creates additional complexity for many aspects of the software development process: build, deployment, debugging. Fourth, it's bad in terms of the cognitive load that it places on the developer team: the developers have to learn two quite different languages, and continually switch gears between them.<br />
</p><p>The other fundamental problem with the ESB concept is that it is designed for a centralized deployment model. The idea is that the IT department of an enterprise runs the Enterprise Service Bus for the entire enterprise. It is not only the large footprint of an ESB that pushes in this direction, but also the licensing model: ESBs are typically not cheap and are licensed on a per-server basis. If you think of the XML configuration language as a domain-specific programming language, and of the ESB as the runtime for that language, you in effect have one large program, controlling integration across the entire enterprise. Furthermore, this program is not written in a pleasant, modern programming language, with support for modularity, but is rather just a pile of XML. As you can imagine, this is not good for agility or DevOps.<br />
</p><p>This is the background that led to the creation of Ballerina. The high-level goal is to provide the foundation of a new approach to enterprise integration that is a better fit for current computing trends than the ESB. Obviously, the cloud is a hugely important part of this. The Ballerina concept evolved over a number of years. I see three stages:<br />
</p><ol><li>Let’s do a better DSL that looks more like a programming language!</li>
<li>Let’s make it a full programming language!</li>
<li>Let’s take a shot at becoming a mainstream programming language!</li>
</ol><p>Stage 2 marks the start of the Ballerina project, and was when the name was chosen; that happened in August 2016.<br />
</p><p>My first involvement with Ballerina was at the beginning of 2017, when Sanjiva asked me to help with the design of the language support for XML. But I only started to get really deeply involved in Ballerina in February 2018. At that point there was already a working, proof-of-concept implementation. Sanjiva asked me to help write a language specification.<br />
</p><p>When I started, we did not think it would take all that long for me to write a specification. We were completely wrong about that! It's been 18 months already, and it is still a work-in-progress. What happened is that as we dug into the details of the language, it became apparent that there was a lot of scope for improvement in the design. The job turned out to be more about refining and evolving the language design, rather than just documenting what had been implemented. As it became clearer that the goal was eventually to become a mainstream programming language, the quality bar for the implementation needed to be raised.<br />
</p><p>Sanjiva's primary area of expertise is distributed systems, and WSO2's collective expertise is centered around enterprise middleware, rather than programming language design and implementation. When they started the Ballerina project, I think they underestimated the enormity of the project that they had taken on, as did I to some extent. As I have been wrestling with the Ballerina language design, I have gained a much better appreciation of just how hard programming language design is. I have looked at many other programming languages for inspiration. I've been incredibly impressed by how good the current generation of programming languages are. I would particularly highlight TypeScript, Go, Rust and Kotlin. Each of them has a very different language concept, but every one of them has done an amazing job of designing a programming language that realizes their concept. I take my hat off to their designers.<br />
</p><p>I should say something about what version 1.0 means. First, I should explain that we make a distinction between the implementation version and the specification version. 1.0.0 is the implementation version. Language specifications are labelled chronologically (it's a living standard!). The 1.0.0 implementation is based on the language specification labelled 2019R3, which means the 3rd release of 2019.<br />
</p><p>1.0 does not mean that we have got either the language design or implementation to where we want it. If we lived in a world unsullied by commercial or competitive reality, we could easily spend a couple of years extending and improving the design and implementation. But WSO2 is not a huge company, and we have already made a very substantial investment in Ballerina (of the order of 50 engineers over 3 years). So we need to get something out there, so that we can get some proof points to justify continued investment. The benchmark for 1.0 is whether it works better for enterprise integration than our current ESB-based product. It needs to be sufficiently stable and performant that we can support it in production for enterprise customers.<br />
</p><p>We also have a reasonable degree of alignment between the language specification and the compiler: what the compiler implements is a subset of what the specification describes, with a couple of caveats. The first caveat is that there are some non-core features that are not quite stable. These are labelled "preview" in the specification. We expect to stabilise these soon, and that will involve some minor incompatible changes. The second caveat is that the implementation has some experimental features, which are not in the specification; we plan that the language will eventually include features that provide similar functionality.<br />
</p><p>The language design described by the current specification has two fundamental features that are unique (at least, not found in any mainstream programming language). Its combination of other features is also unique: each feature individually appears in some language, but no language has all of them. I think the language design is interesting not just for enterprise integration, but for any application which is mainly about combining services, whether consuming them or providing them. As things move to the cloud, more and more applications will fall into this category. Although the current state of the language design is interesting, I think the potential is even more interesting. Over the next year or two, we will stabilize more of the integration-oriented language features, which will make Ballerina quite different from any other programming language. Unfortunately, it takes a lot of work to get the general-purpose features solid and that has to be done before the more domain-specific features can be finalized.<br />
</p><p>Overall, the 2019R3 language design and the 1.0 implementation are an initial, stable step, but there is still a long way to go.<br />
</p><p>In future posts, I will get into the design of the language. In the meantime, you can try out the <a href="https://ballerina.io/downloads/">implementation</a> and read the <a href="https://ballerina.io/spec/lang/2019R3/">specification</a>. The design process was initially quite closed, but has gradually become more open. Most of the discussion on the spec happens in <a href="https://github.com/ballerina-platform/ballerina-spec/issues">issues</a> in the spec's GitHub repository. Major new language features have public <a href="https://github.com/ballerina-platform/ballerina-spec/blob/master/lang/proposals/README.md">proposals</a>. Comments and suggestions are welcome; the best way to provide input is to open a new <a href="https://github.com/ballerina-platform/ballerina-spec/issues">issue</a>.<br />
</p><p>See the <a href="https://blog.jclark.com/2019/09/ballerina-programming-language-part-1.html">next post</a> in the series.<br />
</p>James Clarkhttp://www.blogger.com/profile/08445951113700394609noreply@blogger.com7tag:blogger.com,1999:blog-3944976411672994427.post-65286343714516518002010-12-18T10:03:00.001+07:002010-12-18T10:27:33.927+07:00More on MicroXML<p>There's been lots of useful feedback to my previous post, both in the <a href="https://www.blogger.com/comment.g?blogID=3944976411672994427&postID=4828994358429784181">comments</a> and on <a href="http://lists.xml.org/archives/xml-dev/201012/threads.html">xml-dev</a>, so I thought I would summarize my current thinking.</p>
<p>It's important to be clear about the objectives. First of all, MicroXML is not trying to replace or change XML. If you love XML just as it is, don't worry: XML is not going away. Relative to XML, my objectives for MicroXML are:</p>
<ol>
<li>Compatible: any well-formed MicroXML document should be a well-formed XML document.</li>
<li>Simpler and easier: easier to understand, easier to learn, easier to remember, easier to generate, easier to parse.</li>
<li>HTML5-friendly, thus easing the creation of documents that are simultaneously valid HTML5 and well-formed XML.</li>
</ol>
<p>JSON is a good, simple, extensible format for data. But there's currently no good, simple, extensible format for documents. That's the niche I see for MicroXML. Actually, <em>extensible</em> is not quite the right word; <em>generalized</em> (in the S<strong>G</strong>ML sense) is probably better: I mean something that doesn't build in tag names with predefined semantics. HTML5 is extensible, but it's not generalized.</p>
<p>There are a few technical changes that I think are desirable.</p>
<ul>
<li><strong>Namespaces</strong>. It's easier to start simple and add functionality later, rather than vice-versa, so I am inclined to start with the simplest thing that could possibly work: no colons in element or attribute names (other than xml:* attributes); "xmlns" is treated as just another attribute. This makes MicroXML backwards compatible with XML Namespaces, which I think is a big win.</li>
<li><strong>DOCTYPE declaration</strong>. Allowing an empty DOCTYPE declaration &lt;!DOCTYPE <em>foo</em>&gt; with no internal or external subset adds little complexity and is a huge help with HTML5-friendliness. It should be a well-formedness constraint that the name in the DOCTYPE declaration match the name of the document element.</li>
<li><strong>Data model</strong>. It's a fundamental part of XML processing that &lt;<em>foo</em>/&gt; is equivalent to &lt;<em>foo</em>&gt;&lt;/<em>foo</em>&gt;. I don't think MicroXML should change that, which means that the data model should not have a flag saying whether an element uses the empty-element syntax. This is inconsistent with HTML5, which does not allow these two forms to be used interchangeably. However, I think the goal of HTML5-friendliness has to be balanced against the goal of simple and easy and, in this case, I think simple and easy wins. For the same reason, I would leave the DOCTYPE declaration out of the data model.</li>
</ul>
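<p>The DOCTYPE constraint above is easy to check mechanically. Here is a minimal sketch in Python, using only the standard library; the regular expression is a simplification of the real grammar (for example, it ignores leading comments), so treat it as an illustration rather than a conforming parser:</p>

```python
import re
import xml.etree.ElementTree as ET

def doctype_matches_root(doc: str) -> bool:
    """Check the proposed well-formedness constraint: the name in an
    empty DOCTYPE declaration must match the document element's name."""
    m = re.match(r"\s*<!DOCTYPE\s+([^\s>]+)\s*>", doc)
    if m is None:
        return True  # no DOCTYPE declaration, so nothing to check
    root = ET.fromstring(doc[m.end():])
    return root.tag == m.group(1)

# The name matches the document element: accepted.
assert doctype_matches_root("<!DOCTYPE doc><doc><p>hello</p></doc>")
# The name does not match the document element: rejected.
assert not doctype_matches_root("<!DOCTYPE doc><other/>")
```

<p>A real MicroXML parser would fold this check into parsing, but the constraint itself involves nothing more than comparing two names.</p>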
<p>Here's an updated grammar.</p>
<pre># Documents
document ::= comments (doctype comments)? element comments
comments ::= (comment | s)*
doctype ::= "<!DOCTYPE" s+ name s* ">"
# Elements
element ::= startTag content endTag
| emptyElementTag
content ::= (element | comment | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= attributeName s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
| "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<'|'&')
attributeName ::= "xml:"? name
# Data characters
dataChar ::= char - ('<'|'&'|'>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Comments
comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# Enforce the HTML5 restriction that comments cannot start with '>' or '->'
commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
# As in XML 1.0
commentContentContinue ::= (char - '-') | ('-' (char - '-'))
# Names
name ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
| [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
| [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
</pre>James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com42tag:blogger.com,1999:blog-3944976411672994427.post-48289943584297841812010-12-13T15:57:00.001+07:002010-12-13T15:57:25.250+07:00MicroXML<p>There's been a lot of <a href="http://lists.xml.org/archives/xml-dev/201012/threads.html">discussion</a> on the <a href="http://www.xml.org/xml-dev/">xml-dev</a> mailing list recently about the future of XML. I see a number of different possible directions. I'll give each of these possible directions a simple name:</p>
<ul>
<li>XML 2.0 - by this I mean something that is intended to replace XML 1.0, but has a high degree of backward compatibility with XML 1.0;</li>
<li>XML.next - by this I mean something that is intended to be a more functional replacement for XML, but is not designed to be compatible (however, it would be rich enough that there would presumably be a way to translate JSON or XML into it);</li>
<li>MicroXML - by this I mean a subset of XML 1.0 that is not intended to replace XML 1.0, but is intended for contexts where XML 1.0 is, or is perceived as, too heavyweight.</li>
</ul>
<p>I am not optimistic about XML 2.0. There is a lot of inertia behind XML, and anything that is perceived as changing XML is going to meet with heavy resistance. Furthermore, backwards compatibility with XML 1.0 and XML Namespaces would limit the potential for producing a clean, understandable language with really substantial improvements over XML 1.0.</p>
<p>XML.next is a big project, because it needs to tackle not just XML but the whole XML stack. It is not something that can be designed by a committee from nothing; there would need to be one or more solid implementations that could serve as a basis for standardization. Also given the lack of compatibility, the design will have to be really compelling to get traction. I have a lot of thoughts about this, but I will leave them for another post.</p>
<p>In this post, I want to focus on MicroXML. One obvious objection is that there is no point in doing a subset now, because the costs of XML complexity have already been paid. I have a number of responses to this. First, XML complexity continues to have a cost even when XML parsers and other tools have been written; it is an ongoing cost to users of XML and developers of XML applications. Second, the main appeal of MicroXML should be to those who are not using XML, because they find XML overly complex. Third, many specifications that support XML are in fact already using their own ad-hoc subsets of XML (eg XMPP, SOAP, E4X, Scala). Fourth, this argument applied to SGML would imply that XML was pointless.</p>
<p>HTML5 is another major factor. HTML5 defines an XML syntax (ie XHTML) as well as an HTML syntax. However, there are a variety of practical reasons why XHTML, by which I mean XHTML served as application/xhtml+xml, isn't common on the Web. For example, IE doesn't support XHTML; Mozilla doesn't incrementally render XHTML. HTML5 makes it possible to have "polyglot" documents that are simultaneously well-formed XML and valid HTML5. I think this is potentially a superb format for documents: it's rich enough to represent a wide range of documents, it's much simpler than full HTML5, and it can be processed using XML tools. There's a W3C <a href="http://www.w3.org/TR/html-polyglot/">WD</a> for this. The WD defines polyglot documents in a slightly different way, requiring them to produce the same DOM when parsed as XHTML as when parsed as HTML; I don't see much value in this, since I don't see much benefit in serving documents as application/xhtml+xml. The practical problem with polyglot documents is that they require the author to obey a whole slew of subtle lexical restrictions that are hard to enforce using an XML toolchain and a schema language. (Schematron can do a bit better here than RELAX NG or XSD.)</p>
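<p>For illustration, here is a minimal sketch of a polyglot document: it is well-formed XML and, I believe, also valid HTML5. The full set of required restrictions is in the W3C draft; this example just shows the flavour of them:</p>

```xml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="UTF-8"/>
    <title>Polyglot sketch</title>
  </head>
  <body>
    <p>Every element is explicitly closed, and void elements such as
    <br/> use XML's self-closing form, which the HTML5 parser also
    accepts.</p>
  </body>
</html>
```

<p>An author can stay within these constraints by hand; it is enforcing them automatically, with an XML toolchain and a schema language, that is hard.</p>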
<p>So one of the major design goals I have for MicroXML is to facilitate polyglot documents. More precisely the goal is that a document can be guaranteed to be a valid polyglot document if:</p>
<ol>
<li>it is well-formed MicroXML, and</li>
<li>it satisfies constraints that are expressed purely in terms of the MicroXML data model.</li>
</ol>
<p>Now let's look in detail at what MicroXML might consist of. (When I talk about HTML5 in the following, I am talking about its HTML syntax, not its XML syntax.)</p>
<ul>
<li><strong>Specification</strong>. I believe it is important that MicroXML has its own self-contained specification, rather than being defined as a delta on existing specifications.</li>
<li><strong>DOCTYPE declaration</strong>. Clearly the internal subset should not be allowed. The DOCTYPE declaration itself is problematic. HTML5 requires valid HTML5 documents to start with a DOCTYPE declaration. However, HTML5 uses DOCTYPE declarations in a fundamentally different way to XML: instead of referencing an external DTD subset which is supposed to be parsed, it tells the HTML parser what parsing mode to use. Another factor is that almost the only thing that the XML subsets out there agree on is to disallow the DOCTYPE declaration. So my current inclination is to disallow the DOCTYPE declaration in MicroXML. This would mean that MicroXML does not completely achieve the goal I set above for polyglot documents. However, you would be able to author a &lt;body&gt; or a &lt;section&gt; or an &lt;article&gt; as MicroXML; this would then have to be assembled into a valid HTML5 document by a separate process (albeit a very simple one). It would be great if HTML5 provided an alternate way (using attributes or elements) to declare that an HTML document be parsed in standards mode. Perhaps a boolean "standard" attribute on the &lt;meta&gt; element?</li>
<li><strong>Error handling</strong>. Many people in the HTML community view XML's draconian error handling as a major problem. In some contexts, I have to agree: it is not helpful for a user agent to stop processing and show an error, when a user is not in a position to do anything about the error. I believe MicroXML should not impose any specific error handling policy; it should restrict itself to specifying when a document is conforming and specifying the instance of the data model that is produced for a conforming document. It would be possible to have a specification layered on top of MicroXML that would define detailed error handling (as for example in the <a href="http://code.google.com/p/xml5/">XML5</a> specification).</li>
<li><strong>Namespaces</strong>. This is probably the hardest and most controversial issue. I think the right answer is to take a deep breath and just say no. One big reason is that HTML5 does not support namespaces (remember, I am talking about the HTML syntax of HTML5). Another reason is that the basic idea of binding prefixes to URIs is just too hard; the WHATWG wiki has a good <a href="http://wiki.whatwg.org/wiki/Namespace_confusion">page</a> on this. The question then becomes how MicroXML handles the problems that XML Namespaces addresses. What do you do if you need to create a document that combines multiple independent vocabularies? I would suggest two mechanisms:
<ul>
<li>I would support the use of the xmlns attribute (not xmlns:<em>x</em>, just bare xmlns). However, as far as the MicroXML data model is concerned, it's just another attribute. It thus works in a very similar way to xml:lang: it would be allowed only where a schema language explicitly permits it; semantically it works as an inherited attribute; it does not magically change the names of elements.</li>
<li>I would also support the use of prefixes. The big difference is that prefixes would be meaningful and would not have to be declared. Conflicts between prefixes would be avoided by community cooperation rather than by namespace declarations. I would divide prefixes into two categories: prefixes without any periods, and prefixes with one or more periods. Prefixes without periods would have a lightweight registration procedure (ie a mailing list and a wiki); prefixes with periods would be intended for private use only and would follow a reverse domain name convention (e.g. com.jclark.foo). For compatibility with XML tools that require documents to be namespace-well-formed, it would be possible for MicroXML documents to include xmlns:* attributes for the prefixes they use (and a schema could require this). Note that these would be attributes from the MicroXML perspective. Alternatively, a MicroXML parser could insert suitable declarations when it is acting as a front-end for a tool that expects a namespace-well-formed XML infoset.</li>
</ul>
</li>
<li><strong>Comments</strong>. Allowed, but restricted to be HTML5-compatible; HTML5 does not allow the content of a comment to start with - or ->.</li>
<li><strong>Processing instructions</strong>. Not allowed. (HTML5 does not allow processing instructions.)</li>
<li><strong>Data model</strong>. The MicroXML specification should define a single, normative data model for MicroXML documents. It should be as simple as possible:
<ul>
<li>The model for a MicroXML document consists of a single element.</li>
<li>Comments are not included in the normative data model.</li>
<li>An element consists of a name, attributes and content.</li>
<li>A name is a string. It can be split into two parts: a prefix, which is either empty or ends in a colon, and a local name.</li>
<li>Attributes are a map from names to Unicode strings (sequences of Unicode code-points).</li>
<li>Content is an ordered sequence of Unicode code-points and elements.</li>
<li>An element probably also needs to have a flag saying whether it's an empty element. This is unfortunate but HTML5 does not treat an empty element as equivalent to a start-tag immediately followed by an end-tag: elements like <br> cannot have an end-tag, and elements that can have content such as <a> cannot use the empty element syntax even if they happen to be empty. (It would be really nice if this could be fixed in HTML5.)</li>
</ul>
</li>
<li><strong>Encoding</strong>. UTF-8 only. Unicode in the UTF-8 encoding is already used for nearly 50% of the Web. See this <a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">post</a> from Google. XML 1.0 also requires support for UTF-16, but UTF-16 is not in my view used sufficiently on the Web to justify requiring support for UTF-16 but not other more widely used encodings like US-ASCII and ISO-8859-1.</li>
<li><strong>XML declaration</strong>. Not allowed. Given UTF-8 only and no DOCTYPE declarations, it is unnecessary. (HTML5 does not allow XML declarations.)</li>
<li><strong>Names</strong>. What characters should be allowed in an element or attribute name? I can see three reasonable choices here: (a) XML 1.0 4th edition, (b) XML 1.0 5th edition or (c) the ASCII-only subset of XML name characters (same in 4th and 5th editions). I would incline to (b) on the basis that (a) is too complicated and (c) loses too much expressive power.</li>
<li><strong>Attribute value normalization</strong>. I think this has to go. HTML5 does not do attribute value normalization. This means that it is theoretically possible for a MicroXML document to be interpreted slightly differently by an XML processor than by a MicroXML processor. However, I think this is unlikely to be a problem in practice. Do people really put newlines in attribute values and rely on their being turned into spaces? I doubt it.</li>
<li><strong>Newline normalization</strong>. This should stay. It makes things simpler for users and application developers. HTML5 has it as well.</li>
<li><strong>Character references</strong>. Without DOCTYPE declarations, only the five built-in character entities can be referenced. Things could be simplified a little by allowing only hex or only decimal numeric character references, but I don't think this is worthwhile.</li>
<li><strong>CDATA sections</strong>. I think it best to disallow them. (HTML5 allows CDATA sections only in foreign elements.) XML 1.0 does not allow the three-character sequence ]]> to occur in content. This restriction becomes even more arbitrary and ugly when you remove CDATA sections, so I think it is simpler just to require > to always be entered using a character reference in content.</li>
</ul>
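<p>To make the model concrete, here is a minimal sketch (in Python; the class and field names are mine, not part of any proposal) of what the data model described above amounts to:</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Element:
    # A MicroXML document is a single Element. An Element is a name,
    # a map from attribute names to strings, and ordered content that
    # interleaves character data with child elements. The 'empty' flag
    # records whether the element used the empty-element syntax.
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    content: List[Union[str, "Element"]] = field(default_factory=list)
    empty: bool = False

# An article element with mixed content; note that in this model
# xmlns is just another attribute, with no magic name-changing effect:
doc = Element("article", {"xmlns": "http://example.com/ns"},
              ["Hello ", Element("em", content=["world"])])
```

<p>Comments do not appear at all, matching the decision above to exclude them from the normative data model.</p>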
<p>Here's a complete grammar for MicroXML (using the same notation as the XML 1.0 Recommendation):</p>
<pre># Documents
document ::= (comment | s)* element (comment | s)*
element ::= startTag content endTag
| emptyElementTag
content ::= (element | comment | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= name s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
| "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<'|'&')
# Data characters
dataChar ::= char - ('<'|'&'|'>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Comments
comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
# Enforce the HTML5 restriction that comments cannot start with '-' or '->'
commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
# As in XML 1.0
commentContentContinue ::= (char - '-') | ('-' (char - '-'))
# Names
name ::= (simpleName ':')? simpleName
simpleName ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
| [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
| [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
</pre>
<p><strong>XML vs the Web</strong> (2010-11-24)</p>
<p>Twitter and Foursquare recently removed XML support from their Web APIs, and now support only JSON. This prompted Norman Walsh to write an interesting <a href="http://norman.walsh.name/2010/11/17/deprecatingXML">post</a>, in which he summarised his reaction as "<a href="http://en.wikipedia.org/wiki/Meh">Meh</a>". I won't try to summarise his post; it's short and well worth reading.</p>
<p>From one perspective, it's hard to disagree. If you're an XML wizard with a decade or two of experience with XML and SGML before that, if you're an expert user of the entire XML stack (eg XQuery, XSLT2, schemas), if most of your data involves mixed content, then JSON isn't going to be supplanting XML any time soon in your toolbox.</p>
<p>Personally, I got into XML not to make my life as a developer easier, nor because I had a particular enthusiasm for angle brackets, but because I wanted to promote some of the things that XML facilitates, including:</p>
<ul>
<li>textual (non-binary) data formats;</li>
<li>open standard data formats;</li>
<li>data longevity;</li>
<li>data reuse;</li>
<li>separation of presentation from content.</li>
</ul>
<p>If other formats start to supplant XML, and they support these goals better than XML, I will be happy rather than worried.</p>
<p>From this perspective, my reaction to JSON is a combination of "Yay" and "Sigh".</p>
<p>It's "Yay", because for important use cases JSON is dramatically better than XML. In particular, JSON shines as a programming language-independent representation of typical programming language data structures. This is an incredibly important use case and it would be hard to overstate how appallingly bad XML is for this. The fundamental problem is the mismatch between programming language data structures and the XML element/attribute data model. This leaves the developer with three choices, all unappetising:</p>
<ul>
<li>live with an inconvenient element/attribute representation of the data;</li>
<li>descend into XML Schema hell in the company of your favourite data binding tool;</li>
<li>write reams of code to convert the XML into a convenient data structure.</li>
</ul>
<p>By contrast with JSON, especially with a dynamic programming language, you can get a reasonable in-memory representation just by calling a library function.</p>
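<p>A concrete illustration of the difference (a sketch in Python; the data and element names are invented for the example):</p>

```python
import json
import xml.etree.ElementTree as ET

# JSON: one library call yields native dicts, lists, strings and booleans.
data = json.loads('{"name": "buzz", "tags": ["xml", "json"], "active": true}')
assert data["tags"] == ["xml", "json"]
assert data["active"] is True

# XML: the same data squeezed into the element/attribute model still has
# to be walked and converted by hand into a convenient structure.
root = ET.fromstring(
    '<item name="buzz" active="true"><tag>xml</tag><tag>json</tag></item>')
converted = {
    "name": root.get("name"),
    "tags": [t.text for t in root.findall("tag")],
    "active": root.get("active") == "true",   # manual type conversion
}
assert converted == data
```

<p>The XML side shown here is the mildest of the three choices (hand conversion); the schema/data-binding route is considerably more involved.</p>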
<p>Norman argues that XML wasn't designed for this sort of thing. I don't think the history is quite as simple as that. There were many different individuals and organisations involved with XML 1.0, and they didn't all have the same vision for XML. The organisation that was perhaps most influential in terms of getting initial mainstream acceptance of XML was Microsoft, and Microsoft was certainly pushing XML as a representation for exactly this kind of data. Consider SOAP and XML Schema; a lot of the hype about XML and a lot of the specs built on top of XML for many years were focused on using XML for exactly this sort of thing.</p>
<p>Then there are the specs. For JSON, you have a 10-page <a href="http://www.ietf.org/rfc/rfc4627.txt">RFC</a>, with the meat being a mere 4 pages. For XML, you have XML 1.0, XML Namespaces, XML Infoset, XML Base, xml:id, XML Schema Part 1 and XML Schema Part 2. Now you could actually quite easily take XML 1.0, ditch DTDs, add XML Namespaces, xml:id, xml:base and XML Infoset and end up with a reasonably short (although more than 10 pages), coherent spec. (I think Tim Bray even did a draft of something like this once.) But in 10 years the W3C and its membership have not cared enough about simplicity and coherence to take any action on this.</p>
<p>Norman raises the issue of <a href="http://seanmcgrath.blogspot.com/2007/01/mixed-content-trying-to-understand-json.html">mixed content</a>. This is an important issue, but I think the response of the average Web developer can be summed up in a single word: HTML. The Web already has a perfectly good format for representing mixed content. Why would you want to use JSON for that? If you want to embed HTML in JSON, you just put it in a string. What could be simpler? If you want to embed JSON in HTML, just use <script> (or use an alternative HTML-friendly data representation such as <a href="http://microformats.org/">microformats</a>). I'm sure Norman doesn't find this a satisfying response (nor do I really), but my point is that appealing to mixed content is not going to convince the average Web developer of the value of XML.</p>
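<p>For the record, the "just put it in a string" approach amounts to no more than this (a Python sketch with invented field names):</p>

```python
import json

# Mixed content travels as an opaque HTML string inside a JSON value:
comment = {
    "author": "norm",
    "body": "<p>Short and <em>well worth</em> reading.</p>",
}
wire = json.dumps(comment)

# The receiver gets the markup back untouched, ready to hand to an
# HTML parser or insert into a page:
assert json.loads(wire)["body"] == comment["body"]
```
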
<p>There's a bigger point that I want to make here, and it's about the relationship between XML and the Web. When we started out doing XML, a big part of the vision was about bridging the gap from the SGML world (complex, sophisticated, partly academic, partly big enterprise) to the Web, about making the value that we saw in SGML accessible to a broader audience by cutting out all the cruft. In the beginning XML did succeed in this respect. But this vision seems to have been lost sight of over time to the point where there's a gulf between the XML community and the broader Web developer community; all the stuff that's been piled on top of XML, together with the huge advances in the Web world in HTML5, JSON and JavaScript, have combined to make XML be perceived as an overly complex, enterprisey technology, which doesn't bring any value to the average Web developer.</p>
<p>This is not a good thing for either community (and it's why part of my reaction to JSON is "Sigh"). XML misses out by not having the innovation, enthusiasm and traction that the Web developer community brings with it, and the Web developer community misses out by not being able to take advantage of the powerful and convenient technologies that have been built on top of XML over the last decade.</p>
<p>So what's the way forward? I think the Web community has spoken, and it's clear that what it wants is HTML5, JavaScript and JSON. XML isn't going away but I see it being less and less a <em>Web</em> technology; it won't be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.</p>
<p>In the short-term, I think the challenge is how to make HTML5 play more nicely with XML. In the longer term, I think the challenge is how to use our collective experience from building the XML stack to create technologies that work natively with HTML, JSON and JavaScript, and that bring to the broader Web developer community some of the good aspects of the modern XML development experience.</p>
<p><strong>A tour of the open standards used by Google Buzz</strong> (2010-02-12)</p>
<p>The thing I find most attractive about Google Buzz is its <a href="http://googlesocialweb.blogspot.com/2010/02/google-buzz-and-social-web.html">stated commitment to open standards</a>:</p> <blockquote> <p>We believe that the social web works best when it works like the rest of the web — many sites linked together by simple open standards.</p> </blockquote> <p>So I took a bit of time to look over the standards involved. I’ll focus here on the standards that are new to me.</p> <p>One key design decision in Google Buzz is that individuals in the social web should be identifiable by email addresses (or at least strings that look like email addresses).  On balance I agree with this decision: although it is perhaps better from a purist Web architecture perspective to use URIs for this, I think email addresses work much better from a UI perspective.</p> <p>Google Buzz therefore has some standards to address the resulting discovery problem: how to associate metadata with something that looks like an email address. There are two key standards here:</p> <ul> <li><a href="http://www.oasis-open.org/committees/download.php/36031/xrd-1.0-wd13.html">XRD</a>. This is a simple XML format developed by the OASIS XRI TC for representing metadata about a resource in a generic way. 
This looks very reasonable and I am happy to see that it is free of any XRI cruft. It seems quite similar to <a href="http://www.rddl.org/">RDDL</a>. </li> <li><a href="http://code.google.com/p/webfinger/wiki/WebFingerProtocol">WebFinger</a>. This provides a mechanism for getting from an email address to an XRD file.  It’s a two-step process based on HTTP.  First of all you HTTP GET an XRD file from a well-known URI constructed using the domain part of the email address (the well-known URI follows the <a title="" href="http://tools.ietf.org/html/draft-nottingham-site-meta-05">Defining Well-Known URIs</a> and <a href="http://tools.ietf.org/html/draft-hammer-hostmeta-05">host-meta</a> Internet Drafts). This per-domain XRD file provides (amongst other things) a URI template that tells you how to construct a URI for an email address in that domain; dereferencing this URI will give you an XRD representation of metadata related to that email address.  There seem to be some noises about a JSON serialization, which makes sense: JSON seems like a good fit for this problem.  </li> </ul> <p>One of the many interesting things you can do with such a discovery mechanism is to associate a public key with an individual.  There’s a spec called <a href="http://salmon-protocol.googlecode.com/svn/trunk/draft-panzer-magicsig-00.html">Magic Signatures</a> that defines this.  Magic Signatures correctly eschews all the usual X.509 cruft, which is completely unnecessary here; all you need is a simple RSA public key.  My one quibble would be that it invents its own format for public keys, when there is already a perfectly good standard format for this: the DER encoding of the RSAPublicKey ASN.1 structure (defined by RFC 3447/PKCS#1), as used by eg OpenSSL.</p> <p>Note that for this to be secure, WebFinger needs to fetch the XRD files in a secure way, which means either using SSL or signing the XRD file using XML-DSig; in both these cases it is leveraging the existing X.509 infrastructure. 
The key architectural decision here is to use the X.509 infrastructure to establish trust at the domain level, and then to use Web technologies to extend that chain of trust from the domain to the individual. From a deployment perspective, I think this will work well for things like Gmail and Facebook, where you have many users per domain.  The challenge will be to make it work well for things like Google Apps for your Domain, where the number of users per domain may be few.  At the moment, Google Apps requires the domain administrator only to set up some DNS records.  The problem is that DNS isn’t secure (at least until DNSSEC is widely deployed).  Here’s one possible solution: the user’s domain (e.g. jclark.com) would have an SRV record pointing to a host in the provider’s domain (e.g. foo.google.com); the XRD is fetched using HTTP, but is signed using XML-DSig and an X.509 certificate <em>for the user’s domain</em>.  The WebFinger service provider (e.g. Google) would take care of issuing these certificates, perhaps with flags to limit their usage to WebFinger (Google already verifies domain control as part of the Google Apps setup process). The trusted roots here might be different from the normal browser-vendor-determined HTTPS roots.</p> <p>The other part of Magic Signatures is billed as a simpler alternative to XML-DSig which also works for JSON. The key idea here is to avoid the whole concept of signing an XML information item and thus avoid the need for canonicalization.  Instead you sign a byte sequence, which is encoded in base64 as the content of an XML element (or as a JSON string).  I don’t agree with the idea of always requiring base64 encoding of the content to be signed: that seems to unnecessarily throw away many of the benefits of a textual format.  
Instead, when the byte sequence that you are signing is representing a Unicode string, you should be able to represent the Unicode string directly as the content of an XML element or as a JSON string, using the built-in quoting mechanisms of XML (character references/entities and CDATA sections) or JSON. The Unicode string that results from XML or JSON parsing would be UTF-8 encoded before the standard signature algorithm is applied. A more fundamental problem with Magic Signatures is that it loses the key feature of XML-DSig (particularly with enveloped signatures) that applications that don’t know or care about signing can still understand the signed data, simply by ignoring the signature.  I completely sympathize with the desire to avoid the complexity of XML-DSig, but I’m unconvinced that Magic Signatures is the right way to do so. Note that XRD has a dependency on XML-DSig, but it specifies a very limited profile of XML-DSig, which radically reduces the complexity of XML-DSig processing. For JSON, I think i</p> <p>There are also standards that extend  <a href="http://www.ietf.org/rfc/rfc4287.txt">Atom</a>. The simplest are just content extensions:</p> <ul> <li><a href="http://martin.atkins.me.uk/specs/activitystreams/atomactivity">Atom Activity Extensions</a> provides semantic markup for social networking activities (such as "liking" something or posting something). This makes good sense to me. </li> <li><a href="http://video.search.yahoo.com/mrss">Media RSS Module</a> provides extensions for dealing with multimedia content. These were originally designed by Yahoo for RSS. I don't yet understand how these interact with existing Atom/AtomPub mechanisms for multimedia (content/@src, link). </li> </ul> <p>There are also protocol extensions:</p> <ul> <li><a href="http://code.google.com/p/pubsubhubbub/">PubSubHubbub</a> provides a scalable way of getting near-realtime updates from an Atom feed. The Atom feed includes a link to a “hub”.  
An aggregator can then register with the hub to be notified when a feed is updated. When a publisher updates a feed, it pings the hub and the hub then updates all the aggregators that have registered with it.  This is intended for server-based aggregators, since the hub uses HTTP POST to notify aggregators. </li> <li><a href="http://www.salmon-protocol.org/">Salmon</a> makes feed aggregation two-way.  Suppose user A uses only social networking site X and user B uses only social networking site Y. If user A wants to network with B, then typically either A has to join Y or B has to join X.  This pushes the world in the direction of having one dominant social network (i.e. Facebook). In the long-term I don’t think this is a good thing.  The above extensions solve part of the problem. X can expose a profile for A that links to an Atom feed, and Y can use this to provide B with information about A. But there’s a problem.  Suppose B wants to comment on one of A’s entries.  How can Y ensure that B’s comment flows back to X, where A can see it?  Note that there may be another user C on another social networking site Z that may want to see B’s comment on A’s entry. The basic idea is simple: the Atom feed for A exposed by X links to a URI to which comments can be posted.  The heavy lifting of Salmon is done by Magic Signatures.  Signing the Atom entries is the key to allowing sites to determine whether to accept comments. </li> </ul> <p>Google seems to be planning to use the <a href="http://openwebfoundation.org/">Open Web Foundation</a> (OWF) for some of these standards.  Although the OWF’s list of members includes many names that I recognize and respect, I don’t really understand why we need the OWF. It seems very similar to the IETF in its emphasis on individual participation.  
What was the perceived deficiency in the IETF that motivated the formation of the OWF?</p>
<p><strong>Mac Day 1</strong> (2010-02-06)</p>
<p style="clear: both">I decided to dip my toe in the Mac world and buy a Mac mini. If I decide to make the switch, I will probably end up getting a fully tricked out MacBook Pro, but I'm not ready for that yet and I want to wait for the expected MacBook Pro refresh.<br /><br />I've been using it for 24 hours.</p><p style="clear: both"><strong>Likes</strong></p><ul style="clear: both"><li>The hardware is beautiful. The attention to detail is fantastic. Somebody has taken the time to think about even something as mundane as the power cord (it's less stiff than normal power cords and curls nicely). The whole package exudes quality.</li><li>It's reassuring to have something Unix-like underneath.</li><li>Mostly things "just work".</li><li>The dock is quite pretty and intuitive.</li><li>Setup was smooth and simple.</li></ul><p style="clear: both"><strong>Dislikes</strong></p><ul style="clear: both"><li>The menu bar is an abomination. When you have a large screen, it makes no sense to have the menus always at the top left of the screen, which may well be far from the application window.</li><li>On-screen font rendering seems less good than on Windows. I notice this particularly in Safari. It's tolerable, but the Mac is definitely a step down in quality here.</li><li>I was surprised how primitive the application install, update and removal experience was. I miss apt-get. Many updates seem to require a restart.</li><li>I don't like the wired Apple mouse. 
Although it looks nice, clicking is not as easy as with a cheap, conventional mouse, plus the lead is way too short.</li></ul><p style="clear: both"><strong>Minor nits</strong></p><ul style="clear: both"><li>How is a new user supposed to find the web browser? The icon is a compass (like the iPhone icon that gives a real compass) and the tooltip says "Safari".</li><li>A Safari window with tabs looks ugly to me: there's this big band of gray and black at the top of the window.</li><li>Not convinced DisplayPort has sufficient benefits over HDMI to justify a separate standard.</li><li>I couldn't find a way of playing a VCD using the standard applications. I ended up downloading VLC, which worked fine.</li><li>The Magnification preference on the Dock was not on by default, even though it was enabled in the introductory Apple video.</li></ul><p style="clear: both">So far I've installed:</p><ul style="clear: both"><li>NeoOffice</li><li>Adium (didn't work well with MSN, which is the dominant chat system in Thailand, so I will probably remove it)</li><li>Microsoft Messenger</li><li><a href="http://emacsformacosx.com">Emacs</a></li><li>Blogo, which I am using to write this. Is there a better free equivalent to Windows Live Writer?</li><li>VLC</li><li>Skype</li></ul><p style="clear: both">I plan to install:</p><ul style="clear: both"><li>Xcode</li><li>iWork</li></ul><p style="clear: both">Any other software I should install? Should I be using something other than Safari as my Web browser?</p>
<p><strong>XML Namespaces</strong> (2010-01-02)</p>
<p>One of my New Year’s resolutions is to blog more.  
I don’t expect I’ll have much more success with this than I usually do with my New Year’s resolutions, but at least I can make a start.</p> <p>I have been continuing to have a dialog with some folks at Microsoft about <a href="http://msdn.microsoft.com/en-us/data/ee460940.aspx">M</a>.  This has led me to do a lot of thinking about what is good and bad about the XML family of standards.</p> <p>The standard I found it hardest to reach a conclusion about was XML Namespaces.  On the one hand, the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide.  Yet, every step in the process that led to the current situation with XML Namespaces seems reasonable.</p> <ol> <li>We need a way to do distributed extensibility (somebody should be able to choose a name for an element or attribute that won’t conflict with anybody else’s name without having to check with some central naming authority). </li> <li>The one true way of naming things on the Web is with a URI. </li> <li>XML is supposed to be human readable/writable, so we can’t expect people to put URIs in every element/attribute name; we need a shorter human-friendly name and a way to bind that to a URI. </li> <li>Bindings need to nest so that XML Namespace-generating processes can stream, and so that one document can easily be embedded in another. </li> <li>XML Namespace processing should be layered on top of XML 1.0 processing. </li> <li>Content and attribute values can contain strings that represent element and attribute names; these strings should be handled uniformly with names that the XML parser recognizes as element and attribute names. </li> </ol> <p>I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name).  
As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.</p> <p>The need for this duality stemmed from the use of URIs for names. As far as I remember, there was actually no discussion in the XML WG on this point when we were doing XML Namespaces: it was treated as axiomatic that URIs were the right thing to use here. But this is where I believe XML Namespaces went wrong.</p> <p>From a purely practical point of view, the argument for naming namespaces with URIs is that you can do a GET on the URI and get something human- or machine-readable back that tells you about the semantics of the namespace.  I have two responses to this:</p> <ul> <li>This is a capability that is occasionally useful, but it’s not <em>that</em> useful.  The utility here is of a completely different order of magnitude compared to the disutility that results from the prefix/URI duality.  Of course, if you are an RDF aficionado, you probably disagree. </li> <li>You can make names resolvable without using URIs.  For example, a MIME-type <em>X</em>/<em>Y</em> can be made resolvable by having a convention that it resolves to http://www.iana.org/assignments/media-types/<em>X/Y</em>; or, if you have a dotted DNS-style name (e.g. org.example.bar.foo), you can use DNS TXT records to make it resolvable. </li> </ul> <p>From a more theoretical point of view, I think the insistence on URIs for namespaces is paying insufficient attention to the distinction between instances of things and types of things.  The Web works as well as it does because there is an extraordinarily large number of instances of things (ie Web pages) and a relatively very small number of types of things (ie MIME types).  Completely different considerations apply to naming instances and naming types: both the scale and the goals are completely different.  
URIs are the right way to name instances of things on the Web; it doesn’t follow that they are the right way to name types of things.</p> <p>I also have a (not very well substantiated) feeling that using URIs for namespaces tends to increase coupling between XML documents and their processing.  An example is that people tend to assume that you can determine the XML schema for a document just by looking at the namespace URI of the document element.</p> <p>What lessons can we draw from this?</p> <p>For XML, what is done is done.  As far as I can tell, there is zero interest amongst major vendors in cleaning up or simplifying XML. I have only two small suggestions, one for XML language designers and one for XML tool vendors:</p> <ul> <li>For XML language designers, <em>think</em> whether it is really necessary to use XML Namespaces. Don’t just mindlessly stick everything in a namespace because everybody else does.  Using namespaces is not without cost. There is no inherent virtue in forcing users to stick xmlns=”…” on the document element. </li> <li>For XML vendors, make sure your tool has good support for documents that don’t use namespaces.  For example, don’t make the namespace URI be the only way to automatically find a schema for a document. </li> </ul> <p>What about future formats?  First, I believe there is a real problem here and a format should define a convention (possibly with some supporting syntax) to solve the problem. Second, a solution that involves a prefix/URI duality is probably not a good approach.</p> <p>Third, a purely registry-based solution imposes centralization in situations where there’s no need. On the other hand, a purely DNS-based solution puts all extensions on the same level, when in reality from a social perspective extensions are very different: an extension that has been standardized or has a public specification is very different from an ad hoc extension used by a single vendor.  
It’s good if a technology encourages cooperation and coordination.</p> <p>My current thinking is that a blend of registry- and DNS-based approaches would be nice.  For example, you might have something like this:</p> <ul> <li>names consist of one or more components separated by dots; </li> <li>usually names consist of a single component, and their meaning is determined contextually; </li> <li>names consisting of multiple components are used for extensions; the initial component must be registered (the registration process can be as lightweight as adding an entry to a wiki, like the WHATWG does for HTML5 <a href="http://wiki.whatwg.org/wiki/RelExtensions">rel</a> values); </li> <li>there is a well-known URI for each registered initial component; </li> <li>one registered initial component is “dns”: the remaining components are a reversed DNS name (Mark Nottingham had an <a href="http://tools.ietf.org/draft/draft-nottingham-dns-media-tree/draft-nottingham-dns-media-tree-00.txt">ID</a> like this for MIME types); there’s some way of resolving such a name into a URI. </li> </ul> <p>Some other people’s thinking on this that I’ve found helpful: <a href="http://www.mnot.net/blog/2006/04/07/extensibility">Mark Nottingham</a>, <a href="http://www.jenitennison.com/blog/node/124">Jeni Tennison</a>, <a href="http://lists.xml.org/archives/xml-dev/200908/msg00027.html">Tim Bray</a> (and the rest of that xml-dev thread).</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com25tag:blogger.com,1999:blog-3944976411672994427.post-72903592161497934302009-03-23T13:25:00.001+07:002009-03-23T13:26:50.146+07:00Getting involved with M<p>I spent last week in Redmond talking to Microsoft about <a href="http://msdn.microsoft.com/oslo">M and Oslo</a>.  The question at the back of my mind before I went was "Does M really have the potential in the long term to be an interesting and useful alternative to XML?".  My tentative answer is yes.  
Here's why:</p> <ul> <li>Although M, as it is today, is interesting in a number of ways, it is obviously a long way from being a serious alternative to XML (at least for the kinds of application I am interested in). One of my concerns was that I would hear "It's too late to change that". I never did: I was pleasantly surprised that Microsoft are still willing to make fundamental changes to M.</li> <li>Microsoft recognize that M's long-term potential would be severely limited if it was a proprietary, Microsoft-only technology.  I believe they realize that M needs to end up as a genuinely open standard.  They've already made <a href="http://www.douglaspurdy.com/2009/03/20/m-specification-community/">initial steps</a> towards a more open process for M. On the other hand, they don't believe in design by committee.  (And having seen some of the abominations that design by committee can produce, I can certainly sympathise with that.) There's a senior Microsoft guy that gets to make the final call on all design decisions. In other words, it's a benevolent dictator model.  I'm OK with this in principle (although I like it even better when I'm the benevolent dictator).  I think it's worked really well in a number of cases (C# and Python spring to mind).  But obviously it all depends on the qualities of the particular benevolent dictator.  From my interactions so far, he seems like a really smart guy and he's willing to listen.</li> <li>Microsoft is addressing the whole stack.  An alternative to XML needs to provide not only an alternative to XML itself but also alternatives to XSD/RELAX NG and XQuery/XSLT.</li> <li>Microsoft seem to be designing things in a principled way; they are paying attention to the relevant CS theory. For example, ML seems to be a major influence. They are making an effort to produce something clean, elegant, even beautiful, rather than doing just enough to get a product out.</li> <li>Microsoft seem willing to take documents seriously. 
This is a make or break issue for me, because the kind of data I care about most is documents, and M, as it is today, is not useful for documents. This was probably the issue we spent the most time on; I talked a lot about the importance of mixed content. One of the Microsoft team suggested the goal of using M to do the M specification.  I think this sort of dogfooding will be very helpful in ensuring M works well for documents.</li> </ul> <p>Of course, it's early days yet, and it's hard to tell how much leverage M will get, but there's enough potential to make me want to be involved.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com10tag:blogger.com,1999:blog-3944976411672994427.post-6828139669025326342009-01-17T13:21:00.001+07:002009-01-17T14:34:15.131+07:00RELAX NG and xml:id<p>One part of the vision underlying RELAX NG is that validation should not be monolithic: it is not necessary or desirable to have one schema language that can handle every possible kind of validation you might want to do; it is better instead to have multiple specialized languages, each of which does one kind of validation really well. Consistent with this vision, RELAX NG provides only grammar-based validation. There's no implicit claim that other kinds of validation aren't useful and important.</p> <p>One kind of validation that is clearly useful and important and that can't be done by grammars is checking of cross-references. One possibility is to use Schematron for this. The designers of RELAX NG anticipated that there would be a little schema language specialized to this, which would be created as part of the ISO <a href="http://dsdl.org/">DSDL</a> effort (as part 6); this wouldn't be a million miles from the kind of thing that XSD provides with xs:key/xs:unique/xs:keyref. 
Unfortunately this hasn't happened yet.</p> <p>Since DTDs provide ID/IDREF checking and we wanted people to be able to move easily from DTDs to RELAX NG, we felt we had to provide some transitional support for ID/IDREF checking while awaiting the ultimate "right" solution. We therefore provided a separate, optional spec called <a href="http://relaxng.org/compatibility-20011203.html">RELAX NG DTD Compatibility</a>. Amongst other things, this defines a way in which RELAX NG processors can optionally provide DTD-compatible ID/IDREF checking based on the datatypes of attributes declared in the schema. Note that this can't be handled by the XSD datatypes library for RELAX NG, because assignment of types in the schema to values in the instance is not part of the RELAX NG model of validation.</p> <p>When defining RELAX NG DTD compatibility, we took a fairly hard line about being DTD-compatible. In particular, we made it a requirement that you should be able to generate a DTD subset from the RELAX NG schema that would perform the same type assignment that the process defined by the spec would perform. This creates some problems when you use DTD Compatibility in conjunction with wildcards (which of course aren't a DTD feature). For example:</p> <pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }</pre>
<p>will get an error about conflicting ID-types for <code>p/@id</code>.  This is because the schema allows a &lt;p&gt; element to contain a &lt;p&gt; element with an id attribute that doesn't have type ID. Instead you would have to write:</p>
<pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * - p { attribute * { text }*, (any|text)* }</pre>
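<p>To see why the first schema is rejected, it helps to look at a concrete instance (this example is mine, not one from the spec):</p>
<pre>&lt;doc&gt;
  &lt;p id="a"&gt;
    &lt;p id="b"/&gt;
  &lt;/p&gt;
&lt;/doc&gt;</pre>
<p>The outer &lt;p&gt; is matched by the p pattern, so its id attribute gets type ID; but the inner &lt;p&gt; can be matched by the any wildcard, in which case its id attribute is matched by attribute * { text } and gets type CDATA. No single DTD subset can assign both types to p/@id. Excluding p from the wildcard with element * - p removes the ambiguity.</p>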
<p>Several years after the DTD compatibility spec was finished, the W3C came out with the xml:id <a href="http://www.w3.org/TR/xml-id/">Recommendation</a>. The spec mentions RELAX NG in a non-normative appendix and encourages authors "to declare attributes named <code>xml:id</code> with the type <code>xs:ID</code>". Now on the face of it, this seems pretty reasonable advice.  Unfortunately, from the point of view of the RELAX NG DTD Compatibility spec it's precisely the wrong thing to do.  For example, this</p>
<pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:NCName }
any = element * { attribute * { text }*, (any|text)* }</pre>
<p>will work perfectly with RELAX NG with or without DTD compatibility. The XML processor does the xml:id checking, and RELAX NG can ignore ID/IDREFs. But if instead you follow the xml:id Recommendation's suggestion and do:</p>
<pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }</pre>
<p>a RELAX NG validator that implements RELAX NG DTD compatibility will give you an error about conflicting ID-types for <code>p/@xml:id</code>. You might think you could do</p>
<pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * - xml:id { text }*, id?, (any|text)* }</pre>
<p>but that won't work either, because although you can now write a DTD subset that does equivalent type assignment for p, you can't do it for the other elements.</p>
<p>(The xml:id Recommendation also says in the RELAX NG section that "A document that uses <code>xml:id</code> attributes that have a declared type other than <code>xs:ID</code> will always generate xml:id errors.". I don't see why: the xml:id processor is quite likely to be part of the XML parser, which doesn't know anything about RELAX NG, nor does RELAX NG know anything about xml:id.)</p>
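<p>The division of labour is easy to see in code: checking xml:id uniqueness needs nothing from RELAX NG; a generic XML-level pass suffices. Here is a minimal sketch in Python (my illustration, not part of any spec; a real xml:id processor would also check that each value is an NCName, which this skips):</p>

```python
import xml.etree.ElementTree as ET

# ElementTree resolves the built-in xml prefix to this namespace,
# so xml:id appears under the following attribute name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_xml_ids(document):
    """Return xml:id values that occur more than once in the document.

    This is pure XML-level processing: no schema is consulted.
    """
    seen = set()
    duplicates = []
    for element in ET.fromstring(document).iter():
        value = element.get(XML_ID)
        if value is not None:
            if value in seen and value not in duplicates:
                duplicates.append(value)
            seen.add(value)
    return duplicates
```

<p>Feeding it &lt;doc&gt;&lt;p xml:id="a"/&gt;&lt;p xml:id="a"/&gt;&lt;/doc&gt; reports the duplicate "a", and a RELAX NG validator never needs to be involved.</p>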
<p>Back when the RELAX NG DTD Compatibility spec came out, I implemented support for the ID/IDREF checking part of DTD Compatibility in Jing.  I also decided to make Jing enforce this by default. There's a -i switch to turn it off. Before xml:id came along, this seemed to work OK: if a schema author specifies ID/IDREF in a RELAX NG schema then they usually want ID/IDREFs to be checked, and RELAX NG DTD Compatibility was the only thing that could do this checking. With xml:id this no longer works well: if you</p>
<ul>
<li>use xml:id </li>
<li>declare xml:id attributes as type xsd:ID in the RELAX NG schema </li>
<li>use wildcards in your RELAX NG schema </li>
<li>don't use any special options to Jing </li>
</ul>
<p>you are very likely to get an error from Jing.</p>
<p>At first, my plan was simply to change Jing not to enforce DTD Compatibility by default. However, Alex Brown <a href="http://code.google.com/p/jing-trang/issues/detail?id=36#c2">pointed out</a> that this isn't completely satisfactory: people who are coming from DTDs and aren't using xml:id lose the sensible ID/IDREF checking that they might reasonably expect to happen by default. So now I'm thinking that a better solution might be to add two boolean options to Jing, both of which would be enabled by default.</p>
<p>The first option would be to make it a warning rather than an error if the schema does not use ID/IDREF in a DTD-compatible way. (If the schema is DTD-compatible, then duplicate IDs or IDREFs to non-existent IDs would still be errors.)</p>
<p>The second option would tell Jing to be "xml:id aware". This would have several effects.</p>
<ul>
<li>It would require attributes named xml:id to be declared with type xsd:ID (or with the ID type from the datatype library defined by the DTD compatibility spec). This isn't strictly necessary, but it would seem to minimize confusion and be in keeping with the spirit of the xml:id Recommendation. It's slightly tricky to decide what this means with various unusual RELAX NG wildcards. It is obvious that attribute xml:id { text } is an error.  But the following are not all obvious to me:
<ul>
<li>attribute xml:id|id { text } </li>
<li>attribute * { text } </li>
<li>attribute xml:* { text } </li>
<li>attribute *|xml:id { text } </li>
</ul>
</li>
<li>When checking whether you can generate an equivalent DTD subset, xml:id attributes would be ignored. In the terms defined by the RELAX NG DTD Compatibility spec, you would ignore xml:id attributes when determining whether the schema is compatible with the ID/IDREF feature. </li>
<li>When checking uniqueness of IDs, and when checking IDREFs, an attribute named xml:id would always be treated as an ID attribute. </li>
</ul>
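<p>Under these proposed rules, the schema that currently trips up Jing would be accepted as-is (this is my reading of the proposal, not something Jing implements yet):</p>
<pre>start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }</pre>
<p>because xml:id attributes would be ignored when checking for an equivalent DTD subset, while still being checked for uniqueness.</p>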
<p>It might also be a good idea to revise the RELAX NG DTD compatibility spec to be xml:id aware in this way.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com4tag:blogger.com,1999:blog-3944976411672994427.post-88538826547549709342009-01-12T12:20:00.001+07:002009-01-12T14:22:42.103+07:00My new laptop<p>I just bought a new laptop. Previously I was using a Sony VAIO TZ, specifically a TZ38GN:</p> <ul> <li>1.15kg </li> <li>11.1" screen, 1366x768 resolution, LED backlit </li> <li>U7700 (1.33GHz) CPU </li> <li>2Gb RAM </li> <li>48Gb solid state disk </li> <li>DVD writer </li> </ul> <p>I bought this during a period where I was doing mostly email and web browsing and no coding, and it worked beautifully for that.  The screen was fantastic, and it was the perfect size.</p> <p>However, when I started doing some <a href="http://jing-trang.googlecode.com">coding</a>, I began to find it a little bit wimpy: CPU a bit slow, not quite enough RAM, disk too small. Also when I'm at home, I like to use my laptop with an external keyboard and a large 24" (1920x1200) display, and the TZ would only go up to 1680x1050.</p> <p>I also have an old Dell Precision M65, which I bought nearly two years ago:</p> <ul> <li>about 3kg </li> <li>15.4" screen, 1920x1200 screen, not LED backlit </li> <li>T7600 (2.33GHz) CPU </li> <li>4Gb RAM (but the chipset only allowed 3.25Gb to be used) </li> <li>160Gb 7200rpm disk </li> <li>DVD writer </li> </ul> <p>This has enough horsepower for coding, but after getting used to the TZ, I found lugging the Dell around to be a real pain, and the screen is much less nice than the VAIO's. </p> <p>I ended up using the Dell at my desk, and the TZ elsewhere.  
But having things divided between two machines started to be a real pain, so I wanted to find something with the power of the Dell (at least when connected to a monitor) but the size and weight of the TZ.</p> <p>I ended up choosing a Sony VAIO Z, specifically a <a href="http://www.sony.co.th/product/vgn-z26sn?site=hp_en_TH_i">VGN-Z26SN</a>:</p> <ul> <li>1.46kg </li> <li>13.1" screen, 1600x900 resolution, LED backlit </li> <li>P8600 (2.4GHz) CPU </li> <li>3Gb RAM </li> <li>320Gb 7200rpm disk </li> <li>DVD writer </li> </ul> <p>They had another model (the Z27), which was 25% more expensive, and offered a 2.53Ghz CPU, 4Gb RAM and a Blu-ray drive. I suspected I wouldn't be able to use much of the extra 1Gb RAM, and I didn't have much use for the Blu-ray drive, but the deciding factor was that, bizarrely, the Z27 didn't come with a proper Thai keyboard.</p> <p>The VAIO  Z cost me about 80,000 baht (about US$2250 at today's exchange rate), which is not as unreasonable as VAIOs usually are.  By comparison, I think the TZ cost me about 100,000 baht (mainly because of the SSD), and the Dell was about 180,000 baht (it was a very high-end machine at the time).</p> <p>Overall I'm reasonably happy with it.  The screen is great and the battery life is OK, although not as good as the TZ's. The keyboard is OK, but I wish it was backlit. Also it doesn't have separate Home/End/PgUp/PgDn keys: you have to press the Fn key in conjunction with the arrow keys. It only has two USB ports, which means you will probably need a USB hub. I discovered one incredibly annoying feature: although the CPU has virtualization support, the BIOS prevents you from enabling it.  
I can understand Sony's not supporting virtualization in the sense of not providing support if you have problems when you use it, but it's hard to accept a policy that actively prevents a customer making use of an important feature of the hardware they have bought.</p> <p>I am afraid I don't have any useful information about how well Linux runs on it, because I have been using Vista.  This might seem like a strange thing for me to do.  I'll explain in a separate post. </p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com2tag:blogger.com,1999:blog-3944976411672994427.post-48821140531284287322008-12-14T08:30:00.001+07:002008-12-14T08:30:57.920+07:00Saxon performance<p>Michael Kay has <a href="http://saxonica.blogharbor.com/blog/_archives/2008/12/13/4019383.html" target="_blank">posted</a>  a link to a new <a href="http://sites.computer.org/debull/A08dec/saxonica.pdf" target="_blank">paper</a>  of his on Saxon performance. Interesting not just for XQuery/XSLT performance but also for XML performance in general.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com0tag:blogger.com,1999:blog-3944976411672994427.post-16300928121804666312008-11-23T18:26:00.001+07:002008-11-23T18:26:42.943+07:00MGrammar<p><a href="http://msdn.microsoft.com/en-us/library/dd129869.aspx" target="_blank">MGrammar</a> (Mg) is another key part of <a href="http://msdn.microsoft.com/oslo" target="_blank">Oslo</a>.  My first reaction to Mg was: Yet another lex/yacc clone. Yawn. But now that I've looked at it a bit more closely, there are some features I find quite interesting.</p> <p>I have always found parser generators to be a bit of a pain. I think one of the reasons is that the input to the parser generator typically mixes together a declarative specification of the grammar with procedural code that does something with the parse.  
There's not a clean separation between my code and the generated code.</p> <p>Mg works in a rather different way.  The specification in Mg is purely declarative.  So how does it actually do anything useful?  It constructs a labeled tree (actually a DAG) that represents the result of the parse.  Mg has language constructs that allow you to control what tree gets constructed, but there's a reasonable default.</p> <p>Another big difference is that it works much more dynamically than a typical parser generator.  The generated parse tree is not strongly typed: it's just nodes with textual labels.  You don't have to compile the parser ahead of time (although you can if you want).  You can just give the library your grammar and it will compile it into an efficient form; you can then apply that compiled form to an input stream and get a parse tree.</p> <p>The overall programming experience seems to be much more like using a regex library: at runtime, the regex gets compiled into an executable form; executing the regex tests whether the input matches the regex; and if it does match, you get structured data out (typically an array of strings, one for each captured group).  I think this sort of programming model is much more convenient; it's particularly nice for a dynamic language which can potentially deal very conveniently with the untyped parse tree that you get from Mg.</p> <p>Another interesting feature of Mg is that it has modules.  The module system together with the fact that the grammar doesn't include any procedural code opens up the possibility of reusing grammars for languages and fragments of languages. It's hard to say how useful this will actually be in practice.</p> <p>There's one more feature that's worth mentioning.  You can attach annotations (called attributes) to production rules.  These annotations can have structure (the same kind of structure as the parse tree).  
For example, the annotations might tell a text editor how to provide syntax aware editing features: in the Microsoft implementation, there's an annotation that the editor uses to highlight keywords.</p> <p>The obvious missing feature is that there's no way to automatically go from the parse tree back into the textual form.  I assume Microsoft will fix that.</p> <p>I noticed only one thing that was really broken: Mg supports Unicode, but in cases where you need to specify a single character, it requires a 16-bit code unit representing part of the UTF-16 encoding of a character, rather than a code point (in the range 0 to 0x10FFFF).  This is just wrong.  It's slightly more work to do it properly, but you really can't avoid it.  For example, you need it to properly support Unicode blocks/categories, since these blocks/categories are blocks/categories of code points not code units. </p> <p>I hope we'll see some open source implementations of Mg: perhaps one in C/C++ hooked up to SpiderMonkey/V8/Python/Ruby/GNU Emacs and one in Java hooked up to Rhino/JRuby/Groovy/NetBeans/Eclipse.</p> <p>There's a bigger issue lurking here.  I think Microsoft see Mg as more than just a nifty library.  It's part of their vision for a next generation application development platform, where developers become more productive by using custom DSLs rather than XML. I have mixed feelings about this. The syntax for M itself is defined using Mg, and Microsoft seems to be designing things so that much of the tooling that they build for M can easily be applied to anything with a Mg-defined grammar. The tooling seems to have quite an introspective feel to it, like a sophisticated Lisp or Smalltalk environment.  The hacker side of me finds this quite cool.</p> <p>On the other hand, gratuitous syntactic diversity is not a feature.  I remember in the early days of XML, Tim Bray used to start his pitch for XML by showing a whole bunch of widely different Linux config file formats.  
It was quite compelling: the lack of consistency was obviously confusing and pointless. Now I don't think anybody would suggest that XML is the right format for everything.  I wouldn't want to write programs in XML (except sometimes for XSLT :-), and after writing schemas in RELAX NG compact syntax for a while, I wouldn't want to have to go back to writing them in XML.  How do you make your platform encourage developers to use a DSL where it makes sense, and discourage them when it doesn't? Up to now, part of the answer was that libraries made it a bit easier to use XML (or some other standard format) rather than some completely custom syntax; so unless there was a substantial benefit from a custom syntax, developers wouldn't bother.  But if your platform provides tools that make it really easy to design new syntaxes, how do you avoid ending up in a situation where every application has its own private DSL?  It doesn't help users if they have to learn a new syntax for every application. Certainly when I think about interchanging data on the Web, the fewer formats the better; I definitely don't want every application to be using its own completely custom syntax.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com5tag:blogger.com,1999:blog-3944976411672994427.post-53973871742571894882008-11-23T10:44:00.001+07:002008-11-23T11:45:05.125+07:00Some thoughts on the Oslo Modeling Language<p>Microsoft recently introduced <a href="http://msdn.microsoft.com/oslo" target="_blank">Oslo</a>.  Microsoft <a href="http://channel9.msdn.com/pdc2008/TL23/" target="_blank">seems</a> to have designed Oslo to replace some of the things it now uses XML for.  Since Microsoft have been one of the biggest supporters of XML, I think it's worth looking at what they've come up with.</p> <p><strong>Overview of M</strong></p> <p>The key part of Oslo is the "M" language, which Microsoft calls a "modeling language".  
This integrates quite a broad range of functionality:</p> <ul> <li>an abstract data model (analogous to the XML Infoset); M uses the term "value" to describe an instance of this abstract data model </li> <li>a syntax for writing values (analogous to XML 1.0) </li> <li>a type system which provides language constructs for describing constraints on values (analogous to XML Schema) </li> <li>language constructs for querying values (analogous to XQuery) </li> </ul> <p>I guess this is a similar range of functionality to SQL (although there are no constructs for doing updates yet).</p> <p>I'm going to give a brief overview of M as I understand it.  This is based purely on the <a href="http://msdn.microsoft.com/en-us/library/dd285282.aspx" target="_blank">spec</a>. Please comment if I've misunderstood anything.</p> <p>Let's start with the abstract data model. There are three kinds of value</p> <ul> <li>simple values </li> <li>collections </li> <li>entities </li> </ul> <p>Simple values are what you would expect (Unicode strings, numbers, booleans, dates etc). An important simple value is null, which is distinct from any other simple value.</p> <p>A collection is a bag of values: it is unordered, it can have duplicates and it can contain arbitrary values.</p> <p>An entity is a map from Unicode strings to values; each string/value pair is called a field.</p> <p>A key feature of the abstract data model is that it's a graph rather than a tree.  When the value of a field is an entity, the field conceptually holds a reference to that entity.  
I believe the same goes for collections, but I haven't completely grokked how identity works in M.</p> <p>Simple values have the sort of syntax you would expect:</p> <ul> <li>"foo" is a Unicode string (M calls string values Text) </li> <li>123 is an integer </li> <li>true is the boolean true value </li> <li>2008-11-22 is a date </li> <li>null is the null value </li> </ul> <p>A collection is written in curly brackets with items separated by commas: { 1, 2, 3 } is a collection of three integers.</p> <p>An entity uses a {field-name = field-value} syntax: { x = 2, y = 3 } is an entity with two fields. If the field name isn't an identifier, you have to surround it in square brackets: { [x coordinate] = 2, [y coordinate] = 3}.</p> <p>You can label an entity or a collection and then use that label to specify a reference to that entity or collection. You label a collection just by putting an identifier before the opening curly brace.  So</p> <p>  { jack { name = "Jack", spouse = jill }, jill { name = "Jill", spouse = jack } }</p> <p>would be a collection of two entities, where the spouse field of each entity is a reference to the other entity, and</p> <p>   loop { loop }</p> <p>would denote a collection with a single member, which is a reference to itself.</p> <p>A type in M can be thought of as being the collection of the values that conform to the type, so one way to specify a type is just to explicitly specify that collection. So, we could define a Boolean type like this:</p> <p>  type Logical { true, false }</p> <p>In fact, M has predefined types corresponding to each kind of simple value.  For entities, you specify a type by specifying the fields. So if we define Point like this:</p> <p>  type Point { x; y; }</p> <p>then any entity that has both an x field and a y field is a Point.  Note that entity types are open: an entity that has a z field as well as an x and a y field would conform to the Point type. 
You can also specify types for the fields:</p> <p>type Point { <br />    x : Number; <br />    y : Number; <br />}</p> <p>Types can be recursive:</p> <p>type Node { <br />  left : Node; <br />  right : Node; <br />}</p> <p>Collections can be specified using the * and + operators: Integer* is a collection of zero or more integers. There's also a ? operator, but it doesn't have anything to do with collections: T? is equivalent to the union of T and { null }.</p> <p>So far, nothing very exciting. What makes M interesting is that it has a rich functional expression language that can be used for describing instances, types and queries. The analogy in the XML world would be the way XPath is used in instances via XPointer, in schema languages like Schematron and XSD 1.1 and in XQuery.</p> <p>As well as the obvious kinds of expressions on simple values, you can have expressions on collections:</p> <ul> <li>C | D is the union of C and D </li> <li>C &amp; D is the intersection of C and D </li> <li>v in C tests whether v is a member of C </li> <li>C &lt;= D tests whether C is a subset of D </li> </ul> <p>Most importantly,</p> <p>  C where E</p> <p>returns a collection containing those members of C for which E evaluates to true; the keyword value is bound to the member of C for which E is being evaluated.  So</p> <p>{ 1, 2, 3, 4 } where value % 2 == 0</p> <p>returns { 2, 4 }.</p> <p>What's really powerful is that expressions can be used on types in a similar way to how they are used on collections.  
If we want a type corresponding to even numbers, we can just do:</p> <p>  type Even : Integer where value % 2 == 0;</p> <p>You can apply a where expression to an entity type definition to constrain the fields:</p> <p>type ZeroSum { <br />   x: Number; <br />   y: Number; <br />} where x + y == 0;</p> <p>You can also do something like:</p> <p>type Name { <br />  firstName: Text; <br />  lastName: Text; <br />} <br />type Person { <br />  dateOfBirth: Date; <br />} where value in Name;</p> <p>Fields can have default values, for example:</p> <p>type Person { <br />  dateOfBirth : Date? = null; <br />}</p> <p>In fact, when the type of a field is nullable (has null as one of its possible values), it automatically gets a default value of null. Similarly, when the type of a field is a collection that may be empty, it automatically gets a default value of an empty collection. This is quite clever: it makes fields declared as ? or * work nicely for both the provider and the consumer; the provider can leave the fields out, as would be natural for the provider, but the consumer always gets a sensible value for the field.</p> <p>So we've seen how entity types can specify a default value.  But how does an entity get connected with an entity type so that the default value can be created?   Typing in M is structural.  An entity in M doesn't inherently belong to any type other than Entity.  Like a pattern in RELAX NG, a type in M describes a possible shape for a value, which any given value may or may not have, but a value isn't created with any particular type (other than one of the fundamental built in types).</p> <p>M solves this by having a notion of type "ascription".  An expression "v : T" ascribes the type T to the value v. M is functional, so this doesn't mutate v; rather it returns a new value that has an augmented view of v as specified by T.  
So whereas</p> <p>  { firstName: "James", lastName: "Clark" }.dateOfBirth</p> <p>will throw an error,</p> <p>  ({ firstName: "James", lastName: "Clark" } : Person).dateOfBirth</p> <p>will return null (assuming Person declares dateOfBirth as nullable). (I would guess type ascription also does coercions on simple values.) Note that this implies that types in M aren't simply collections of values. M also allows entity types to have computed fields, which are virtual fields whose values are computed from the value of other fields.</p> <p>The last area of M I want to look at is identity.  You can specify what determines the identity for a entity:</p> <p>type Person { <br />  firstName: Text; <br />  lastName: Text; <br />  dateOfBirth: Date?  <br />} where identity(firstName, lastName);</p> <p>This means you can't have two Persons with the same firstName and lastName.  The obvious question is: within what scope?</p> <p>To answer, we have to look at the top-level layer that M provides.  Values don't exist in isolation.  Everything in M has to exist within a module.  A module is a sort of top-level static entity.  The fields of a module are called "extents".  The scope of an identity constraint is an extent.  So, if you do:</p> <p>module Employee { <br />  type Person { <br />    firstName: Text; <br />    lastName: Text; <br />    dateOfBirth: Date?  <br />  } where identity(firstName, lastName); <br />  Persons: Person+ <br />}</p> <p>then it would be an error if within the Persons extent, there were two Person entities with the same firstName and lastName. There's also a way to automatically give a field a unique id: </p> <p>type Name { <br />  id : Integer32 = AutoNumber();  <br />  firstName: Text; <br />  lastName: Text; <br />} where identity id;</p> <p>I don't really have anything to say about the query part: it's quite close to LINQ.</p> <p><strong>My thoughts about M</strong></p> <p>There's quite a lot about M that I like.  Mostly it seems pretty clean.  
The type system is powerful. Structural typing is clearly the right approach for something like M.  I like the way that constraints that can be checked statically are seamlessly blended with constraints that will need to be checked dynamically.  Static type checking is good, but there's no need to rub your users' faces in the limitations of your static type checker. (It's interesting to see C# 4.0 moving in a similar direction.)</p> <p>It's obviously very early days for M, and there's still lots of scope for improvement.  Microsoft's initial implementation of M targets SQL and works a bit like a database.  But clearly there's potential for using something like M in quite different contexts, e.g. for exchanging data on the Web (like what I talked about some time ago with <a href="http://blog.jclark.com/2007/04/do-we-need-new-kind-of-schema-language.html" target="_blank">TEDI</a>).  But several aspects of the current design seem to reflect the initial database focus. For example, the current top-level wrapper of modules/entities makes sense for a database application of M, but wouldn't work so well if you were using M for exchanging data on the Web.</p> <p>The spec needs some fleshing out.  Microsoft's current implementation is a long way from implementing the full language.  I am skeptical whether the language as currently specified can be fully implemented.  For example, can you implement the test for whether one type is a subtype of another type so that it works in non-exponential time for two arbitrary types?</p> <p>I see several major things missing in M, whose absence might be acceptable for a database application of M, but which would be a significant barrier for other applications of M.  Most fundamental is order. M has two types of compound value, collections and entities, and they are both unordered.  In XML, unordered is the poor relation of ordered.  Attributes are unordered, but attributes cannot have structured values. 
Elements have structure but there's no way <em>in the instance</em> to say that the order of child elements is not significant.  The lack of support for unordered data is clearly a weakness of XML for many applications.  On the other hand, order is equally crucial for other applications.  Obviously, you can fake order in M by having index fields in entities and such like.  But it's still faking it.  A good modeling language needs to support both ordered and unordered data in a first-class way.  This issue is perhaps the most fundamental because it affects the data model.</p> <p>Another area where M seems weak is identity. In the abstract data model, entities have identity independently of the values of their fields.  But the type system forces me to talk about identity in an SQL-like way by creating artificial fields that duplicate the inherent identity of the entity. Worse, scopes for identity are extents, which are flat tables.  Related to this is support for hierarchy. A graph is a more general data model than a tree, so I am happy to have graphs rather than trees. But when I am dealing with trees, I want to be able to say that the graph is a tree (which amounts to specifying constraints on the identity of nodes in the graph), and I want to be able to operate on it as a tree; in particular I want hierarchical paths.</p> <p>One of the strengths of XML is that it handles both documents and data. This is important because the world doesn't neatly divide into documents and data.  You have data that contains documents and documents that contain data.  The key thing you need to model documents cleanly is mixed text. How are you going to support documents in M?  The lack of support for order is a major problem here, because ordered is the norm for documents.</p> <p>A related issue is how M and XML fit together. I believe there's a canonical way to represent an M value as an XML document. But if you have data that's in XML how do you express it in M? 
In many cases, you will want to translate your XML structure into an M structure that cleanly models your data.  But you might not always want to take the time to do that, and if your XML has document-like content, it is going to get ugly.  You might be better off representing chunks of XML as simple values in M (just as in the JSON world, you often get strings containing chunks of HTML).  M should make this easy.  You could solve this elegantly with RELAX NG (I know this isn't going to happen given Microsoft's commitment to XSD, but it's an interesting thought experiment): provide a function that allows you to constrain a simple value to match a RELAX NG pattern expressed in the compact syntax (with the compact syntax perhaps tweaked to harmonize with the rest of M's syntax) and use M's repertoire of simple types as a RELAX NG datatype library.</p> <p>Finally, there's the issue of standardization.  The achievement of XML in my mind isn't primarily a technical one.  It's a social one: getting a huge range of communities to agree to use a common format.  Standardization was the critical factor in getting that agreement.  XML would not have gone anywhere as a single-vendor format. It was striking that the talks about Oslo at the PDC made several mentions of open source, and how Microsoft was putting the spec under its Open Specification Promise so as to enable open source implementations, but no mentions of standardization.  I can understand this: if I were Microsoft, I certainly wouldn't be keen to repeat the XSD or OOXML experience. But open source is not a substitute for standardization.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com11tag:blogger.com,1999:blog-3944976411672994427.post-48843444619673663352008-11-17T08:55:00.001+07:002008-11-17T08:55:33.316+07:00What's allowed in a URI?<p>Java 1.4 introduced the java.net.URI class, which provides RFC 2396-compliant URI handling. I thought I should try to fix Jing and Trang to use this. 
So I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.</p> <p>It's convenient to begin with XLink.  <a href="http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators" target="_blank">Section 5.4</a> requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as</p> <blockquote> <p>all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732</p> </blockquote> <p>If we look at <a href="http://tools.ietf.org/html/rfc2396#section-2.4.3" target="_blank">2.4.3 of RFC 2396</a> (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:</p> <ul> <li>control     = <US-ASCII coded characters 00-1F and 7F hexadecimal> </li> <li>space       = <US-ASCII coded character 20 hexadecimal> </li> <li>delims      = "<" | ">" | "#" | "%" | <"> </li> <li>unwise     = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" </li> </ul> <p><a href="http://tools.ietf.org/html/rfc2732#section-3" target="_blank">Section 3 of RFC 2732</a> (which modifies RFC 2396 to handle IPv6 addresses)  does indeed allow square brackets by removing them from the 'unwise' set.</p> <p>Putting these all together, we can distinguish the following categories of characters that are allowed by XLink but not allowed by RFC 2396/RFC 2732</p> <ol> <li>C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents </li> <li>space (#x20) </li> <li>disallowed ASCII graphic characters, specifically: <>"{}|\^` </li> <li>delete (#x7F) </li> <li>non-ASCII Unicode characters, excluding surrogates #x80-#xD7FF, #xE000-#x10FFFF (XML does not allow #xFFFE and #xFFFF) </li> </ol> <p>Looking at the various XML-related specs, things seem to be nicely 
aligned:</p> <ul> <li><a href="http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent" target="_blank">XML 1.0 First Edition</a> required escaping just for category 5, but <a href="http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent" target="_blank">XML 1.0 Second Edition</a> got fixed to use the same wording as XLink </li> <li><a href="http://www.w3.org/TR/2001/REC-xmlbase-20010627/#escaping" target="_blank">XML Base</a> uses the same wording as XLink </li> <li><a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI" target="_blank">XML Schema Part 2</a> references XLink (in specifying xs:anyURI) </li> <li><a href="http://relaxng.org/spec-20011203.html#href" target="_blank">RELAX NG</a> references XLink </li> </ul> <p>XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). That seems like a bug to me.  Erratum <a href="http://www.w3.org/1999/11/REC-xslt-19991116-errata/#E39" target="_blank">E39</a> adds the following to the first paragraph of the spec:</p> <blockquote> <p>For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.</p> </blockquote> <p>This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how it's "Thus, ". In any case, <a href="http://www.w3.org/TR/xslt20/#uri-references" target="_blank">XSLT 2.0</a> gets it right: it references xs:anyURI.</p> <p>RFC 2396 has been updated by <a href="http://tools.ietf.org/html/rfc3986" target="_blank">RFC 3986</a>.  
This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.</p> <p>Next we have the IRI spec, <a href="http://tools.ietf.org/html/rfc3987" target="_blank">RFC 3987</a>. This defines:</p> <pre> ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD</pre>
<p>It adds ucschar to the set of unreserved characters and adds iprivate to what's allowed in the query of a URI. The characters in my category 5 that are in neither ucschar nor iprivate are as follows:</p>
<ul>
<li>C1 controls: #x80 - #x9F </li>
<li>the 66 Unicode noncharacters: #xFDD0 - #xFDEF, and any code point whose bottom 16 bits are FFFE or FFFF </li>
<li>Specials: #xFFF0 - #xFFFD; these fall into three groups, unassigned specials (#xFFF0 - #xFFF8), annotation characters (#xFFF9 - #xFFFB) and replacement characters (#xFFFC - #xFFFD) </li>
<li>Language tags: #xE0000 - #xE0FFF </li>
</ul>
<p>I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters.  You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.</p>
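<p>The ucschar and iprivate productions above can be transcribed directly into code. Here is a minimal sketch in Java (my own transcription of the ABNF, not code from any library; class and method names are mine):</p>

```java
public class Rfc3987 {
    // ucschar from RFC 3987: the non-ASCII code points allowed as
    // unreserved characters in an IRI. Planes 1 through E allow
    // %xP0000-PFFFD, minus the language-tag block %xE0000-E0FFF;
    // the BMP ranges are listed explicitly.
    static boolean isUcschar(int c) {
        return inRange(c, 0xA0, 0xD7FF)
                || inRange(c, 0xF900, 0xFDCF)
                || inRange(c, 0xFDF0, 0xFFEF)
                || (inRange(c, 0x10000, 0xEFFFD)
                    && (c & 0xFFFF) <= 0xFFFD          // exclude noncharacters xxFFFE/xxFFFF
                    && !inRange(c, 0xE0000, 0xE0FFF)); // exclude language tags
    }

    // iprivate from RFC 3987: private-use code points, allowed only
    // in the query part of an IRI.
    static boolean isIprivate(int c) {
        return inRange(c, 0xE000, 0xF8FF)
                || inRange(c, 0xF0000, 0xFFFFD)
                || inRange(c, 0x100000, 0x10FFFD);
    }

    private static boolean inRange(int c, int lo, int hi) {
        return c >= lo && c <= hi;
    }
}
```

<p>Notice how the specials, the noncharacters, and the language-tag block all have to be carved out explicitly; the exclusions discussed below show up as the extra conditions in isUcschar.</p>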
<p>RFC 3987 also has the following in <a href="http://tools.ietf.org/html/rfc3987#section-3.2" target="_blank">Section 3.2</a>:</p>
<blockquote>
<p>Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`"</p>
</blockquote>
<p>Those characters correspond to my categories 2 and 3. Overall there are a lot of subtle differences between IRIs and the thing that is currently allowed by XML specs.</p>
<p>Fortunately there is a <a href="http://tools.ietf.org/html/draft-duerst-iri-bis-04" target="_blank">draft of a new version of the IRI spec</a>. This introduces Legacy Extended IRI (LEIRI) references, which defines ucschar as:</p>
<pre> ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF</pre>
<p>which exactly corresponds to my categories 1 to 5.</p>
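<p>Assuming the draft's ABNF above, the whole LEIRI ucschar production collapses into a one-method predicate. A sketch in Java (my own transcription, not code from the draft or from Jing):</p>

```java
public class Leiri {
    // ucschar for Legacy Extended IRI references: every Unicode code
    // point except the ASCII characters that RFC 3986 uses or reserves,
    // the surrogate range, and #xFFFE/#xFFFF.
    static boolean isLeiriUcschar(int c) {
        return " <>\"{}|\\^`".indexOf(c) >= 0  // space and the disallowed ASCII graphics
                || (c >= 0x00 && c <= 0x1F)     // C0 controls
                || (c >= 0x7F && c <= 0xD7FF)   // delete and non-ASCII below the surrogates
                || (c >= 0xE000 && c <= 0xFFFD) // above the surrogates, minus #xFFFE/#xFFFF
                || (c >= 0x10000 && c <= 0x10FFFF); // supplementary planes
    }
}
```

<p>Compare this with the sixteen-line ucschar production in RFC 3987 itself: the legacy definition is simpler precisely because it stops trying to exclude the weird characters.</p>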
<p>LEIRIs seem like a very useful innovation.  XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.</p>
<p>Finally we are ready to look at <a href="http://java.sun.com/javase/6/docs/api/java/net/URI.html" target="_blank">java.net.URI</a>. This allows URIs to contain an additional set of "other" characters which consist of non-ASCII characters with the exception of:</p>
<ul>
<li>C1 controls (#x80 - #x9F) </li>
<li>Characters with a category of Zs, Zl or Zp </li>
</ul>
<p>This means that if you want to give an LEIRI such as an XML system identifier to java.net.URI you first need to percent encode any of the following:</p>
<ul>
<li>the following ASCII graphic characters: <>"{}|\^` </li>
<li>C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents </li>
<li>space (#x20) </li>
<li>delete (#x7F) </li>
<li>C1 controls (#x80 - #x9F) </li>
<li>Characters with a category of Zs, Zl or Zp </li>
</ul>
<p>All except the first can be tested with Character.isISOControl(c) || Character.isSpaceChar(c).</p>
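<p>Putting the list above into code, here is a sketch of the required pre-encoding (my own illustration, not code from Jing; it assumes non-ASCII characters are percent-encoded via their UTF-8 bytes, as the URI and IRI specs specify):</p>

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class LeiriToUri {
    private static final String DISALLOWED_GRAPHIC = "<>\"{}|\\^`";

    // True for exactly the characters listed above: the disallowed ASCII
    // graphics, C0 controls, space, delete, C1 controls, and Zs/Zl/Zp.
    static boolean needsEscaping(int c) {
        return DISALLOWED_GRAPHIC.indexOf(c) >= 0
                || Character.isISOControl(c)   // #x00-#x1F and #x7F-#x9F
                || Character.isSpaceChar(c);   // categories Zs (includes #x20), Zl, Zp
    }

    // Percent-encode just those characters, using their UTF-8 bytes.
    static String escape(String leiri) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < leiri.length(); i += Character.charCount(leiri.codePointAt(i))) {
            int c = leiri.codePointAt(i);
            if (needsEscaping(c)) {
                byte[] utf8 = new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                sb.appendCodePoint(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // A system identifier containing a space and curly braces is a
        // legal LEIRI but is rejected by java.net.URI until escaped.
        URI uri = new URI(escape("http://example.com/a b{c}.xml"));
        System.out.println(uri); // http://example.com/a%20b%7Bc%7D.xml
    }
}
```

<p>Note that needsEscaping leaves ordinary non-ASCII characters alone, which matters for the point about human-intelligible IRIs made below.</p>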
<p>Note that you don't want to blindly percent encode all non-ASCII characters because that will unnecessarily make IRIs containing non-ASCII characters unintelligible to humans.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com5tag:blogger.com,1999:blog-3944976411672994427.post-32981625369419756902008-11-09T11:18:00.001+07:002008-11-09T11:58:41.036+07:00Working on Jing and Trang<p>I've been back to working on <a href="http://www.thaiopensource.com/relaxng/jing.html">Jing</a> and <a href="http://www.thaiopensource.com/relaxng/trang.html">Trang</a> for about a month now. It would be something of an understatement to say that they were badly in need of some maintenance love.</p> <p>I started a <a href="http://jing-trang.googlecode.com">jing-trang project on Google Code</a> to host future development. There are new releases of both Jing and Trang in the <a href="http://code.google.com/p/jing-trang/downloads/list">downloads</a> section of the project site. These have been out for about 10 days, and there have been a reasonable number of downloads, and no reports of any major bugs, so I think these should be fairly solid.  (Interestingly, the number of downloads of Trang have been running at about twice those of Jing.)</p> <p>It's been 5 years since the last release, so what new and exciting features are there? Well, actually, in the current release, none.  My work for that release was focused on two areas:</p> <ul> <li>getting things to work properly with current versions of Java and other dependencies; </li> <li>getting the source code structure and build system into reasonable shape. </li> </ul> <p>The second was a lot of work. The code base for Jing and Trang had evolved over a number of years, incorporating various bits of functionality that were independent of each other to various degrees; its structure only made any sense from a historical perspective.  
The <a href="http://code.google.com/p/jing-trang/wiki/SourceOverview">current structure</a> is now nicely modular.  I converted my CVS repository to subversion before I started moving things around, so the <a href="http://code.google.com/p/jing-trang/source/list">complete history</a> is available in the project repository. For people who want to stay on the bleeding edge, it's now <a href="http://code.google.com/p/jing-trang/wiki/HowToBuildFromSource">really easy</a> to check out and build from subversion.</p> <p>My natural tendencies are much more to the cathedral than to the bazaar, but I'm trying to be more open.  I'm pleased to say that there are already two committers in addition to myself. There's a commercial XML editor called <a href="http://www.oxygenxml.com/" target="_blank"><oXygen/></a>, which uses Jing and Trang to support RELAX NG. The main guy behind that, George Bina, has made a number of useful improvements. In particular, he upgraded Jing's support for the <a href="http://thaiopensource.com/relaxng/nrl.html" target="_blank">Namespace Routing Language</a> to its ISO-standardized version, which is called <a href="http://nvdl.org/" target="_blank">NVDL</a> (you might want to start with this <a href="http://jnvdl.sourceforge.net/tutorial.html" target="_blank">NVDL tutorial</a> rather than the spec).  This is now on the <a href="http://code.google.com/p/jing-trang/source/browse/#svn/trunk/mod/nvdl/src/main/com/thaiopensource/validate/nvdl" target="_blank">trunk</a>. 
The other committer is Henri Sivonen, who has been using Jing in his <a href="http://validator.nu/" target="_blank">Validator.nu</a> service.</p> <p>My goals for the next release are:</p> <ul> <li>complete support for NVDL (I think the only missing feature is inline schemas) </li> <li>support for the ISO-standardized version of <a href="http://www.schematron.com/" target="_blank">Schematron</a> </li> <li>customizable resource resolution support (so that, for example, you can use <a href="http://www.oasis-open.org/committees/entity/" target="_blank">XML catalogs</a>) </li> <li>support for the standard JAXP XML validation API (javax.xml.validation) </li> <li>more code cleanup </li> </ul> <p>Please use the <a href="http://code.google.com/p/jing-trang/issues/list" target="_blank">issue tracker</a> to let me know what you would like.  Google Code has a system that allows you to vote for issues: if you are logged in, which you can do with a regular Google account, each issue will be displayed with a check box next to a star; checking this box "stars" the issue for you, which both adds a vote for the issue and gets you email notifications about changes to it.</p> <p>I haven't started any project-specific mailing lists yet.  For developers, the issue tracker seems to be enough at the moment.  For users, Jing and Trang are within the scope of the existing <a href="http://tech.groups.yahoo.com/group/rng-users/" target="_blank">RELAX NG Users mailing list</a> on Yahoo Groups.</p> James Clarkhttp://www.blogger.com/profile/04798042939786677843noreply@blogger.com2tag:blogger.com,1999:blog-3944976411672994427.post-29546227925468685202008-10-17T14:48:00.001+07:002008-10-17T14:48:28.096+07:00XML 1.0 5th edition<p>Rather late in the day, I sent a <a href="http://lists.w3.org/Archives/Public/xml-editor/2008OctDec/0016.html">comment</a> in on the proposed XML 1.0 5th Edition. 
For some background, read <a href="http://norman.walsh.name/2008/02/07/xml105e">Norman Walsh</a>, <a href="http://recycledknowledge.blogspot.com/2008/02/which-characters-are-excluded-in-xml.html">John Cowan</a>, <a href="http://dpcarlisle.blogspot.com/2008/10/xml-10-fifth-edition.html">David Carlisle</a> and <a href="http://lists.w3.org/Archives/Public/xml-editor/2008OctDec/0003.html">Henry Thompson</a>.</p> <p>There is a real problem here, and it's partly my fault. If we had had a bit more foresight ten years ago, we would have made the 1st edition of XML 1.0 say what is now being proposed for the 5th Edition. I know that the XML Core WG are trying to do the right thing, but I really don't think this is a good idea.</p> <p>I think you've got to look at the impact of the change not just on XML 1.0 but on the whole universe of specs that are built on top of XML 1.0.  In an ideal world, all the specs that refer to XML 1.0 would have carefully chosen whether to make a dated or an undated reference to XML 1.0, and would have done so consistently and with a full consideration of the consequences of the choice.  In practice, I don't believe this has happened.  Indeed, before the 5th edition, I believe very few people would have considered that XML might make a fundamental change to its philosophy about which characters were allowed in names while still keeping the same version number.</p> <p>Even W3C specs don't get this right. In particular, XML Namespaces 1.0 gets completely broken by this (as my comment explains).</p> <p>Now you can argue that the breakage and chaos that the 5th edition would cause is due to bugs in the specs that reference XML 1.0. But that doesn't make the breakage any less real.</p> <p>I also have the rather heretical view that the benefits of the change are small.  In terms of Unicode support, what's vitally important is that any Unicode character is allowed in attribute values and character data.  And XML 1.0 has always supported that.  
This change is just about the Unicode characters allowed in element and attribute  names (and entity names and processing instruction targets).</p> <p>I see relatively little use of non-ASCII characters in element and attribute names.  A user who is technical enough to deal with raw XML markup can deal with ASCII element/attribute names.  For less technical users who want to see element/attribute names in their native language, using native language markup is not a good solution, because it only allows a document or schema to be localized for a single language. An XML editor can provide a much better solution by supporting schema annotations that allow an element or attribute to be given friendly names in multiple languages.  So a Thai user editing a document using the schema can work with Thai element/attribute names, and an English user working with the same document can see English names.</p> <p>This is just following basic I18N principles of storing/exchanging information in a language neutral form, and then localizing it when you present it to a particular user. (This is the same reason why it's perfectly OK from an I18N perspective for XML Schema Datatypes just to support one specific non-localized format for dates/times.)</p> <p>Perhaps this is part of the reason why there was so little enthusiasm for XML 1.1, and why there seems to be little interest in doing the 5th edition change as an XML 1.2.</p> <p>One case where I can see real value in adopting  more permissive rules for names is in XML Schema Datatypes, because this relates to character data and attribute values.  
But it seems like you could easily fix this, without any of the problems that the 5th edition would cause, by introducing a couple of new datatypes into XML Schema.</p> Unknownnoreply@blogger.com9tag:blogger.com,1999:blog-3944976411672994427.post-79483501643681333352007-12-09T08:50:00.001+07:002007-12-09T08:50:29.306+07:00HTTPbis<p><a href="http://www.mnot.net">Mark Nottingham</a> <a href="http://www.mnot.net/blog/2007/12/09/rfc2616bis">explains</a> the work being done in the IETF to revise HTTP. It sounds to me like they're doing exactly the right thing, focusing on producing a better spec that brings light to some of the darker corners of the protocol and reduces the gap between what the spec says and what you actually need to implement to achieve interoperability. It's good to see that capable people have stepped up to put in the not inconsiderable time and effort that's needed for this unglamorous but very useful work.</p> Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3944976411672994427.post-25530224363917248132007-12-07T14:37:00.001+07:002007-12-07T14:37:14.061+07:00Thai personal names<p>There's an election coming up in Thailand on December 23rd and the streets are lined with election posters. As a bit of an i18n geek, I find it interesting that the posters almost all make the candidates' first names at least twice as big as their last names. If you're also an i18n geek, your reaction might well be: "it must be because Thais write their family name first, followed by their given name". But you would be wrong. Thais have a given name and a family name; the given name is written first, and the family name last. </p> <p>The correct explanation is that given names play a role in Thai culture that is similar to the role that family names play in many Western cultures. The polite way to address somebody is with an honorific followed by their given name. 
The Thai telephone book is sorted with given names as the primary key and family names as the secondary key.</p> <p>(I have to say that this has led me to question what I perceive to be the i18n orthodoxy that it's more i18n-ly correct to talk of given name/family name than first name/last name. Why does it matter whether a name is a family name or a given name? Surely what matters is the cultural role that the name plays.)</p> <p>I guess that historically the main reason for the dominance of given names in Thai culture is that family names are a relatively recent innovation: they were introduced by King Rama VI towards the beginning of the 20th century. Family names were allocated to families systematically and the use of family names is still controlled by the government. Any two people in Thailand with the same family name are related. This leads to Thai family names being quite a mouthful. Here's a sample from people in the news over the past couple of days: Leophairatana, Tantiwittayapitak, Boonyaratkalin. Even Thais have difficulty remembering each other's family names.</p> <p>If you become a Thai citizen, you have to choose a new, unused family name. Just as with domain names, all the good, short names have gone. So the more recently your family has become Thai, the longer and more unwieldy your family name is likely to be.</p> <p>Thai given names usually have at least two or three syllables. There aren't any given names that are as commonly used in Thai culture as the most popular given names in Western cultures. I've never come across a situation where two living Thais share the same given name and family name. You would certainly never get the situation of hundreds of people having the same given name and family name (like "James Clark").</p> <p>Thais rarely use the <a href="mailto:First.Last@domain">First.Last@domain</a> convention for email. It would be too unwieldy. 
The conventions I've seen most often are <a href="mailto:First.La@domain">First.La@domain</a> and <a href="mailto:First.L@domain">First.L@domain</a> (i.e. use only the first one or two characters of the last name).</p> <p>Another I18N wrinkle is that Thais' official given and family names are in Thai script, not in Roman script. But in many situations Thais use romanized versions of their names. And while there is a standard way (actually several standard ways) of romanizing Thai, the convention is that the correct romanization of any personal name is what the holder of the name wishes it to be. (Thus, your application may need to store two versions of names: the Thai script version and the romanized version.)</p> <p>With honorifics, I think the nastiest gotcha from an i18n perspective is that, while the given and family name are conventionally written separated by a space, there is no separator between the honorific and the given name. (Words in Thai are normally not separated by spaces.) This applies only in Thai script. When romanized, you would need a space between the honorific and the given name.</p> <p>Since given names are used in Thai culture somewhat like family names are used in some Western cultures, you might be wondering what serves the role that given names serve in Western cultures. All Thais have a name referred to as a "chue len". This is typically translated as "nickname", but it has a more important role in Thai culture than a nickname does in Western culture. I think it would be more accurate to describe it as an "informal given name". Parents give each of their children a chue len, in addition to a formal given name. You would typically use a chue len to address somebody in contexts where in England you might use their first name.</p> <p>Whereas formal given names are restricted to names that the bureaucrats of the interior ministry deem appropriate, parents can and do follow their personal whims when it comes to the chue len. 
For example, a former employee of mine was called "Mote", which was abbreviated from "remote", as in TV remote control. (This illustrates another interesting aspect of Thai culture: words are commonly shortened by omitting all except the last syllable. For example, a kilo is often referred to as a "lo".)</p> <p>In perhaps 80% of cases the chue len is a single syllable. It's often very difficult to romanize these. Thai has tones as well as one of the richest collections of vowels of any language. Most romanization schemes don't preserve subtle differences in tones and vowels. Whereas this is workable with formal given names and family names, which usually have many syllables and some redundancy, if you don't get the vowel or tone of a chue len exactly right, it becomes another name. For example, another of my employees has a name that sounds like the second syllable of the word "apple", but with the "l" changed to an "n", and pronounced in an emphatic (falling) tone. I can write that sound unambiguously in Thai, but I've no idea how to write it in English.</p> <p>Occasionally the chue len is a shortened version of the given name, but more often it is completely unrelated. If you know somebody only in a relatively informal social context, it is quite likely that you will know only their chue len and not their formal given name or family name.</p> <p>I think it would be quite challenging to design an address book application that deals with all this naturally. No application I've used does a good job, and indeed it's not immediately obvious to me what the right approach to handling this is. (However, I suspect an approach based on adding markup to the display name will work better than trying to figure out a set of database fields.)</p> <p>Of course, it becomes even more difficult if you want to deal with complexities that arise in other cultures. 
I'm sure that just as personal names in Thai culture have some features that are surprising from a Western perspective, there must be many other cultures where personal names have equally surprising features. I would love to learn more about these. If anybody can blog or comment with additional information, that would be great.</p> <p>(Any Thais reading this, please feel free to add comments correcting anything I've got wrong or adding any important points I've missed.)</p> Unknownnoreply@blogger.com51tag:blogger.com,1999:blog-3944976411672994427.post-77457427687769124412007-11-03T13:58:00.001+07:002007-11-03T13:58:29.319+07:00Strategies for using open source in the Thai software industry<p>The following is adapted from the slides of a presentation I gave yesterday on how the Thai software industry can benefit from open source. I think a more important problem is how the country as a whole can benefit from open source, but that wasn't what I was asked to talk about. Also note that the objective here is not to help open source but to help the Thai software industry. 
I think most, if not all of this, is applicable to other countries at a stage of development similar to Thailand's.</p> <h5>Application platform</h5> <ul> <li>Applications need server platform, including <ul> <li>OS <li>Database <li>Web server, framework</li></ul> <li>Open source server platform is at least as good in quality as proprietary platforms <li>Platform does not compete with local software industry <li>Using open source on the server does not require users to move away from familiar Windows desktop environment <li>Virtualization enables applications built on fully open source application platform to be deployed on Windows <li>Trend towards web-based applications, where everything is on the server <li>Avoids cost of platform software licenses, according to business model <ul> <li>Licensing software: users save cost <li>Appliance, software as a service: producer saves cost</li></ul> <li>Licensing issues <ul> <li>Software as a service: no issues <li>Licensing software: must keep separation between proprietary and open source parts (no linking) <li>Appliance: must make some parts of source code available to customers</li></ul> <li>Mixed strategies also possible (e.g. Oracle on Linux, PHP on Windows)</li></ul> <h5>Development tools</h5> <ul> <li>Traditional strength of open source <li>Java-based IDEs (e.g. Eclipse, NetBeans) <ul> <li>Written in Java, but support many kinds of development in addition to Java, e.g. C/C++, Web <li>Several companies adopting Eclipse as base (e.g. 
Nokia) <li>Main advantage compared to Microsoft is no lock-in to the Microsoft application platform <li>Cost not the key issue: Microsoft makes development tools available to ISVs at low cost</li></ul> <li>Collaboration tools <ul> <li>Open source community has evolved exceptionally effective collaboration tools because <ul> <li>it is highly distributed <li>it only adopts process to the extent that it actually delivers results</li></ul> <li>Proprietary tools expensive <li>Key tools <ol> <li>Version control (CVS, Subversion, Mercurial) <li>Issue tracking (Bugzilla, Trac)</li></ol></li></ul></li></ul> <h5>Education and professional development</h5> <ul> <li>Participation in open source projects builds skills that universities often fail to teach <ul> <li>Communication, especially English language <li>Cooperation <li>Working with large programs <li>Modifying existing programs as opposed to creating new programs</li></ul> <li>Opportunity to work with world-class developers <li>Helps career of individual developer by building a personal brand <ul> <li>Opportunity to get work overseas <li>Improves chances of getting into a good US graduate school</li></ul> <li>Builds highly motivated developers with world-class skills, who wish to pursue a technical career <li>Useful both at student and professional level <li>Should emphasize participation in existing, successful, international projects <li>Be highly selective about starting new projects <ul> <li>Successful, large open source projects could help build image of sponsor organization or Thailand generally <li>But very difficult to create a really successful, large open source project <li>Choose an area where no open source solution is yet available; opportunities still exist <li>Need to choose projects that can benefit rather than compete with the local software industry</li></ul> <li>Individuals must choose projects they are passionate about</li></ul> <h5>Embedded software</h5> <ul> <li>Hardware sales provide a well-understood business model 
<li>Trend to Linux as OS for embedded systems <ul> <li>Increased power of embedded devices <li>Need for strong networking capabilities</li></ul> <li>Opportunity for electronics industry to move up the value chain</li></ul> <h5>Fully open source business model</h5> <ul> <li>Product is fully open source <li>Possible for small company to achieve large market share because of <ul> <li>No licensing cost <li>Contribution of open source community <li>Examples: JBoss, MySQL</li></ul> <li>Business model based on support, consulting, training <li>Not an easy strategy</li></ul> <h3>E4X not in ES4</h3> <p><em>2007-10-31</em></p> <p>I was surprised to find that <a href="http://www.ecmascript.org/es4/spec/overview.pdf">ES4</a> does not fold in <a href="http://www.ecma-international.org/publications/standards/Ecma-357.htm">E4X</a> (although it reserves the syntax). I had always viewed E4X as one of the smoothest integrations of XML into a scripting language. However, it seems that once you dig a bit deeper, it has some <a href="http://groups.google.com/group/mozilla.dev.tech.js-engine/browse_thread/thread/6566b430328bc3ef/9f9005c6525464c6">problems</a>.</p> <h3>Optional typing in ES4</h3> <p><em>2007-10-31</em></p> <p><a href="http://www.ecmascript.org/es4/spec/overview.pdf">ES4</a> takes a very interesting approach to typing. They've added static typing but made it completely optional. Variable declarations can optionally be annotated with a type declaration. However, the annotations don't change the run-time semantics of the language. The only effect of the annotations is that if you run the program in strict mode, then the program will be verified before execution and rejected if type errors are found. 
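</p>

<p>This is essentially what later became known as gradual typing, and Python's optional type hints work the same way, so they make a convenient analogy. The following sketch is Python, not ES4 syntax, and is my illustration rather than anything from the ES4 overview: annotations are optional, a separate static checker may verify them, and the runtime ignores them entirely.</p>

```python
# ES4-style optional typing, sketched by analogy in Python: annotations
# are optional, a checker may verify them, the runtime ignores them.

def greet(name: str, times: int) -> str:
    return ", ".join([name] * times)

def shout(name):            # untyped code coexists freely with typed code
    return name.upper() + "!"

print(greet("hello", 3))    # prints: hello, hello, hello
print(shout("hi"))          # prints: HI!

# A static checker (e.g. mypy, playing the role of ES4's strict mode)
# would reject the following call before execution, because "times" is
# annotated int but given a str. The interpreter, like a non-strict ES4
# implementation, never looks at the annotation and would only fail at
# run time:
# greet("hello", "3")
```

<p>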
Implementations don't have to support strict mode. You can still have simple, small-footprint implementations that do all checks dynamically. Users who don't want to be bothered with types can write programs without having to learn anything about the type system. </p> <p>There's a good paper by Gilad Bracha on <a href="http://pico.vub.ac.be/~wdmeuter/RDL04/papers/Bracha.pdf">Pluggable Type Systems</a> that explains why type systems should be optional, not mandatory. I think he's right. The dichotomy between statically and dynamically typed languages is false: an optional type system allows you to have the benefits of both. The paper goes further and argues that type systems should be not merely optional but pluggable. I'm not convinced of this. Pluggable type systems are a great idea if you are a language designer who wants to experiment with type systems; but for a production language, I think it's a fundamental responsibility of the language designer to choose a single type system.</p> <p>Anyway, it's great to see optional typing being adopted by a mainstream language.</p> <h3>ECMAScript Edition 4</h3> <p><em>2007-10-31</em></p> <p>The <a href="http://www.ecmascript.org/">group</a> working on the next version of ECMAScript (ES4) have <a href="https://mail.mozilla.org/pipermail/es4-discuss/2007-October/001279.html">released</a> a <a href="http://www.ecmascript.org/es4/spec/overview.pdf">language overview</a>. There's a lively <a href="https://mail.mozilla.org/pipermail/es4-discuss/2007-October/001391.html">discussion</a> on the mailing list about some of the politics behind the evolution of ES4. 
(The situation appears to be that Microsoft doesn't want major new features in ECMAScript, whereas Mozilla and Adobe want to evolve it rather dramatically.)</p> <h3>Signing HTTP requests</h3> <p><em>2007-10-29</em></p> <p>When I first started thinking about signing HTTP responses, I assumed that signing HTTP requests was a fairly similar problem and that a single solution could deal with signing requests as well as responses. But after thinking about it some more, I'm not so sure.</p> <p>The first thing to bear in mind is that signing an HTTP request or response is not an end in itself, but merely a mechanism to achieve a particular goal. The purpose of the proposal that I've been developing in this series of posts is to allow somebody who receives a representation of a resource to verify the integrity and origin of that representation; the mechanism for achieving this is signing HTTP responses.</p> <p>The second thing to bear in mind is the advantages of this proposal over https. Realistically, there's not much point to a proposal in this space unless it has compelling advantages over https. 
There are two advantages that I find compelling:</p> <ul> <li>better performance: clients can verify the integrity of responses without negatively impacting HTTP caching, whereas requests and responses that go over https cannot be cached by proxies;</li> <li>persistent non-repudiation: by this I mean that a client that verifies the integrity and origin of a resource can easily persist metadata that makes it possible subsequently to prove to a third party what was verified.</li></ul> <p>One key factor that allows these advantages is that the proposal does not provide confidentiality.</p> <p>As compared to other approaches to signing messages (such as S/MIME), the key advantage is that the signature will be automatically ignored by clients that don't understand it, just by virtue of normal HTTP extensibility rules.</p> <p>If we turn to signing HTTP requests, or more specifically HTTP GET requests, none of the above considerations apply.</p> <ul> <li>The goal of signing an HTTP GET request is typically to allow the server to restrict access to resources.</li> <li>If you're really serious about restricting access to resources, and you want to protect against malicious proxies, then you will want to protect the confidentiality of the response; if the request includes a signature that says <em>x</em> authorizes <em>y</em> to access resource <em>r</em> at time <em>t</em>, then the representation of <em>r</em> in the response ought to be encrypted using <em>y</em>'s public key.</li> <li>Furthermore, if a server is restricting access to resources, then the signature on the request can't be optional, so the advantage over other message signing approaches such as S/MIME disappears.</li> <li>Adding signatures to HTTP GET requests is inherently going to inhibit caching. 
A cached response to a request signed by <em>x</em> for resource <em>r</em> cannot in general be used to respond to a request signed by <em>y</em> for resource <em>r</em>.</li> <li>Neither of the compelling advantages (better performance and persistent non-repudiation) that I mentioned above applies any longer.</li></ul> <p>On the other hand, if we consider signing HTTP PUT (and possibly POST) requests, then there seems to be more commonality. Signing an HTTP PUT request serves the goal of allowing the server to verify the integrity and origin of the representation of a resource transferred from the client. Although I don't think there will be a significant performance advantage over https, persistent non-repudiation could be useful.</p> <p>I think my conclusion is that it's better to think of the proposal not as a proposal for signing HTTP responses, but as a proposal for allowing verification of the origin and integrity of transfers of representations of resources. When considered in this light, signing of HTTP GET requests doesn't really fit in.</p> <p>By the way, I'm not saying HTTP request signing isn't a useful technique. For example, OAuth is using it to solve an important problem: allowing users to grant applications limited access to private resources. But I think that's a very different problem from the problem that I'm trying to solve.</p>
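<p>The mechanics the post relies on can be sketched as follows. This is my own minimal illustration, not the actual proposal: it covers only the response body, whereas a real scheme would also cover the status line and selected headers; it uses HMAC purely because that is in the Python standard library, although HMAC's shared key cannot provide the persistent non-repudiation discussed above, which requires a public-key signature; and the <code>X-Signature</code> header name is a made-up placeholder. What it does show is the key extensibility property: the signature travels in an ordinary header that signature-unaware clients simply ignore.</p>

```python
import hmac
import hashlib

# Illustration only. A real scheme needs a public-key signature so a
# third party can verify it; HMAC is used here because it is in the
# standard library. "X-Signature" is a hypothetical header name.
KEY = b"origin-server-secret"

def sign_response(body: bytes, headers: dict) -> dict:
    """Return headers with a detached signature over the body added."""
    sig = hmac.new(KEY, body, hashlib.sha256).hexdigest()
    return {**headers, "X-Signature": sig}

def verify_response(body: bytes, headers: dict) -> bool:
    """Clients that know the scheme verify; others ignore the header."""
    sig = headers.get("X-Signature")
    if sig is None:
        return False  # unsigned response: nothing to verify
    expected = hmac.new(KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

body = b"<p>resource representation</p>"
headers = sign_response(body, {"Content-Type": "text/html"})
assert verify_response(body, headers)             # intact body verifies
assert not verify_response(b"tampered", headers)  # modification detected
```

<p>Because the signature is just a header over the body, a cache can store the signed response once and serve it unchanged to many verifying clients, which is the performance property the post contrasts with https.</p>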