James Clark's Random Thoughts

Sunday, December 9, 2007

HTTPbis

Mark Nottingham explains the work being done in the IETF to revise HTTP. It sounds to me like they're doing exactly the right thing, focusing on producing a better spec that brings light to some of the darker corners of the protocol and reduces the gap between what the spec says and what you actually need to implement to achieve interoperability. It's good to see that capable people have stepped up to put in the not inconsiderable time and effort that's needed for this unglamorous but very useful work.

Friday, December 7, 2007

Thai personal names

There's an election coming up in Thailand on December 23rd and the streets are lined with election posters.  As a bit of an i18n geek, I find it interesting that the posters almost all make the candidates' first names at least twice as big as their last names.  If you're also an i18n geek, your reaction might well be: "it must be because Thais write their family name first, followed by their given name". But you would be wrong.  Thais have a given name and a family name; the given name is written first, and the family name last.

The correct explanation is that given names play a role in Thai culture that is similar to the role that family names play in many Western cultures. The polite way to address somebody is with an honorific followed by their given name. The Thai telephone book is sorted with given names as the primary key and family names as the secondary key.
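To make the sort order concrete, here is a minimal sketch in Python. The entries are made-up (given name, family name) pairs in romanized form, and the function names are my own. Note that plain string comparison is only a stand-in: proper collation of names in Thai script needs a Thai-aware collator.

```python
# Hypothetical entries as (given_name, family_name) pairs; illustrative only.
entries = [
    ("Somchai", "Tantiwittayapitak"),
    ("Anan", "Leophairatana"),
    ("Somchai", "Boonyaratkalin"),
]


def thai_phone_book_order(people):
    """Sort with given name as primary key and family name as secondary key."""
    return sorted(people, key=lambda p: (p[0], p[1]))


def western_phone_book_order(people):
    """Sort with family name as primary key and given name as secondary key."""
    return sorted(people, key=lambda p: (p[1], p[0]))


if __name__ == "__main__":
    for given, family in thai_phone_book_order(entries):
        print(given, family)
```

The only difference between the two orderings is which element of the pair comes first in the sort key.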

(I have to say that this has led me to question what I perceive to be the i18n orthodoxy that it's more i18n-ly correct to talk of given name/family name than first name/last name. Why does it matter whether a name is a family name or a given name? Surely what matters is the cultural role that the name plays.)

I guess that historically the main reason for the dominance of given names in Thai culture is that family names are a relatively recent innovation: they were introduced by King Rama VI towards the beginning of the 20th century. Family names were allocated to families systematically and the use of family names is still controlled by the government. Any two people in Thailand with the same family name are related. This leads to Thai family names being quite a mouthful.  Here's a sample from people in the news over the past couple of days: Leophairatana, Tantiwittayapitak, Boonyaratkalin. Even Thais have difficulty remembering each other's family names.

If you become a Thai citizen, you have to choose a new, unused family name.  Just as with domain names, all the good, short names have gone. So the more recently your family has become Thai, the longer and more unwieldy your family name is likely to be.

Thai given names usually have at least two or three syllables. There aren't any given names that are as commonly used in Thai culture as the most popular given names in Western cultures.  I've never come across a situation where two living Thais share the same given name and family name. You would certainly never get the situation of hundreds of people having the same given name and family name (like "James Clark").

Thais rarely use the First.Last@domain convention for email.  It would be too unwieldy. The conventions I've seen most often are First.La@domain and First.L@domain (i.e. use only the first one or two characters of the last name).
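As a small illustration of that convention, here is a sketch in Python. The function name, the sample name and the example.com domain are all assumptions made up for the example.

```python
def short_email(first, last, domain, last_chars=2):
    """Build First.La@domain (last_chars=2) or First.L@domain (last_chars=1)."""
    if last_chars < 1:
        raise ValueError("keep at least one character of the last name")
    return f"{first}.{last[:last_chars]}@{domain}"


if __name__ == "__main__":
    print(short_email("Somchai", "Leophairatana", "example.com"))
    print(short_email("Somchai", "Leophairatana", "example.com", last_chars=1))
```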

Another i18n wrinkle is that Thais' official given and family names are in Thai script, not in Roman script. But in many situations Thais use romanized versions of their names.  And while there is a standard way (actually several standard ways) of romanizing Thai, the convention is that the correct romanization of any personal name is what the holder of the name wishes it to be. (Thus, your application may need to store two versions of names: the Thai script version and the romanized version.)

With honorifics, I think the nastiest gotcha from an i18n perspective is that, while the given and family name are conventionally written separated by a space, there is no separator between the honorific and the given name. (Words in Thai are normally not separated by spaces.) This applies only in Thai script.  When romanized, you would need a space between the honorific and the given name.
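Here is a rough sketch, in Python, of what storing both versions and formatting them differently might look like. The class and function names are hypothetical, and the sample person is invented. The romanized form is stored as given by the holder rather than derived from the Thai form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameForm:
    honorific: str
    given: str
    family: str


@dataclass(frozen=True)
class PersonName:
    thai: NameForm   # official form, in Thai script
    roman: NameForm  # whatever romanization the holder prefers; never computed


def format_thai(name):
    """Thai script: no separator between honorific and given name."""
    return f"{name.honorific}{name.given} {name.family}"


def format_roman(name):
    """Romanized: a space between honorific and given name."""
    return f"{name.honorific} {name.given} {name.family}"


# Invented example person, for illustration only.
person = PersonName(
    thai=NameForm("คุณ", "สมชาย", "ใจดี"),
    roman=NameForm("Khun", "Somchai", "Jaidee"),
)

if __name__ == "__main__":
    print(format_thai(person.thai))
    print(format_roman(person.roman))
```

Going the other way is the hard part: because the Thai form has no separator, splitting a displayed name back into honorific and given name requires knowing the list of honorifics.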

Since given names are used in Thai culture somewhat like family names are used in some Western cultures, you might be wondering what serves the role that given names serve in Western cultures.  All Thais have a name referred to as a "chue len". This is typically translated as "nickname", but it has a more important role in Thai culture than a nickname does in Western culture.  I think it would be more accurate to describe it as an "informal given name". Parents give each of their children a chue len, in addition to a formal given name.  You would typically use a chue len to address somebody in contexts where in England you might use their first name.

Whereas formal given names are restricted to names that the bureaucrats of the interior ministry deem appropriate, parents can and do follow their personal whims when it comes to the chue len. For example, a former employee of mine was called "Mote", which was abbreviated from "remote", as in TV remote control. (This illustrates another interesting aspect of Thai culture: words are commonly shortened by omitting all except the last syllable. For example, a kilo is often referred to as a "lo".)

In perhaps 80% of cases the chue len is a single syllable. It's often very difficult to romanize these.  Thai has tones as well as one of the richest collections of vowels of any language. Most romanization schemes don't preserve subtle differences in tones and vowels.  Whereas this is workable with formal given names and family names, which usually have many syllables and some redundancy, if you don't get the vowel or tone of a chue len exactly right, it becomes another name. For example, another of my employees has a name that sounds like the second syllable of the word "apple", but with the "l" changed to an "n", and pronounced in an emphatic (falling) tone. I can write that sound unambiguously in Thai, but I've no idea how to write it in English.

Occasionally the chue len is a shortened version of the given name, but more often it is completely unrelated.  If you know somebody only in a relatively informal social context, it is quite likely that you will know only their chue len and not their formal given name or family name.

I think it would be quite challenging to design an address book application that deals with all this naturally.  No application I've used does a good job and indeed it's not immediately obvious to me what the right approach to handling this is.  (However, I suspect an approach based on adding markup to the display name will work better than trying to figure out a set of database fields.)

Of course, it becomes even more difficult if you want to deal with complexities that arise in other cultures. I'm sure that just as personal names in Thai culture have some features that are surprising from a Western perspective, there must be many other cultures where personal names have equally surprising features.  I would love to learn more about these. If anybody can blog or comment with additional information, that would be great.

(Any Thais reading this, please feel free to add comments correcting anything I've got wrong or adding any important points I've missed.)

Saturday, November 3, 2007

Strategies for using open source in the Thai software industry

The following is adapted from the slides of a presentation I gave yesterday on how the Thai software industry can benefit from open source. I think a more important problem is how the country as a whole can benefit from open source, but that wasn't what I was asked to talk about. Also note that the objective here is not to help open source but to help the Thai software industry. I think most, if not all, of this is applicable to other countries at a stage of development similar to Thailand's.

Application platform
  • Applications need server platform, including
    • OS
    • Database
    • Web server, framework
  • Open source server platform is at least as good in quality as proprietary platforms
  • Platform does not compete with local software industry
  • Using open source on the server does not require users to move away from familiar Windows desktop environment
  • Virtualization enables applications built on fully open source application platform to be deployed on Windows
  • Trend towards web-based applications, where everything is on the server
  • Avoids cost of platform software licenses, according to business model
    • Licensing software: users save cost
    • Appliance, software as a service: producer saves cost
  • Licensing issues
    • Software as a service: no issues
    • Licensing software: must keep separation between proprietary and open source parts (no linking)
    • Appliance: must make some parts of source code available to customers
  • Mixed strategies also possible (e.g. Oracle on Linux, PHP on Windows)
Development tools
  • Traditional strength of open source
  • Java-based IDEs (e.g. Eclipse, NetBeans)
    • Written in Java, but support many kinds of development in addition to Java, e.g. C/C++, Web
    • Several companies adopting Eclipse as base (e.g. Nokia)
    • Main advantage compared to Microsoft is no lock-in to Microsoft application platform
    • Cost not the key issue: Microsoft makes development tools available to ISVs at low cost
  • Collaboration tools
    • Open source community has evolved exceptionally effective collaboration tools because
      • it is highly distributed
      • it only adopts process to the extent that it actually delivers results
    • Proprietary tools expensive
    • Key tools
      1. Version control (CVS, Subversion, Mercurial)
      2. Issue tracking (Bugzilla, Trac)
Education and professional development
  • Participation in open source projects builds skills that universities often fail to teach
    • Communication, especially English language
    • Cooperation
    • Working with large programs
    • Modifying existing programs as opposed to creating new programs
  • Opportunity to work with world-class developers
  • Helps career of individual developer by building personal brand
    • Opportunity to get work overseas
    • Improves chances of getting into good US graduate school
  • Builds highly motivated developers with world-class skills, who wish to pursue technical career
  • Useful both at student and professional level
  • Should emphasize participation in existing, successful, international projects
  • Be highly selective about starting new projects
    • Successful, large open source projects could help build image of sponsor organization or Thailand generally
    • But very difficult to create a really successful, large open source project
    • Choose area where no open source solution is yet available; opportunities still exist
    • Need to choose projects that can benefit rather than compete with local software industry
  • Individuals must choose projects they are passionate about
Embedded software
  • Hardware sales provide well-understood business model
  • Trend to Linux as OS for embedded systems
    • Increased power of embedded devices
    • Need for strong networking capabilities
  • Opportunity for electronics industry to move up the value chain
Fully open source business model
  • Product is fully open source
  • Possible for small company to achieve large market share because of
    • No licensing cost
    • Contribution of open source community
    • Examples: JBoss, MySQL
  • Business model based on support, consulting, training
  • Not an easy strategy

Wednesday, October 31, 2007

E4X not in ES4

I was surprised to find that ES4 does not fold in E4X (although it reserves the syntax).  I had always viewed E4X as being one of the smoothest integrations of XML into a scripting language.  However, it seems that once you dig a bit deeper, it has some problems.

Optional typing in ES4

ES4 takes a very interesting approach to typing.  They've added static typing but made it completely optional.   Variable declarations can optionally be annotated with a type declaration.  However, the type declarations don't change the run-time semantics of the language.  The only effect of the declarations is that if you run the program in strict mode, then the program will be verified before execution and rejected if type errors are found.  Implementations don't have to support strict mode.  You can still have simple, small footprint implementations that do all checks dynamically.  Users who don't want to be bothered with types can write programs without having to learn anything about the type system.
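The same idea can be illustrated by analogy with Python's type annotations. This is not ES4 syntax, just a sketch of the general principle: the annotations are recorded but do not change what the program does when it runs, and a separate static checker (mypy, for instance) plays a role loosely comparable to strict mode.

```python
def double(x: int) -> int:
    """Annotated version: the types matter only to a static checker."""
    return x * 2


def double_untyped(x):
    """Unannotated version: identical behaviour at run time."""
    return x * 2


# A static checker would reject this call because "ab" is not an int.
# Run without a checker, it executes exactly like the untyped version.
loosely_typed_result = double("ab")

if __name__ == "__main__":
    print(double(3), double_untyped(3))
    print(loosely_typed_result == double_untyped("ab"))
```

Somebody who never runs the checker can ignore the annotations entirely, which is the property that makes the typing optional.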

There's a good paper by Gilad Bracha on Pluggable Type Systems that explains why type systems should be optional not mandatory.   I think he's right. The dichotomy between statically and dynamically typed languages is false: an optional type system allows you to have the benefits of both. The paper goes further and argues that type systems should be not merely optional but pluggable.  I'm not convinced on this.  Pluggable type systems are a great idea if you are a language designer who wants to experiment with type systems; but for a production language, I think it's a fundamental responsibility of the language designer to choose a single type system.

Anyway, it's great to see optional typing being adopted by a mainstream language.

ECMAScript Edition 4

The group working on the next version of ECMAScript (ES4) have released a language overview.  There's a lively discussion on the mailing list about some of the politics behind the evolution of ES4. (The situation appears to be that Microsoft doesn't want major new features in ECMAScript, whereas Mozilla and Adobe want to evolve it rather dramatically.)

Monday, October 29, 2007

Signing HTTP requests

When I first started thinking about signing HTTP responses, I assumed that signing HTTP requests was a fairly similar problem and that a single solution could deal with signing requests as well as responses.  But after thinking about it some more, I'm not so sure.

The first thing to bear in mind is that signing an HTTP request or response is not an end in itself, but merely a mechanism to achieve a particular goal.  The purpose of the proposal that I've been developing in this series of posts is to allow somebody that receives a representation of a resource to verify the integrity and origin of that representation; the mechanism for achieving this is signing HTTP responses.

The second thing to bear in mind is the advantages of this proposal over https.  Realistically, there's not much point to a proposal in this space unless it has compelling advantages over https. There are two advantages that I find compelling:

  • better performance: clients can verify the integrity of responses without negatively impacting HTTP caching, whereas requests and responses that go over https cannot be cached by proxies;
  • persistent non-repudiation: by this I mean a client that verifies the integrity and origin of a resource can easily persist metadata that makes it possible to subsequently prove what was verified to a third party.

One key factor that allows these advantages is that the proposal does not provide confidentiality.

As compared to other approaches to signing messages (such as S/MIME), the key advantage is that the signature will be automatically ignored by clients that don't understand it, just by virtue of normal HTTP extensibility rules.
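A minimal sketch of that extensibility property, in Python. The header name is invented for this example and is not taken from the proposal. The value is a plain SHA-256 digest used as a placeholder: a digest is not a signature, and a real scheme would need a public-key signature to establish origin or support non-repudiation.

```python
import hashlib

# Hypothetical header name, invented for this sketch.
SIGNATURE_HEADER = "X-Example-Signature"


def make_response(body: bytes) -> dict:
    """Build a toy response carrying an extra integrity header."""
    return {
        "headers": {
            "Content-Type": "text/plain",
            # Placeholder only: a digest, standing in for a real signature.
            SIGNATURE_HEADER: hashlib.sha256(body).hexdigest(),
        },
        "body": body,
    }


def legacy_client(response: dict) -> bytes:
    """Knows nothing about the extra header and never looks at it."""
    return response["body"]


def verifying_client(response: dict) -> bytes:
    """Understands the extra header and checks the body against it."""
    expected = response["headers"].get(SIGNATURE_HEADER)
    actual = hashlib.sha256(response["body"]).hexdigest()
    if expected is not None and expected != actual:
        raise ValueError("body does not match the value in the header")
    return response["body"]


if __name__ == "__main__":
    response = make_response(b"hello")
    print(legacy_client(response), verifying_client(response))
```

Both clients can consume the same response; only the one that understands the header gets the extra check.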

If we turn to signing HTTP requests, or more specifically HTTP GET requests, none of the above considerations apply.

  • The goal of signing an HTTP GET request is typically to allow the server to restrict access to resources.
  • If you're really serious about restricting access to resources, and you want to protect against malicious proxies, then you will want to protect the confidentiality of the response; if the request includes a signature that says x authorizes y to access resource r at time t, then the representation of r in the response ought to be encrypted using y's public key.
  • Furthermore, if a server is restricting access to resources, then the signature on the request can't be optional, so the advantage over other message signing approaches such as S/MIME disappears.
  • Adding signatures to HTTP GET requests is inherently going to inhibit caching. A cached response to a request signed by x for resource r cannot in general be used to respond to a request signed by y for resource r.
  • Neither of the compelling advantages (better performance, and persistent non-repudiation) which I mentioned above applies any longer.
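The caching point can be made concrete with a toy model in Python. The class and function names are made up, and this is a deliberately simplified cache, not how any real proxy works: once the signer becomes part of what identifies a request, a response cached for x can no longer serve y.

```python
class ToyProxyCache:
    """Very simplified cache that counts how often it must go to the origin."""

    def __init__(self):
        self.store = {}
        self.origin_fetches = 0

    def get(self, key):
        if key not in self.store:
            self.origin_fetches += 1
            self.store[key] = "representation of " + key[0]
        return self.store[key]


def unsigned_key(url):
    """Unsigned GET: the URL alone identifies the cached response."""
    return (url,)


def signed_key(url, signer):
    """Signed GET: the signer is part of what identifies the response."""
    return (url, signer)


# Two unsigned requests for the same resource share one cache entry.
unsigned_cache = ToyProxyCache()
unsigned_cache.get(unsigned_key("/r"))
unsigned_cache.get(unsigned_key("/r"))

# Requests signed by x and by y each need their own trip to the origin.
signed_cache = ToyProxyCache()
signed_cache.get(signed_key("/r", "x"))
signed_cache.get(signed_key("/r", "y"))

if __name__ == "__main__":
    print(unsigned_cache.origin_fetches, signed_cache.origin_fetches)
```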

On the other hand, if we consider signing HTTP PUT (and possibly POST) requests, then there seems to be more commonality.  Signing an HTTP PUT request serves the goal of allowing the server to verify the integrity and origin of the representation of a resource transferred from the client.  Although I don't think there will be a significant performance advantage over https, persistent non-repudiation could be useful.

I think my conclusion is that it's better to think of the proposal not as a proposal for signing HTTP responses, but as a proposal for allowing verification of the origin and integrity of transfers of representations of resources. When considered in this light, signing of HTTP GET requests doesn't really fit in.

By the way, I'm not saying HTTP request signing isn't a useful technique. For example, OAuth is using it to solve an important problem: allowing users to grant applications limited access to private resources. But I think that's a very different problem from the problem that I'm trying to solve.
