2008-11-17

What's allowed in a URI?

Java 1.4 introduced the java.net.URI class, which provides RFC 2396-compliant URI handling. I thought I should try to fix Jing and Trang to use it, so I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.

It's convenient to begin with XLink.  Section 5.4 requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as

all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732

If we look at 2.4.3 of RFC 2396 (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:

  • control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
  • space       = <US-ASCII coded character 20 hexadecimal>
  • delims      = "<" | ">" | "#" | "%" | <">
  • unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Section 3 of RFC 2732 (which modifies RFC 2396 to handle IPv6 addresses)  does indeed allow square brackets by removing them from the 'unwise' set.

Putting these all together, we can distinguish the following categories of characters that are allowed by XLink but not allowed by RFC 2396/RFC 2732:

  1. C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
  2. space (#x20)
  3. disallowed ASCII graphic characters, specifically: <>"{}|\^`
  4. delete (#x7F)
  5. non-ASCII Unicode characters, excluding surrogates: #x80-#xD7FF and #xE000-#x10FFFF (XML does not allow #xFFFE and #xFFFF)
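
As a rough illustration, the five categories can be written down as a small Java classifier. This is only a sketch: the class and method names (XLinkExtraChars, category) are made up for this example, and the ranges are copied directly from the list above.

```java
/** Sketch only: maps a code point to one of the five categories listed above. */
final class XLinkExtraChars {
    // Category 3: ASCII graphic characters disallowed by RFC 2396/RFC 2732.
    private static final String DISALLOWED_ASCII_GRAPHIC = "<>\"{}|\\^`";

    /** Returns 1 to 5 for the categories above, or 0 if the code point is in none of them. */
    static int category(int cp) {
        if (cp >= 0x00 && cp <= 0x1F) return 1;                    // C0 controls
        if (cp == 0x20) return 2;                                  // space
        if (DISALLOWED_ASCII_GRAPHIC.indexOf(cp) >= 0) return 3;   // <>"{}|\^`
        if (cp == 0x7F) return 4;                                  // delete
        if ((cp >= 0x80 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0x10FFFF)) return 5;     // non-ASCII, surrogates excluded
        return 0;                                                  // allowed by RFC 2396/RFC 2732, or a surrogate
    }
}
```

Note that '#', '%', '[' and ']' fall into none of the categories, which matches the XLink wording quoted earlier.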

Looking at the various XML-related specs, things seem to be nicely aligned:

XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). That seems like a bug to me.  Erratum E39 adds the following to the first paragraph of the spec:

For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.

This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how the "Thus" follows from what precedes it. In any case, XSLT 2.0 gets it right: it references xs:anyURI.

RFC 2396 has been obsoleted by RFC 3986.  This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.

Next we have the IRI spec, RFC 3987. This defines:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

It adds ucschar to the set of unreserved characters and adds iprivate to what's allowed in the query of a URI. The characters in my category 5 that are in neither ucschar nor iprivate are as follows:

  • C1 controls: #x80 - #x9F
  • the 66 Unicode noncharacters: #xFDD0 - #xFDEF, and any code point whose bottom 16 bits are FFFE or FFFF
  • Specials: #xFFF0 - #xFFFD; these fall into three groups, unassigned specials (#xFFF0 - #xFFF8), annotation characters (#xFFF9 - #xFFFB) and replacement characters (#xFFFC - #xFFFD)
  • Language tags: #xE0000 - #xE0FFF
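
To make these gaps easy to check, here is a sketch of the two RFC 3987 productions as Java predicates. The class and method names (Rfc3987Chars, isUcschar, isIprivate) are invented for this example; the ranges are transcribed from the ABNF quoted above, with the per-plane ranges for planes 1 to 14 folded into one test.

```java
/** Sketch only: the ucschar and iprivate productions of RFC 3987 as predicates. */
final class Rfc3987Chars {
    static boolean isUcschar(int cp) {
        if (cp >= 0xA0 && cp <= 0xD7FF) return true;
        if (cp >= 0xF900 && cp <= 0xFDCF) return true;
        if (cp >= 0xFDF0 && cp <= 0xFFEF) return true;
        if (cp >= 0x10000 && cp <= 0xEFFFD) {
            // Each plane stops at xFFFD, so the two noncharacters at the end are out.
            if ((cp & 0xFFFF) > 0xFFFD) return false;
            // Plane 14 only starts at E1000, so E0000-E0FFF is out.
            return cp < 0xE0000 || cp >= 0xE1000;
        }
        return false;
    }

    static boolean isIprivate(int cp) {
        return (cp >= 0xE000 && cp <= 0xF8FF)
            || (cp >= 0xF0000 && cp <= 0xFFFFD)
            || (cp >= 0x100000 && cp <= 0x10FFFD);
    }
}
```

With these, a C1 control such as #x85, a noncharacter such as #xFDD0, a special such as #xFFFD and a language tag character such as #xE0001 are all rejected by both predicates.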

I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters.  You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.

RFC 3987 also has the following in Section 3.2:

Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`"

Those characters correspond to my categories 2 and 3. Overall there are a lot of subtle differences between IRIs and the thing that is currently allowed by XML specs.

Fortunately there is a draft of a new version of the IRI spec. This introduces Legacy Extended IRI (LEIRI) references and, for them, defines ucschar as:

   ucschar        = " " / "<" / ">" / '"' / "{" / "}" / "|"
                     / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                     / %xE000-FFFD / %x10000-10FFFF

which exactly corresponds to my categories 1 to 5.
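
The correspondence can be checked mechanically. The following is a sketch under my own assumptions: the names (LeiriCheck, isLeiriUcschar, inCategories1To5, differences) are invented here, the first predicate is transcribed from the draft's ucschar as quoted above, and the second takes the ranges of categories 1 to 5 literally, without applying the parenthetical note about XML.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: compares the draft's LEIRI ucschar with categories 1 to 5. */
final class LeiriCheck {
    private static final String ASCII_EXTRAS = "<>\"{}|\\^`";

    /** ucschar as quoted from the draft of the new IRI spec. */
    static boolean isLeiriUcschar(int cp) {
        return cp == 0x20
            || ASCII_EXTRAS.indexOf(cp) >= 0
            || (cp >= 0x00 && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Categories 1 to 5, reading the listed ranges literally. */
    static boolean inCategories1To5(int cp) {
        return (cp >= 0x00 && cp <= 0x1F)      // 1: C0 controls
            || cp == 0x20                      // 2: space
            || ASCII_EXTRAS.indexOf(cp) >= 0   // 3: disallowed ASCII graphic characters
            || cp == 0x7F                      // 4: delete
            || (cp >= 0x80 && cp <= 0xD7FF)    // 5: non-ASCII, surrogates excluded
            || (cp >= 0xE000 && cp <= 0x10FFFF);
    }

    /** Returns every code point on which the two predicates disagree. */
    static List<Integer> differences() {
        List<Integer> diff = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isLeiriUcschar(cp) != inCategories1To5(cp)) diff.add(cp);
        }
        return diff;
    }
}
```

Read this way, the only disagreements are #xFFFE and #xFFFF, which are exactly the two code points that category 5 already notes XML does not allow.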

LEIRIs seem like a very useful innovation.  XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.

Finally we are ready to look at java.net.URI. This allows URIs to contain an additional set of "other" characters which consist of non-ASCII characters with the exception of:

  • C1 controls (#x80 - #x9F)
  • Characters with a category of Zs, Zl or Zp

This means that if you want to give an LEIRI such as an XML system identifier to java.net.URI, you first need to percent-encode any of the following:

  • the following ASCII graphic characters: <>"{}|\^`
  • C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents
  • space (#x20)
  • delete (#x7F)
  • C1 controls (#x80 - #x9F)
  • Characters with a category of Zs, Zl or Zp

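All except the first can be tested with Character.isISOControl(c) || Character.isSpaceChar(c). (Character.isSpaceChar is the method that tests for categories Zs, Zl and Zp; the deprecated Character.isSpace only recognizes a handful of ASCII whitespace characters.)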

Note that you don't want to blindly percent-encode all non-ASCII characters, because that would needlessly make IRIs containing non-ASCII characters unintelligible to humans.
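
Here is a minimal sketch of that escaping step. It is not code from Jing or Trang: the class and method names (LeiriEscaper, needsEscape, escape, toUri) are invented for illustration, it assumes percent-encoding of the UTF-8 bytes, and it assumes Java 7 or later for StandardCharsets. It uses Character.isSpaceChar, the method that tests for categories Zs, Zl and Zp. Unpaired surrogates are passed through untouched rather than rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

/** Sketch only: percent-encodes just the characters java.net.URI would reject. */
final class LeiriEscaper {
    private static final String DISALLOWED_ASCII_GRAPHIC = "<>\"{}|\\^`";

    static boolean needsEscape(int cp) {
        return Character.isISOControl(cp)      // C0 controls, delete, C1 controls
            || Character.isSpaceChar(cp)       // space and everything else in Zs, Zl, Zp
            || DISALLOWED_ASCII_GRAPHIC.indexOf(cp) >= 0;
    }

    static String escape(String leiri) {
        StringBuilder out = new StringBuilder(leiri.length());
        for (int i = 0; i < leiri.length(); ) {
            int cp = leiri.codePointAt(i);
            i += Character.charCount(cp);
            if (!needsEscape(cp)) {
                out.appendCodePoint(cp);       // other non-ASCII characters stay readable
                continue;
            }
            byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit((b >> 4) & 0xF, 16)));
                out.append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return out.toString();
    }

    /** Existing '%', '#', '[' and ']' are left alone, so already-escaped input is not double-escaped. */
    static URI toUri(String leiri) throws URISyntaxException {
        return new URI(escape(leiri));
    }
}
```

For example, LeiriEscaper.escape("http://example.com/my file.xml") gives "http://example.com/my%20file.xml", while "http://example.com/café" comes back unchanged and is still accepted by java.net.URI.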

5 comments:

Damian said...

You could save some work by looking at the jena IRI library. Jeremy Carroll spent some time navigating the numerous RFCs, and I believe the code is partially driven by them (?).

You can find him at: jeremy (at) topquadrant.com

村田 said...

This is extremely helpful. Thanks.

elharo said...

James,

RFC 2396 (and java.net.URI) are seriously out of date. Please upgrade to RFC 3986, and consider either borrowing or rolling your own URI handling code.

Martin Probst said...

I wonder if writing a compliant implementation to some spec has to be this hard. In particular the areas related to XML schema always seem horribly confusing.

Some specs are certainly better written than others. Still, it might be nice if someone could wrap up all the confusing cross-references every now and then... this would probably greatly increase the number of conforming implementations (but who has the time ...).

Henry S. Thompson said...

Thanks for working through this. Maybe the Java folks can be persuaded to take notice.

BTW, RFC3987bis has been 'delayed'. In the interim, the XML Core WG, with permission from the RFC3987bis editors, has published the LEIRI section as a W3C Working Group Note: Legacy extended IRIs for XML resource identification