URI.js

Understanding URIs

Uniform Resource Identifiers (URI) can be one of two things, a Uniform Resource Locator (URL) or a Uniform Resource Name (URN). You likely deal with URLs most of the time. See RFC 3986 for a proper definition of the terms URI, URL and URN

URLs are used to address the individual resources of your website. URNs are usually used for hooking into other applications, as mailto:, magnet: or spotify: suggest. While RFC 3986 defines the structure of an URL in depth, URNs are not. The structure (and meaning) of URNs are up to their distinct specifications.

Components of an URI

RFC 3986 Section 3 visualizes the structure of URIs as follows:

URL:      foo://example.com:8042/over/there?name=ferret#nose
          \_/   \______________/\_________/ \_________/ \__/
           |           |            |            |        |
        scheme     authority       path        query   fragment
           |   _____________________|__
          / \ /                        \
URN:      urn:example:animal:ferret:nose

Components of an URL in URI.js

                         authority
                   __________|_________
                  /                    \
              userinfo                host                          resource
               __|___                ___|___                 __________|___________
              /      \              /       \               /                      \
         username  password     hostname    port     path & segment      query   fragment
           __|___   __|__    ______|______   |   __________|_________   ____|____   |
          /      \ /     \  /             \ / \ /                    \ /         \ / \
    foo://username:[email protected]:123/hello/world/there.html?name=ferret#foo
    \_/                     \ / \       \ /    \__________/ \     \__/
     |                       |   \       |           |       \      |
  scheme               subdomain  \     tld      directory    \   suffix
                                   \____/                      \___/
                                      |                          |
                                    domain                   filename

In Javascript the query is often referred to as the search. URI.js provides both accessors with the subtle difference of .search() beginning with the ?-character and .query() not.

In Javascript the fragment is often referred to as the hash. URI.js provides both accessors with the subtle difference of .hash() beginning with the #-character and .fragment() not.

Components of an URN in URI.js

    urn:example:animal:ferret:nose?name=ferret#foo
    \ / \________________________/ \_________/ \ /
     |               |                  |       |
  scheme       path & segment         query   fragment

While RFC 3986 does not define URNs having a query or fragment component, URI.js enables these accessors for convenience.

URLs - Man Made Problems

URLs (URIs, whatever) aren't easy. There are a couple of issues that make this simple text representation of a resource a real pain

Look simple but have tricky encoding issues
Domains aren't part of the specification
Query String Format isn't part of the specification
Environments (PHP, JS, Ruby, …) handle Query Strings quite differently

Parsing (seemingly) invalid URLs

Because URLs look very simple, most people haven't read the formal specification. As a result, most people get URLs wrong on many different levels. The one thing most everybody screws up is proper encoding/escaping.

http://username:pass:[email protected]/ is such a case. Often times homebrew URL handling misses escaping the less frequently used parts such as the userinfo.

http://example.org/@foo that "@" doesn't have to be escaped according to RFC3986. Homebrew URL handlers often just treat everything between "://" and "@" as the userinfo.

some/path/:foo is a valid relative path (as URIs don't have to contain scheme and authority). Since homebrew URL handlers usually just look for the first occurence of ":" to delimit the scheme, they'll screw this up as well.

+ is the proper escape-sequence for a space-character within the query string component, while every other component prefers %20. This is due to the fact that the actual format used within the query string component is not defined in RFC 3986, but in the HTML spec.

There is encoding and strict encoding - and Javascript won't get them right: encodeURIComponent()

Top Level Domains

The hostname component can be one of may things. An IPv4 or IPv6 address, an IDN or Punycode domain name, or a regular domain name. While the format (and meaning) of IPv4 and IPv6 addresses is defined in RFC 3986, the meaning of domain names is not.

DNS is the base of translating domain names to IP addresses. DNS itself only specifies syntax, not semantics. The missing semantics is what's driving us crazy here.

ICANN provides a list of registered Top Level Domains (TLD). There are country code TLDs (ccTLDs, assigned to each country, like ".uk" for United Kindom) and generic TLDs (gTLDs, like ".xxx" for you know what). Also note that a TLD may be non-ASCII .香港 (IDN version of HK, Hong Kong).

IDN TLDs such as .香港 and the fact that any possible new TLD could pop up next month has lead to a lot of URL/Domain verification tools to fail.

Second Level Domains

To make Things worse, people thought it to be a good idea to introduce Second Level Domains (SLD, ".co.uk" - the commercial namespace of United Kingdom). These SLDs are not up to ICANN to define, they're handled individually by each NIC (Network Information Center, the orgianisation responsible for a specific TLD).

Since there is no central oversight, things got really messy in this particular space. Germany doesn't do SDLs, Australia does. Australia has different SLDs than the United Kingdom (".csiro.au" but no ".csiro.uk"). The individual NICs are not required to publish their arbitrarily chosen SLDs in a defined syntax anywhere.

You can scour each NIC's website to find some hints at their SLDs. You can look them up on Wikipedia and hope they're right. Or you can use PublicSuffix.

Speaking of PublicSuffix, it's time mentioning that browser vendors actually keep a list of known Second Level Domains. They need to know those for security issues. Remember cookies? They can be read and set on a domain level. What do you think would happen if "co.uk" was treated as the domain? amazon.co.uk would be able to read the cookies of google.co.uk. PublicSuffix also contains custom SLDs, such as .dyndns.org. While this makes perfect sense for browser security, it's not what we need for basic URL handling.

TL;DR: It's a mess.

The Query String

PHP (parse_str()) will automatically parse the query string and populate the superglobal $_GET for you. ?foo=1&foo=2 becomes $_GET = array('foo' => 2);, while ?foo[]=1&foo[]=2 becomes $_GET = array('foo' => array("1", "2"));.

Ruby's CGI.parse() turns ?a=1&b=2 into {"a" : ["1", "2"]}, while Ruby on Rails chose the PHP-way.

Python's parse_qs() doesn't care for [] either.

Most other languages don't follow the []-style array-notation and deal with this mess differently.

TL;DR: You need to know the target-environment, to know how complex query string data has to be encoded

The Fragment

Given the URL http://example.org/index.html#foobar, browsers only request http://example.org/index.html, the fragment #foobar is a client-side thing.