Web Focus: Guidelines for URI Naming Policies

Brian Kelly

Brian Kelly with some guidelines for URI naming policies in his regular column.

“Cool URIs”

What are “cool URIs”? This term comes from advice provided by W3C (the World Wide Web Consortium). The paper “Cool URIs don’t change” [1] begins by saying:

What makes a cool URI?: A cool URI is one which does not change.
What sorts of URI change?: URIs don’t change: people change them.

All Web users will, sadly, be familiar with the 404 error message. But, as W3C point out, the 404 error message points not to a technical failure but to a human one - hence the warning: “URIs don’t change: people change them.”

To minimise the number of 404 error messages on their Web sites, organisations should develop URI naming guidelines which aim to minimise (or, ideally, eliminate) the need to change URIs. This article provides advice on URI naming policies.

At this point you may be confused by the use of the term URI; surely I should be using the term URL? As the W3C document on “Web Naming and Addressing Overview” makes clear, URI (Uniform Resource Identifier) refers to the “generic set of all names/addresses that are short strings that refer to resources” whereas URL (Uniform Resource Locator) is “an informal term (no longer used in technical specifications) associated with popular URI schemes: http, ftp, mailto, etc.” [2]

This article uses the term URI. However, readers may regard it as a synonym for URL (although this is not strictly correct).

Advice

Organisations should aim to provide persistent naming policies for resources on their Web sites. This is to ensure that visitors to your Web site do not get 404 error messages after you have restructured your Web site.

Why should you wish to restructure your Web site? Possible reasons for this include:

Your organisation has restructured (e.g. departments merged or shut down)
The initial structure of your Web site was not scalable.
You have moved from a distributed Web server environment to a centralised one (or vice versa).

All of these are possible scenarios, especially within the Higher Education community. There is little a Web manager can do about organisational restructuring. However, the knowledge that this is a possibility should be taken into account when developing a URI naming policy for a Web site. One possible solution would be to use URIs which reflect the functionality of the service, rather than the organisational structure which provides the service. This is, of course, fairly standard advice for the design of the interface and navigational structure of a Web site - however it is even more important for URI naming, as you can’t provide alternative views as you can with the user interface (e.g. an interface based on the organisation’s structure together with a site map which gives a user-oriented view).

If you need to restructure because the initial structure of your Web site was not scalable then you will have learnt the importance of spending time on planning and designing your URI naming policy.

As well as issues of the design of a scalable structure for URIs on your Web site (which may be difficult to solve and for which it is difficult to give anything but fairly general advice) there are also issues about the technical infrastructure. For example:

You have changed the backend architecture used to provide your Web site.
You have replaced a backend application.
You wish to make use of a new file format.
You wish to provide automated options for resources, based on the end user’s environment or preferences.
You wish to provide a bilingual Web site.
Usability testing shows that your existing URI naming policy causes problems for end users.

Ideally URIs on your Web site should be independent of the backend technologies used to deliver the Web services. The Office of the E-Envoy’s paper on “E-Government Interoperability Framework” [3] invites readers to “consult the web site for the latest version of the e-GIF specification at http://www.govtalk.gov.uk/interoperability/egif.asp”.

This URL is clearly dependent on Microsoft’s ASP (Active Server Pages) technology. If the E-Envoy’s Office chose, at some time in the future, to deploy an alternative backend technology (such as PHP, Java Server Pages, Cold Fusion, etc.) in all probability it would be forced to make changes to the file extension.

As well as avoiding dependencies on backend server scripting technologies it is also desirable that resources which are generated from a database have persistent and, ideally, static-looking URIs. The URI of the E-Envoy’s paper mentioned above is:

http://www.govtalk.gov.uk/rfc/rfc_document.asp?docnum=505

As well as the dependency on ASP technology, this resource appears to be number 505 in the backend database and is identified in the database by the name docnum. If the database is reorganised, so that entries get a new unique identifier or the names of fields in the database change, this URI will no longer be valid (or, even worse, it could point to a different resource).
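One way to picture the alternative is a small lookup layer that sits between a stable public path and whatever the backend currently requires. The following is a minimal sketch only: the public path /documents/e-gif, the table name BACKEND_LOCATIONS and the function resolve_backend are hypothetical names introduced for illustration, and nothing here describes how the E-Envoy's site is actually implemented.

```python
from typing import Optional

# Hypothetical table: stable public path -> current backend location.
# Only this table needs editing if the database or scripting technology changes;
# the public path that users bookmark and cite stays the same.
BACKEND_LOCATIONS = {
    "/documents/e-gif": "/rfc/rfc_document.asp?docnum=505",
}


def resolve_backend(public_path: str) -> Optional[str]:
    """Return the current backend location for a stable public path, or None."""
    # Treat "/documents/e-gif" and "/documents/e-gif/" as the same resource.
    key = public_path.rstrip("/") or "/"
    return BACKEND_LOCATIONS.get(key)
```

If the database were reorganised so that the document received a new identifier, only the right-hand side of the table entry would change.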

At least the URI mentioned above did not include the name of the database software (SQLServer, MS Access, or whatever). URIs should aim to be independent of the application which provides a service. Recent surveys of search engines used to index Web sites in UK Universities [4] show how often software can change. All too often institutions provide a URI for a search facility in the form http://www.foo.ac.uk/htdig/. It is interesting to note that a number of institutions which have changed their search engine have moved to a neutral URI, such as http://www.foo.ac.uk/search/.

It is also advisable to avoid the use of cgi-bin directory names in URIs. W3C’s document on “Cool URIs Don’t Change” [1] gives the following example:

NSF Online Documents: http://www.nsf.gov/cgi-bin/pubsys/browser/odbrowse.pl

Although this is meant to be the main location for looking for documents, the URI does not look as if it will be particularly persistent. If they wish to use an alternative mechanism for managing access to their documents they will probably have to change the name of the odbrowse.pl script (a Perl script). They may also have to remove the cgi-bin directory name, as a service which currently makes use of CGI technology could be replaced by an alternative technology (e.g. PHP scripting, Apache modules, etc.).

In this example we can see the danger of file name suffixes - the functionality of the odbrowse.pl script could be replaced by a script in another language. In general there is a danger with file name suffixes - today’s popular file format may become tomorrow’s legacy format.

Ideally you should avoid providing a link directly to a proprietary file format. For example if you wish to provide access to a PowerPoint file you could point to a HTML resource which has a link to the PowerPoint file and to a HTML derivative of the file. The use of a HTML intermediary allows you to provide additional information, such as the file size (users at home may not wish to download a large file), the PowerPoint version (there is no point in downloading a PowerPoint 2000 file if you only have PowerPoint 4 locally), etc.

The dangers of file suffixes which reflect a particular scripting language or a non-native, proprietary file format have been mentioned. It should also be pointed out that there are dangers with HTML files too!

Unfortunately there is no consistency in whether HTML files have the extension .html or .htm. Although links can be checked there can be a danger if users write down a URI.

One solution to this problem is to make use of directories and the directory default name for resources. For example this article has the URI http://www.ariadne.ac.uk/issue31/web-focus/. This has the advantage of being less prone to errors than http://www.ariadne.ac.uk/issue31/web-focus/article.html (or …/article.htm). This form also has the advantage of being shorter. It should be noted that it will be important to refer only to the name http://www.ariadne.ac.uk/issue31/web-focus/ and not to http://www.ariadne.ac.uk/issue31/web-focus/intro.html, as the latter is not only longer and potentially prone to errors, it is also reliant on a server configuration option which uses intro.html as the default file name for directories. Other servers sometimes use other names, such as welcome.html or index.html.
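The advice to cite only the directory form can be checked mechanically, for instance when auditing the links on a site. The sketch below strips a trailing default file name from a URI so that only the directory form remains. The set of default names is an assumption (it depends entirely on how a given server is configured), and directory_uri is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of directory default names; the real list is a server configuration choice.
DEFAULT_NAMES = {"index.html", "index.htm", "intro.html", "welcome.html"}


def directory_uri(uri: str) -> str:
    """Strip a trailing default file name so the URI ends at the directory."""
    parts = urlsplit(uri)
    # Split the path into the directory part and the final file name.
    directory, _, filename = parts.path.rpartition("/")
    if filename in DEFAULT_NAMES:
        parts = parts._replace(path=directory + "/")
    return urlunsplit(parts)
```

A link audit could compare each link with directory_uri(link) and flag any that differ, since those links depend on the server's current default file name.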

We have seen that it is sensible to avoid use of file name suffixes for scripting languages, as this provides extra flexibility. It may not be obvious that we may wish to have the flexibility to migrate away from HTML files!

Will HTML still be around in 20 years’ time? Possibly. Will a replacement for HTML be around in 20 years’ time? Certainly, I would say. XHTML is W3C’s current preferred version of HTML. It has the advantage of being an XML application. We are likely to see much greater use of XML in the future, and not only at the server end (with XML resources transformed into HTML for delivery to the user). At the user end we can expect to see XML formats such as SMIL for synchronised multimedia [4], SVG for vector graphics [5], and MathML and CML [6] [7] for use in specialist disciplines. In order to integrate these different applications in a reliable way it will probably be necessary to make use of a modular form of XHTML.

As well as XHTML we could also see other XML applications being delivered to a Web browser and displayed through use of CSS. It is already possible to do this in Internet Explorer as can be seen if you use a recent version of this browser to view, say, the RSS news feed provided by W3C’s QA group [8].

How can you possibly avoid using a .html suffix, even if this could be of some use in the long term? One way is to make use of directory defaults as described previously. Although this has some advantages, it is probably not practical to store every HTML resource in its own directory.

An alternative is to make use of content negotiation. In simple terms you create a resource labelled foo. When a user follows a link to foo the server will say something along the lines of “I’ve got foo.html and foo.xml. I see your browser supports XML, so I’ll give you the XML version.”
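The server's side of that conversation can be sketched in a few lines. The following is a simplified illustration, not a description of how Apache or any other server implements negotiation: it reads the media types and q (quality) values from the browser's HTTP Accept header and picks the stored variant the browser rates highest. The function names parse_accept and choose_variant are hypothetical, and the sketch ignores details of real negotiation such as server-side quality values and the precedence of specific over wildcard ranges.

```python
def parse_accept(header):
    """Parse an Accept header into a dict of {media_type: quality}."""
    prefs = {}
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs[media_type] = quality
    return prefs


def choose_variant(accept_header, variants):
    """Pick the file whose media type the browser prefers most.

    variants maps media type -> file name, listed in the server's own
    order of preference; ties go to the earlier entry.
    """
    prefs = parse_accept(accept_header)
    best_file, best_quality = None, 0.0
    for media_type, filename in variants.items():
        wildcard = media_type.split("/")[0] + "/*"
        # Exact match first, then "type/*", then "*/*".
        quality = prefs.get(media_type, prefs.get(wildcard, prefs.get("*/*", 0.0)))
        if quality > best_quality:
            best_file, best_quality = filename, quality
    return best_file


# The two variants of the resource "foo" from the example above.
FOO_VARIANTS = {"application/xml": "foo.xml", "text/html": "foo.html"}
```

A browser sending "application/xml, text/html;q=0.5" would be given foo.xml, while one sending only "text/html" would be given foo.html; in both cases the link itself points simply to foo.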

The W3C use this approach to provide alternate versions of images. Examining the source of the W3C home page you will see that the image tag has the form:

<img alt="World Wide Web Consortium (W3C)" … src="Icons/w3c_main" />

Further investigation shows that W3C provide a GIF image (at http://www.w3.org/Icons/w3c_main.gif) and a PNG version of the image (at http://www.w3.org/Icons/w3c_main.png). In this example W3C have future-proofed themselves against the need to reorganise their Web site if, say, they are forced to remove GIF images (as could happen, since use of GIF requires that you use a properly licensed graphical tool to create GIF images or that you pay a licence fee [9]).

As well as using this approach with images, it can also be applied to HTML resources. For example W3C’s HyperText Markup Language Activity Statement [10] has the URL http://www.w3.org/MarkUp/Activity. Further investigation reveals that this is not a directory (http://www.w3.org/MarkUp/Activity/ does not exist). In fact the physical resource is located at http://www.w3.org/MarkUp/Activity.html. W3C do not point directly to this file - instead they make use of content negotiation to access the file. So if they decide to provide access to, say, a http://www.w3.org/MarkUp/Activity.xhtml or http://www.w3.org/MarkUp/Activity.xml or even (although unlikely on the W3C Web site) http://www.w3.org/MarkUp/Activity.pdf they can do so without breaking existing URIs.

It should be noted that this approach can also be taken to making different language versions of resources available. So a link to http://www.foo.ac.uk/prospectus could be used to serve http://www.foo.ac.uk/prospectus.en.html (the default English version) or a French version at http://www.foo.ac.uk/prospectus.fr.html to a user whose browser was configured to use the French language as the default (see [11]).
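As a rough illustration of the language case, the sketch below picks a prospectus file from the browser's HTTP Accept-Language header. It is a simplified model under stated assumptions rather than an account of any server's behaviour: choose_language and prospectus_file are hypothetical names, only English and French versions are assumed to exist, and a regional tag such as fr-CA is simply reduced to its primary language.

```python
def choose_language(accept_language, available, default="en"):
    """Return the best available language code for an Accept-Language header."""
    ranked = []
    for position, item in enumerate(accept_language.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        name, _, value = params.partition("=")
        if name.strip().lower() == "q":
            try:
                quality = float(value)
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by order of appearance in the header.
        ranked.append((-quality, position, tag))
    for neg_quality, _, tag in sorted(ranked):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        primary = tag.split("-")[0]  # e.g. "fr-ca" falls back to "fr"
        for candidate in (tag, primary):
            if candidate in available:
                return candidate
    return default


def prospectus_file(accept_language):
    """Map a request for /prospectus to the language variant to serve."""
    language = choose_language(accept_language, {"en", "fr"})
    return f"prospectus.{language}.html"
```

A browser configured for French might send "fr-FR, fr;q=0.9, en;q=0.5" and would receive prospectus.fr.html; a browser asking for a language which is not available falls back to the English default.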

Details of using content negotiation will not be included here. Further information on using content negotiation in Apache is available [12] [13].

Technical Issues

It could be argued that getting the URI structure correct in the first place is not really an issue, as you can always use redirects to point users to a new structure.

Redirects can be achieved in a number of ways. Within a HTML page authors can add a <meta http-equiv="refresh" content="10;URL=http://www.foo.ac.uk/"> tag to their page. In many (but not all) browsers this will redirect to the specified URL after the specified time (10 seconds in this case).

It should be noted that the HTML 4.0 specification states that “Some user agents support the use of META to refresh the current page after a specified number of seconds, with the option of replacing it by a different URI. Authors should not use this technique to forward users to different pages, as this makes the page inaccessible to some users. Instead, automatic page forwarding should be done using server-side redirects.” [14].

Another client-side alternative is to make use of JavaScript code such as:

<body onload="document.location='http://www.foo.ac.uk/'">

However both of these approaches have their limitations: this is not a scalable solution which can be used for hundreds of files; a redirect time of 0 will cause the back button to stop working; they are browser dependent or require use of JavaScript; they may not work for certain user agents, such as indexing robots. Further comments are given at [15].

Server redirects would appear to provide a more useful solution. A redirect is a Web server directive (command) which maps one URL into another. The new URL is returned to the browser, which then attempts to fetch the resource again from the new address. As an example the University of Southern California have used this approach to provide shorter or more logical URLs on their Web site [16].
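The exchange can be sketched as follows. This is an illustrative model only: the MOVED table and the respond function are hypothetical, and in practice the mapping would be held in the Web server's configuration rather than in application code. The sketch returns the HTTP status code 301 (Moved Permanently) together with a Location header carrying the new address, which is the information a browser needs in order to re-request the resource.

```python
# Hypothetical mapping from old paths to new ones, using the search example
# from earlier in this article.
MOVED = {
    "/htdig/": "/search/",
}

SITE = "http://www.foo.ac.uk"


def respond(path):
    """Return (status, headers) in the way a redirecting server would."""
    new_path = MOVED.get(path)
    if new_path is not None:
        # 301 tells the browser the move is permanent;
        # the Location header carries the new address.
        return 301, {"Location": SITE + new_path}
    # Simplification: every path which has not moved is treated as found.
    return 200, {}
```

A request for /htdig/ therefore yields a 301 response pointing at http://www.foo.ac.uk/search/, and the browser makes a second request to that address.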

The Apache documentation describes how to provide server redirects [17] and a third party tutorial provides further information [18]. A third-party product is available which provides similar functionality for the IIS server [19].

So does this provide an answer to the problems of reorganising our Web site? The answer, unfortunately, is no - or more correctly, it will solve some problems but may cause some new ones.

Server redirects work by sending the address of the new location to the browser. The browser then sends a request for the new location back to the server. Although most modern browsers will support this, there is a danger that other user agents may not.

Also use of server redirects means that a definition of the structure of the Web site is held in a server configuration file. A tool which processes the underlying file structure will not be aware of the Web view and so will provide incorrect information. This could affect indexing tools, auditing tools, etc.

A related issue is that it will be more difficult to mirror the Web site. If a Web site is mirrored, the mirroring of the redirect information may be difficult to manage (although this depends on the way in which the Web site is mirrored).

Conclusions

There is probably no simple technological fix to providing stable URIs for resources. It will therefore be necessary to go through a process of developing a URI naming policy which will address the issues mentioned in this article.

When formulating a policy it may be helpful to see the approaches taken by other institutions. A search on Google for “url naming policies” (and variations) obtained information from the Universities of Bath [20] and Oxford [21] and the University of Vermont [22].

The final thoughts should be left to the W3C. In their policy on URI persistency they provide a brief summary of their approaches to persistent URIs and describe what will happen if the W3C ceases to exist [23]. A similar statement has been made by SWAG (Semantic Web Agreement Group) [24].

Long term persistency of URIs will become of even greater importance in the future. It will not only be users who are frustrated by 404 error messages, but, unless this problem is addressed, automated Web services will fail to work, and scholarly resources and legal documents will be lost. The clever technological fix (PURLs, URNs, etc.) has clearly failed to deliver a solution for mainstream Web sites. The onus is on ourselves, as providers of Web services, to take the problem of persistency seriously.

References

Author Details

Brian Kelly
UK Web Focus
UKOLN
University of Bath
Bath
BA2 7AY

Email: b.kelly@ukoln.ac.uk

Brian Kelly is UK Web Focus. He works for UKOLN, which is based at the University of Bath.