Persistent identifiers (PIs) are simply maintainable identifiers that allow us to refer to a digital object – a file or set of files, such as an e-print (article, paper or report), an image or an installation file for a piece of software. The only interesting persistent identifiers are also persistently actionable (that is, you can "click" them); however, unlike a simple hyperlink, persistent identifiers are supposed to continue to provide access to the resource, even when it moves to other servers or even to other organisations. A digital object may be moved, removed or renamed for many reasons. This article looks at the current landscape of persistent identifiers, describes several current services, and examines the theoretical background behind their structure and use. Issues are raised of likely relevance to anybody who is considering deployment of a standard for their own purposes.
URLs are often implemented using the server's filesystem as a kind of lookup database: for example, http://www.ukoln.ac.uk/index.html is a file called 'index.html' that is situated in the root directory of the Web server running on port 80 of the machine responding to www.ukoln.ac.uk. Because it was very simple to get up and running quickly, many early servers tended to refer to digital objects in this way. For example, while http://www.ukoln.ac.uk/index.html means "The digital object that is provided when you ask the Web server (running on port 80 of the IP that is currently returned by a DNS server for the domain 'www.ukoln.ac.uk') about the string '/index.html'", for many servers this has meant "The file named 'index.html' at the root level". There are certain advantages to this approach: for example, a clear mapping between the filesystem and the structure of the Web site can make it easier for maintainers to understand how the site is structured, but with no other mechanism in place, if someone removes the file the link will break – easy come, easy go.
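The filesystem-as-lookup-database pattern can be sketched in a few lines; the document root and path handling below are illustrative, not any particular server's implementation:

```python
import posixpath

# Hypothetical document root, as used by a classic static-file Web server.
DOCROOT = "/var/www"

def url_path_to_file(url_path: str) -> str:
    """Map a request path such as '/index.html' directly onto the
    document root -- the 'filesystem as lookup database' approach."""
    # Normalise to stop '..' segments from escaping the document root.
    clean = posixpath.normpath(posixpath.join("/", url_path))
    return DOCROOT + clean

print(url_path_to_file("/index.html"))     # /var/www/index.html
print(url_path_to_file("/../etc/passwd"))  # /var/www/etc/passwd (confined)
```

The simplicity is exactly the attraction, and exactly the fragility: delete or rename the file, and every link to it breaks.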
Recognition of the vulnerabilities associated with this design approach is changing the way that URLs are being implemented. To take another example, many Web applications assign different syntax to elements within a URL. The technology-specific URL extension applied to Web objects (.asp, .php3, .jsp) is often mapped by an interpreter (such as Apache's mod_rewrite) in a way that hides the server's internal filesystem structure. For example, a piece of blogging software may be designed to apply simple rewrite rules to read a semantically meaningful URL such as 'http://joansmith.com/2007/11/30' as a request for a listing of any blog postings written by Joan Smith on the 30th of November, 2007. Replacing the software with a new version will not break these links, assuming that the new version is also capable of interpreting the URL appropriately.
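A rewrite-style dispatch of the kind described might look like the following sketch (the route pattern and handler are hypothetical; real blogging software would return rendered pages rather than strings):

```python
import re

# Route table: each entry maps a URL pattern to a handler. The date-based
# pattern mirrors the 'http://joansmith.com/2007/11/30' example.
ROUTES = [
    (re.compile(r"^/(\d{4})/(\d{2})/(\d{2})$"),
     lambda y, m, d: f"list_posts(year={y}, month={m}, day={d})"),
]

def dispatch(path: str) -> str:
    """Interpret a semantically meaningful URL path, hiding any
    relationship to the server's internal filesystem structure."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(*match.groups())
    return "404 Not Found"

print(dispatch("/2007/11/30"))  # list_posts(year=2007, month=11, day=30)
```

Because the URL names the request rather than a file, a replacement implementation only needs to honour the same patterns for old links to keep working.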
While this goes a long way to reduce identifier fragility within a server, all actionable identifiers are vulnerable to a different kind of breakage that is related to the name of the server itself. For example, if Joan's blog becomes very popular for news on a given topic, she might choose to allow others with similar interests to post articles to the blog. As a result of the change of focus from a personal to a topic-driven site, it might be appropriate for her to change the site's hostname to reflect the topic. In another example, if a Web site is owned by a company, it is very likely that the name will be changed on occasion for commercial reasons, perhaps as the result of a corporate take-over, merger or rebranding. The content of the site may change for similar reasons, or for any number of other reasons related to the roles that Web-based publishing plays in our lives, and in the marketing and image of individuals and enterprises. Such URLs cannot be made persistent after-the-fact, but going forward, publishing URLs under hostnames that won't be under pressure to change (e.g., free of brands or other meaning) will reduce the chance of this kind of breakage.
Should normal server administration, including name changes, break identifiers that you want to be persistent? The opinion of the persistent identifier community is that they should not.
Most clickable identifiers are brittle. This is to be expected, because persistence takes planning and monitoring, which cannot feasibly be applied to every object and rarely appear high on the list of most organisations' priorities. What is easiest to set up is often not what lasts. Organisations need to plan for persistence, be aware of the pitfalls, and stay on top of their commitments. Questions that one may ask oneself during this planning process include: 'Which identifier parts are vulnerable to commercial/political, technical, and organisational change?' 'To which identifiers are we committed?' and 'How are they shaping user expectations about those to which we are not committed?' As time moves on, few institutions keep a record of all the URLs they have published, and all too often a massive technical or structural reorganisation causes large numbers of URLs simply to be abandoned.
The above is unpalatable; as Tim Berners-Lee once put it, Cool URIs don't change. The trouble is that many organisations today, for a variety of reasons, fail to decide which of their URLs are 'cool' and to develop a strategy that ensures that they remain so.
URIs vs. URLs
URIs, or Uniform Resource Identifiers, post-date the URL, or Uniform Resource Locator, by three years. URLs describe the 'street address' of a resource – where it can be found. The URI, on the other hand, can describe 'a name, a locator, or both a name and a locator'.
The technical background is the following: in order to retrieve an object, the browser needs to communicate with a service that is able to provide the location of that object. If the link turns out to be broken, this results in a '404 Not Found' error message. There are, of course, many subtler failure cases, such as the possibility that the wrong object is retrieved.
An important strategy to help reduce the danger of failing to retrieve an object is to add a layer of indirection between the browser and the target object. The persistent identifier itself provides some form of description of the digital object, rather than referring to a specific instance (a copy of the object). Indirect identifiers require a resolver that forwards you to a current copy of the object. Describing the object to a resolver permits the browser to find a specific instance of it at the last minute; for example, indirection is a common fix for servers that rely on the previously mentioned use of the filesystem as lookup database. The resolver service, which could be as simple as a set of rewrite rules on a server or as complex as a global network of special purpose servers (e.g. DNS, Handle), is intended to redirect the browser to an appropriate or current copy of the object. Indirection is often invisible to the user.
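At its core, a resolver of this kind reduces to a lookup table plus an HTTP redirect; the identifier and target below are invented for illustration:

```python
# Hypothetical resolver table mapping persistent identifiers to the
# current location of a copy of each object. In a real deployment this
# would be a maintained database rather than a constant.
RESOLVER_TABLE = {
    "example:article-42": "http://archive.example.org/2024/article-42.pdf",
}

def resolve(identifier: str):
    """Return an HTTP status and redirect target for an identifier.
    '302 Found' forwards the browser to a current copy of the object;
    the indirection is invisible to the user."""
    target = RESOLVER_TABLE.get(identifier)
    if target is None:
        return 404, None
    return 302, target

status, location = resolve("example:article-42")
print(status, location)
```

When the object moves, only the table entry changes; the published identifier does not.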
Link resolvers are useful for a multitude of reasons. They may recognise the user's geographical location or access rights for that object and forward the user to the appropriate version of the digital object. Furthermore, there may be any number of secondary services related to that object, to which a URI resolver may be able to link, such as bibliographic information, production credits, content management services, rights management and licence administration.
Much of this subsidiary functionality, though interesting in its own right, should not be seen as directly related to the primary subject of this article, namely persistence.
Several issues are particularly relevant to design and adoption of persistent identifier systems:
The actionability of the persistent identifier – can it be used directly? Does it do something when you copy it into a browser location bar and press 'enter', or is it necessary to copy it into a resolver service in order to retrieve a link to the digital object in question?
The scope of the identifier standard – does it link only to digital objects or can it be used more widely, for example as a semantically valid way of referring to physical objects within a given domain or description language?
The architecture and infrastructure underlying the standard are of relevance as regards issues such as reliability, maintenance cost, and risk.
The status of the standard – is it a formal standard, having undergone the process of standardisation, a de facto standard or an ad hoc approach?
In the following section we examine several standards on the basis of these metrics, in order to gain an overview of the current persistent identifier landscape.
There are several standards currently at a mature stage of development:
Many of these are described in 'Request for Comments' (RFC) documents, a commonly used means of proposing an Internet standard. Where this is the case, the RFC in question has been linked from within the text. Others are described via other publishing strategies.
There are also many 'informal' standards for content-based persistent identification. Content-based identifiers of various kinds, such as message digests and characteristic word sequences for which to search or ed2k hashes used for peer-to-peer object discovery purposes on filesharing networks, can permit a digital object to be characterised and perhaps retrieved by searching and matching for the characteristic 'fingerprint' of the document. These schemes will not be discussed here, but it is worth contrasting the following standard set with the 'fingerprinting' – that is, feature extraction and application to identify a digital object – approaches that are sometimes applied with success elsewhere.
The dates given here are typically either the date of the first full specification, or the date on which the service became available, rather than the origin of the idea. Those looking for a more in-depth discussion of many of these standards may wish to start with the report on persistent identifiers authored in 2006 by Hans-Werner Hilse and Jochen Kothe.
The URN (Uniform Resource Name) was fully specified in May 1997. Its requirements were originally defined in RFC 1737 and the specification was published in RFC 2141. The use of a URN does not necessarily imply that the resource to which it refers is available.
URNs are designed to describe an identity rather than a location; for example, a URN may contain an ISBN (International Standard Book Number, used as a unique commercial book identifier). The following URN describes a book called The Computational Nature of Language Learning and Evolution, written by Partha Niyogi.
URN encodings exist for many types of object: the ISSN (International Standard Serial Number) for periodicals, the ISAN (International Standard Audiovisual Number) for films and video, Internet Engineering Task Force Requests for Comments (RFCs), and so on. Using the OID (Object IDentifier) system, it is even possible for a URN to reference Great Britain.
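URN syntax (RFC 2141) is uniform enough – 'urn:', a namespace identifier, then a namespace-specific string – that splitting one into its parts is trivial; the ISBN below is purely illustrative:

```python
def parse_urn(urn: str):
    """Split a URN into its namespace identifier (NID) and
    namespace-specific string (NSS), following RFC 2141 syntax:
    urn:<NID>:<NSS>."""
    scheme, nid, nss = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN")
    # The scheme and NID are case-insensitive; the NSS may not be.
    return nid.lower(), nss

print(parse_urn("urn:isbn:0451450523"))  # ('isbn', '0451450523')
print(parse_urn("urn:ietf:rfc:2141"))    # ('ietf', 'rfc:2141')
```

Note that everything after the second colon is opaque to the URN layer itself; interpreting the NSS is the business of the namespace (ISBN, ISSN, OID, and so on).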
URN namespace assignments are handled via IANA, the Internet Assigned Numbers Authority. URN namespace definition mechanisms are described in RFC 3406, which makes provision for namespace registration potentially to incur a cost.
The refined specification for National Bibliography Numbers (NBNs) was published in 2001. The NBN is a URN namespace used solely by national libraries, in order to identify deposited publications which lack an identifier, or to reference descriptive metadata (cataloguing) that describe the resources (RFC3188). These can be used either for objects with a digital representation, or for objects that are solely physical, in which case available bibliographic data is provided instead. NBNs are a fall-back mechanism; if an alternative identifier is available such as an ISBN, it should be used instead. If not, an NBN is assigned.
NBNs are encoded within a URN according to the encoding described in RFC 3188.
The Digital Object Identifier (DOI) was introduced to the public in 1998. The DOI is an indirect identifier for electronic documents based on Handle resolvers. According to the International DOI Foundation (IDF), formed in October 1997 to be responsible for governance of the DOI System, it is a 'mechanism for permanent identification of digital content'.
It is primarily applied to electronic documents rather than physical objects. It has global scope and a single centralised management system. It does not replace alternative systems such as the ISBN, but is designed to complement them. DOIs consist of two sections separated by a forward slash: a prefix, beginning '10.' to mark the string as a DOI and continuing with a number that identifies the registrant (typically the publisher), and a suffix that identifies the document itself, in the form:
The suffix following the forward slash is either automatically generated by the agency registering the DOI, or is contributed by the registrant. In practice, the suffix is limited to characters that can be encoded within a URL. DOIs are not case-sensitive.
In general, no meaning should be inferred from the content of the suffix beyond its use as a unique ID. DOIs may be resolved via the Handle system. Although DOIs are designed for Unicode (ISO/IEC 10646), the required encoding is UTF-8, because the Handle.net resolver uses UTF-8. The DOI is formalised as ANSI/NISO Z39.84-2005, and is currently in the later stages of the ISO certification process.
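The prefix/suffix structure, and the case-insensitive comparison it implies, can be sketched as follows ('10.1000/182' is the DOI commonly cited for the DOI Handbook itself):

```python
def split_doi(doi: str):
    """Split a DOI into its prefix (identifying the registrant) and
    suffix (identifying the document)."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10."):
        raise ValueError("not a DOI")
    return prefix, suffix

def same_doi(a: str, b: str) -> bool:
    """DOIs are not case-sensitive, so normalise before comparing."""
    return a.lower() == b.lower()

print(split_doi("10.1000/182"))             # ('10.1000', '182')
print(same_doi("10.1000/ABC", "10.1000/abc"))  # True
```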
DOI registration incurs a cost, both for membership and for the registration of each document; as such, it may in some situations be considered preferable to make use of the Handle.net resolver without the use of DOIs.
A piece by Tim Berners-Lee and others, 'Creating a Science of the Web', provides a real-world example. It has been given the DOI:
This DOI may be resolved by going to the following URL (dx.doi.org functions as a DOI resolver, an implementation of the Handle system):
The user should then be forwarded to the appropriate page.
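Making a DOI actionable is then a matter of prepending the proxy's address and percent-encoding any suffix characters that are not URL-safe; fetching the resulting URL with any HTTP client yields a redirect to the registered location:

```python
from urllib.parse import quote

def doi_to_resolver_url(doi: str) -> str:
    """Build an actionable URL for a DOI via the dx.doi.org proxy.
    Suffix characters outside the URL-safe set are percent-encoded,
    as required when carrying a DOI inside a URL."""
    return "http://dx.doi.org/" + quote(doi, safe="/")

print(doi_to_resolver_url("10.1000/182"))
# http://dx.doi.org/10.1000/182
```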
Persistent Uniform Resource Locators (PURLs), proposed in 1995 and developed by OCLC, are actionable identifiers. A PURL consists of a URL; instead of pointing directly to the location of a digital object, the PURL points to a resolver, which looks up the appropriate URL for that resource and returns it to the client as an HTTP redirect, which then proceeds as normal to retrieve the resource. PURLs are compatible with other document identification standards such as the URN. In this sense, PURLs are sometimes described as an interim solution prior to the widespread use of URNs.
PURL is primarily linked to OCLC, which continues to run the oldest PURL resolver in existence. It has, however, been strongly influenced by the active participation of OCLC's Office of Research in the Internet Engineering Task Force Uniform Resource Identifier working groups. There is no cost associated with the use of PURLs.
The Handle system, first implemented in 1994, was published as an RFC in November 2003. It is primarily used as a DOI resolver (see example above). In practice, it is a distributed, general-purpose means of identifying and resolving identifiers. Both master and mirror sites are administered by the Corporation for National Research Initiatives (CNRI), and the distributed nature of the service ensures reliable availability. An overview of the Handle system is available from Lannom and from the relevant RFCs. The Handle.net system may also be used separately from the DOI system; the underlying software package may be downloaded and installed for institutional use.
OpenURL, dating from 2000, contains resource metadata encoded within a URL and is designed to support mediated linking between information resources and library services. The OpenURL contains various metadata elements; the resolver extracts them, locates appropriate services and returns this information. It is sometimes described as a metadata transport protocol.
This standard is not primarily designed as a persistent identifier/resolver. There are many other issues that make an institutional resolver service useful – such as the problem of access rights and the need to find a source that the user has permission to read. Moreover, although of less significance in the matter of text documents, issues of bandwidth and efficiency would make it desirable to use a local mirror. This is still very much an issue in multimedia resource or software distribution.
An OpenURL is formed of a prefix (a valid HTTP URL linking to the user's institutional OpenURL resolver) and a suffix, which is simply a query string encoded according to the URI RFCs, e.g. RFC 3986, which obsoletes RFC 2396, against which OpenURL was initially defined. Here is an example of an OpenURL:
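Because the suffix is an ordinary query string, assembling an OpenURL needs nothing more than standard URL encoding. The resolver address below is hypothetical, and the key/value pairs are a simplified, version-0.1-style payload:

```python
from urllib.parse import urlencode

# Hypothetical institutional resolver; each institution runs (or
# subscribes to) its own.
RESOLVER_BASE = "http://resolver.example.edu/openurl"

def make_openurl(**metadata) -> str:
    """Assemble an OpenURL: a resolver prefix plus a percent-encoded
    query string of metadata key/value pairs."""
    return RESOLVER_BASE + "?" + urlencode(metadata)

print(make_openurl(genre="journal", issn="0942-4962"))
# http://resolver.example.edu/openurl?genre=journal&issn=0942-4962
```

The same metadata sent to a different institution's resolver yields different, locally appropriate services – which is precisely the point of the design.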
This example describes the Multimedia Systems journal (ISSN 0942-4962), and is resolved by the SpringerLink resolver. The following example describes a book, Understanding Search Engines. Note that, whilst these examples both include a unique identifier (the ISSN and ISBN respectively), OpenURLs can also validly contain only part of this information (for example, title or author).
OpenURL was initially drafted in 2000, then formalised as NISO OpenURL Version 0.1 in 2003. Since then, it has been developed into ANSI/NISO standard Z39.88. The standard is maintained by OCLC.
The Archival Resource Key (ARK), dating from March 2001, is a URL scheme developed at the US National Library of Medicine and maintained by the California Digital Library. ARKs are designed to identify objects of any type – both digital and physical objects.
In general, ARK syntax is of the form (brackets indicate [optional] elements):

[http://NMA/]ark:/NAAN/Name[Qualifier]
The Name Assigning Authority (NAA) refers to the organisation responsible for this particular object naming. This information is encoded into each ARK via the Name Assigning Authority Number (NAAN). The NAAN is a unique identifier for this organisation, registered in a manner similar to URN namespaces. The Name Mapping Authority (NMA) is the current service provider (e.g., the Web site) responsible for the object itself. NAAs often run their own NMA, although this is not mandatory. While the ARK scheme encourages semantically opaque identifiers for core objects, it tolerates branding in the hostname (NMA); however, recognising that this does not always play well with persistence, in no case does the hostname participate in comparing two ARKs for identity. Semantic extension (Qualifier) is also tolerated, as this often names transient service access points. An example ARK is given as follows:
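A sketch of ARK parsing under the structure just described; note how comparing two ARKs for identity deliberately ignores the hostname (the NMA). The NAAN and name used here, '12025/654xz321', follow the illustrative example in the ARK specification:

```python
import re

ARK_PATTERN = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?"  # optional Name Mapping Authority
    r"ark:/(?P<naan>\d+)"              # Name Assigning Authority Number
    r"/(?P<name>[^/?]+)"               # assigned name
    r"(?P<qualifier>.*)$"              # optional semantic extension
)

def parse_ark(ark: str) -> dict:
    """Decompose an ARK into NMA, NAAN, name and qualifier."""
    match = ARK_PATTERN.match(ark)
    if not match:
        raise ValueError("not an ARK")
    return match.groupdict()

def same_ark(a: str, b: str) -> bool:
    """Two ARKs identify the same object if NAAN and name match;
    the hostname never participates in the comparison."""
    pa, pb = parse_ark(a), parse_ark(b)
    return (pa["naan"], pa["name"]) == (pb["naan"], pb["name"])

print(parse_ark("http://example.org/ark:/12025/654xz321"))
```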
Unlike an ordinary URL, an ARK is used to retrieve three things: the object itself, its metadata (by appending a single '?'), and a commitment statement from its current provider (by appending '??'). An ARK may retrieve various types of metadata, and is metadata-format agnostic. One metadata format with which its use is closely related, however, is the Electronic Resource Citation (defined in draft RFC draft-kunze-erc-01), which contains Kernel metadata, defined in the same draft document. The full ARK specification is also available.
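The three retrievals can be expressed mechanically: given an ARK, the metadata and commitment-statement requests are formed by appending '?' and '??':

```python
def ark_inflections(ark: str) -> dict:
    """The ARK scheme's three retrievals: the object itself, its
    metadata ('?'), and the provider's commitment statement ('??')."""
    return {
        "object": ark,
        "metadata": ark + "?",
        "commitment": ark + "??",
    }

print(ark_inflections("ark:/12025/654xz321"))
```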
WebCite, dating from 2003, and a member of the International Internet Preservation Consortium, takes a slightly different approach to preservation, starting from the observation that systems such as DOI require a unique identifier to be assigned for each new revision of the document, and that therefore they are relatively ineffectual when used on resources that are less stable and are not formally published. First published in 1998 in an article in the British Medical Journal that discussed the possibility of a Web-based citation index, the idea was briefly tested, laid aside for several years and revived in 2003 when it was noted that obsolescence and unavailability of cited Web references remained a serious issue. WebCite is not precisely a persistent identifier system. Rather, it offers a form of on-demand archiving of online resources, so that the cached copy may be cited with a greater likelihood that it remains available in future. For this reason, it encounters a different set of problems; for example, some information cannot be cached because it is not available to the service, perhaps because it is behind a firewall or available only to those with a subscription to a certain digital library. Site maintainers may additionally choose to write a robots.txt file, disallowing the use of Internet robots or spiders on their Web sites for various reasons.
There are many practical differences between the above systems and specifications. Inevitably, the story of persistent identifiers subsumes that of semantic identifiers, such as the URN and the DOI. As a result, many of the distinctions between these standards have to do with the aim and scope of the relevant identifier standards rather than with the resolver in question.
Some systems promote the use of opaque identifiers over the use of semantically meaningful strings. There are legitimate reasons to prefer the use of opaque terms. Firstly, branding and nomenclature are not static, but evolve along with the organisation, topic and task. They are also culturally and socially situated. Hence, an opaque identifier removes one motivation for name changes – i.e. rebranding or redefinition (semantic shift and drift) – but does so at the cost of a certain amount of convenience for the user. There is a trade-off between the ability to track down a resource, should the persistent identifier fail to resolve, from the semantic information available in the string, and the increased likelihood that a string containing meaningful semantics will at some point be altered.
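Minting opaque identifiers is straightforward; the sketch below uses a NOID-style 'betanumeric' alphabet (digits plus consonants) so that minted strings never spell words or acquire unwanted meaning:

```python
import secrets

# Betanumeric alphabet, as used by NOID-style minters: digits plus
# consonants, omitting vowels (no accidental words) and ambiguous
# letters such as 'l'.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"

def mint_opaque_id(length: int = 8) -> str:
    """Mint a semantically opaque identifier: nothing in it invites
    rebranding, and nothing in it can drift in meaning."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(mint_opaque_id())  # e.g. 'q3tk70dm'
```

The trade-off discussed above is visible here: such a string carries no clue by which a user could track down the resource if resolution fails, but equally gives no one a reason to rename it.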
The ARK system lies at one extreme of this spectrum (at least for the core object identifier), with the OpenURL at the other.
The ARK and PURL (indeed, even the URL) specifications describe a system that anybody could host. Similarly, the Handle server can be locally installed in the same manner as any other Web server. By comparison, the DOI identifier has a number of associated fees, and is centrally administered by the International DOI Foundation. Fees are often associated with reliability and authority: there is a perception that those willing to invest in a system have made a financial commitment to that organisation. It is not unreasonable to assume that a financial commitment reflects a considered organisational decision, but the possibility remains that the use of DOIs is simply seen as part of 'the cost of doing business'.
Increased flexibility in an identifier standard comes at the cost of complexity. The URN is a superclass of many different namespaces and encodings. It also results in some philosophical (epistemological) problems; what does a URN describing Britain have in common with a URN describing a magazine article? A URN can function as a surrogate for an ambiguous verbal description, but can it be considered as a candidate for a persistent identifier scheme? The above also raises the question: if the recognition of the four British home nations results in Scotland electing to leave the United Kingdom of Great Britain and Northern Ireland, what will this do to our URN namespace? Ambitious identifiers, particularly the URN, are not in general applied to the full capability of the standard. There is a cost to increased flexibility, which manifests itself in the added costs of complete implementation on the application level and potentially an increased load on interface design.
Clearly, these standards enjoy very heterogeneous levels of adoption in the present day. Although many organisations have shown an interest in ARK, serious production implementation is seen in only a few places (such as the National Library of France, Portico, CDL, and the University of California). By comparison, DOIs are widely used, particularly in certain disciplines – an asymmetry likely due to the fact that certain journals have picked up the concept of the persistent URI more quickly than others.
The cost of the DOI is often justified by describing the investment as a means of making visible to the public the commitment of one's organisation to the long-term availability of digital resources. This argument echoes a common theme threaded throughout the many discussions that have been held over the last few years on the topic of persistent URIs: the centrality of organisational commitment to the process. The ability to manage and quickly update a set of persistent identifiers centrally is valuable. However, the maintenance process must inevitably be honoured in terms of working hours set aside for the purpose on an ongoing basis, and this task in most cases falls to the publisher of the persistent identifier.
From the perspective of someone looking at persistent identifiers for the first time, this collection of standards is somewhat overwhelming. Each has been built to respond to different needs, and it is perhaps only with the benefit of hindsight that it is possible to look through these many standards and see the number of common threads that link these initiatives together.
Choosing from these options on the basis of a brief written description is difficult if not impossible. In general, most software deployment processes start with requirements gathering and analysis, and the choice of a persistent identifier system is no exception. The context in which the system will be used is an important factor, particularly since adoption of these standards is quite uneven and very dependent on the intended usage and topic area. The precise patterns of adoption across disciplines and thematic areas are not yet known, although there are many bodies discussing adoption in particular contexts.
Particular attention has been put into a number of user scenario-related issues for the ARK; for example, practical issues such as the difficulties encountered copying a URL from paper into a browser location bar have been incorporated into the design. For this reason, the ARK specification and documentation are recommended reading for those looking into the various options for a persistent identifier standard. However, the standard that has achieved widest market penetration is likely to be the URN/DOI and associated resolver (taking the DOI as an interim manifestation of the general URN concept).
There is ample proof that the eventual 'decay' of hyperlinks is a problem. Statistics are available giving the number of resources in general that remain available after publication, such as the results of a small study of around 600 links providing supplementary resources in biomedical publications, which found that 71-92% of supplementary data remained available via the links provided by authors, and that links were more likely to become unavailable if they were moved to sites beyond the publisher's control. A 2004 study of URLs in Medline found that 12% of URLs were published with typos or other formatting errors, around two-thirds of HTTP URLs were consistently available, and another 19% were available intermittently. FTP fared far worse, with just over a third of sites responding.
The most important question, given the focus of the effort, is this: how effective are persistent identifiers at ensuring the long-term availability of resources? Each of these standards is still quite new. How well do they work? Do they all have similar success? What will the failure modes be?
Other important issues for those looking to deploy persistent identifiers include the relevance and longevity of the standard within their own domain; it is possible that, for instance, a standard optimised for a business environment will prove to be inappropriate for an academic one. Additionally, it is useful to ascertain the standard, if any, that will be most familiar to users in the intended domain of use, by engagement with users and perhaps a statistical survey of the area.
There are a number of other considerations. For instance, if a persistent identifier standard is primarily destined for use in the print medium, does this influence the required feature set, as is suggested by the ARK specification? Would a checksum/error-correction mechanism be appropriate? If the semantic readability of identifiers indeed has a negative impact on their longevity (as described above), can this effect be quantitatively demonstrated? On the other hand, the ability to extract appropriate information from an identifier that no longer resolves correctly may permit one to search via a search engine or a library catalogue and retrieve the new location of the lost resource. Which of these effects is more noticeable, and where should the trade-off lie? Does the end-user understand the significance of persistent identifiers, and in which cases at present are they correctly applied?
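As an example of the checksum idea, the ISBN-10 check digit catches single-character errors and adjacent transpositions – the most common transcription mistakes when copying from print; a persistent identifier scheme intended for the print medium could append a similar check character:

```python
def isbn10_check_digit(first9: str) -> str:
    """Compute the ISBN-10 check digit: a weighted sum of the first
    nine digits, modulo 11, with 'X' standing for the value 10."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

# ISBN 0-306-40615-2: the first nine digits yield check digit '2'.
print(isbn10_check_digit("030640615"))  # 2
```

The same idea underlies the optional check character in NOID-style minters: an identifier mangled in transcription fails validation immediately, instead of silently resolving to nothing (or to the wrong object).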
Technology cannot create a persistent identifier, in the digital library community's sense of the term. This is primarily due to the fact that the longevity of each of these persistent identifier schemes (other than OpenURL) is closely linked to the information provider's long-term commitment to keeping records pertaining to their data available and up to date. To offer a persistent identifier for a document via a given resolver additionally implies entering into a long-term commitment with the organisation maintaining that resolver. However, picking the right standard is an important step in ensuring that the infrastructure remains available. Software will age with time, so a more complex infrastructure implies a greater commitment from the organisation that developed and made available the resolver package.
There is likely to be space for several persistent identifier standards, both within the digital library world and elsewhere. Internet infrastructure in general will benefit from similar standards, and indeed many resolver services have sprung up that offer some of the functionality of the persistent identifier, such as TinyURL (tinyurl.com), SNURL (snipurl.com) and elfurl (elfurl.com). However, they do not offer a commitment to persistence or high availability. That said, they are very much more widely used than the formal persistent identifier services of the digital library world. Much the same is true of the content-based persistent identification approaches mentioned above. Whilst it is tempting to suggest that this simply results from the fact that the most common applications of some such schemes are for peer-to-peer download purposes, there is another side to this. They demonstrate that, given a comfortable interface and clear description of the purpose of the application, average end-users are able and willing to make use of resolver services.
In summary, this is an important area, and one in which there are more questions than answers. The valuable services and standards that are presently available form an excellent base for further exploration of this area; but it remains a confusing landscape for the end-user and for the potential contributor of persistent identifiers. Much more needs to be done in terms of articulating pragmatic use cases and publicising the reasons to invest in persistent identifier definition and maintenance. Equally, each standard can learn from the successes, failures and insights of the others.
It is crucial to accept that managed persistence in some contexts may be unachievable. There are limited resources available for ensuring availability of old data, an operating overhead that grows with each year of output, and as such there may indeed be a motivation for providing a coherent mechanism for withdrawing that assumed obligation to manage persistent identification, moving perhaps instead into a cheaper if less trustworthy scheme for information discovery.
Finally, whilst we must acknowledge that solving the problem in the general sense is almost certainly asking too much, there are now many options available for institutions which count long-term availability as a priority. This is a complex space with many players, from national libraries and major journals to small publishers and individuals, and it is unlikely that 'one size will fit all'. As oxymoronic as it may seem to 'experiment with preservation', that is perhaps what we must do.
Many thanks to John A. Kunze for his enthusiasm, and the comments and suggestions he contributed during the writing of this article.