Discussion of digital identifiers, and persistent identifiers in particular, has often been confused by differences in underlying assumptions and approaches. To bring more clarity to such discussions, the PILIN Project has devised an abstract model of identifiers and identifier services, which is presented here in summary. Given such an abstract model, it is possible to compare different identifier schemes, despite variations in terminology; and policies and strategies can be formulated for persistence without committing to particular systems. The abstract model is formal and layered; in this article, we give an overview of the distinctions made in the model. This presentation is not exhaustive, but it presents some of the key concepts represented, and some of the insights that result.
The main goal of the Persistent Identifier Linking Infrastructure (PILIN) project has been to scope the infrastructure necessary for a national persistent identifier service. There are a variety of approaches and technologies already on offer for persistent digital identification of objects. But true identity persistence cannot be bound to particular technologies, domain policies, or information models: any formulation of a persistent identifier strategy needs to outlast current technologies, if the identifiers are to remain persistent in the long term.
For that reason, PILIN has modelled the digital identifier space in the abstract. It has arrived at an ontology and a service model for digital identifiers, and for how they are used and managed, building on previous work in the identifier field (including the thinking behind URI, DOI, XRI and ARK), as well as semiotic theory. The ontology, as an abstract model, addresses the questions 'what is (and isn't) an identifier?' and 'what does an identifier management system do?'. This more abstract view also brings clarity to the ongoing conversation of whether URIs can be (and should be) universal persistent identifiers.
For the identifier model to be abstract, it cannot commit to a particular information model. The notion of an identifier depends crucially on the understanding that an identifier only identifies one distinct thing. But different domains will have different understandings of what things are distinct from each other, and what can legitimately count as a single thing. (This includes aggregations of objects, and different versions or snapshots of objects.) In order for the abstract identifier model to be applicable to all those domains, it cannot impose its own definitions of what things are distinct: it must rely on the distinctions specific to the domain.
This means that information modelling is a critical prerequisite to introducing identifiers to a domain, as we discuss elsewhere: identifier users should be able to tell whether any changes in a thing's content, presentation, or location mean it is no longer identified by the same identifier (i.e. whether the identifier is restricted to a particular version, format, or copy).
The abstract identifier model also cannot commit to any particular protocols or service models. In fact, the abstract identifier model should not even presume the Internet as a medium. A sufficiently abstract model of identifiers should apply just as much to URLs as it does to ISBNs, or names of sheep; the model should not be inherently digital, in order to avoid restricting our understanding of identifiers to the current state of digital technologies. This means that our model of identifiers comes close to the understanding in semiotics of signs, as our definitions below make clear.
There are two important distinctions between digital identifiers and other signs which we needed to capture. First, identifiers are managed through some system, in order to guarantee the stability of certain properties of the identifier. This is different to other signs, whose meaning is constantly renegotiated in a community. Those identifier properties requiring guarantees include the accountability and persistence of various facets of the identifier, most crucially what is being identified. For digital identifiers, the identifier management system involves registries, accessed through defined services. An HTTP server, a PURL registry, and an XRI registry are all instances of identifier management systems.
Second, digital identifiers are straightforwardly actionable: actions can be made to happen in connection with the identifier. Those actions involve interacting with computers, rather than other people: the computer consistently does what the system specifies is to be done with the identifier, and has no latitude for subjective interpretation. This is in contrast with human language, which can involve complex processes of interpretation, and where there can be considerable disconnect between what a speaker intends and how a listener reacts. Because the interactions involved are much simpler, the model can concentrate on two actions which are core to digital identifiers, but which are only part of the picture in human communication: working out what is being identified (resolution), and accessing a representation of what is identified (retrieval).
So to model managing and acting on digital identifiers, we need a concept of things that can be identified, names for things, and the relations between them. (Semiotics already gives us such concepts.) We also need a model of the systems through which identifiers are managed and acted on; what those systems do, and who requests them to do so; and what aspects of identifiers the systems manage.
Our identifier model (as an ontology) thus encompasses:
An individual identifier system can be modelled using concepts from the ontology, with an identifier system model.
In the remainder of this article, we go through the various concepts introduced in the model under these classes. We present the concept definitions under each section, before discussing issues that arise out of them. Resolution and Retrieval are crucial actions for identifiers, whose definition involves distinct issues; they are discussed separately from other Actions. We briefly discuss the standing of HTTP URIs in the model at the end.
The following concept definitions apply to entities:
In the following examples, we notate names as (context, label), and identifiers as (name, thing) = ((context, label), thing).
These definitions support the following insights:
Any association of a name with a thing - by anyone - establishes an identifier. A name is not an identifier unless it identifies something. (E.g. an unassigned phone number is a name, but not an identifier.)
An identifier is not restricted to an association of an ASCII string or a stream of ones and zeros with a thing; a spoken word or a picture also count as identifiers. (Identifiers are in fact defined as linguistic signs.)
The context of the name differentiates instances of a label from each other, and determines which particular instance is being associated with a thing. This allows the same label to mean different things in different contexts.
An identifier management system delimits its own context for the identifiers it manages; so the same label, managed by two different identifier management systems, forms two different identifiers.
Isolated facets of identifier management systems, such as protocol and encoding scheme, may also be considered part of the identifier context - meaning that a change in either brings about a different identifier. But this is a matter of identifier management policy: a particular identifier system model can also decide that its identifiers remain the same regardless of protocol or encoding.
Policies are specific to contexts. In some instances, particular policies in fact set the context of an identifier (see below).
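The (context, label) notation and the insights above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, the `identify` helper, and the example contexts and labels are hypothetical and are not part of the PILIN ontology itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    context: str  # e.g. the identifier management system delimiting the name
    label: str


@dataclass(frozen=True)
class Identifier:
    name: Name
    thing: str  # stand-in for whatever the domain treats as one distinct thing


def identify(name: Name, thing: Optional[str]) -> Optional[Identifier]:
    """A name only becomes an identifier once it is associated with a thing."""
    return Identifier(name, thing) if thing is not None else None


# Hypothetical example: one label managed in two different contexts.
handle_name = Name("handle.example.org", "1234/abc")
purl_name = Name("purl.example.org", "1234/abc")

unassigned = identify(Name("phone numbers", "555-0100"), None)  # a name, not an identifier
a = identify(handle_name, "report.pdf")
b = identify(purl_name, "report.pdf")
```

Here `a` and `b` share a label and a thing, yet they compare as different identifiers because their contexts differ.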
The following are definitions of relations for identifiers:
Equivalence between two identifiers may happen to be true at a given point in time; it does not mean that the two identifiers will always mean the same thing, or should always be treated as interchangeable. Judging whether two things are the same or not presupposes an information model for the things being compared.
Synonyms presuppose an authority which weighs in on the equivalence of two identifiers; the authority can also weigh in on which identifier should be preferred in given contexts. This is still only one authority's claim, and other authorities can make different judgements; but the claim matters for any systems for which that authority is also responsible, as the authority is assumed to enforce its claims throughout its domain. By introducing responsibility, synonymy is a stronger claim than equivalence.
Aliases require that the authority does not just assert equivalence, but actively manages the equivalence itself: it is responsible for making sure the two identifiers stay equivalent while they are being managed, and do not drift apart in what they refer to.
The identifier model constrains when two instances of labels count as the same name. If their contexts differ, they may currently be equivalent when used as identifiers; but nothing guarantees that they will stay equivalent, because the association policies of the two contexts are independent—that is, the two names are managed separately. The model deals with this case by saying that the labels may be the same, but the names are different (since their contexts are different), so they belong to different identifiers. In the example above, the Handle Server and the PURL server may currently agree to use the same label to point to the same thing; but nothing prevents one of the authorities reassigning the identifier later on, and the other keeping it as is.
Other existing information models may have a larger or smaller repertoire of relations for identifiers. This set of relations may map to existing information models in different ways, but is intended to make explicit the role of authorities, which is often left implicit.
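The difference between contingent equivalence, asserted synonymy, and managed aliasing can be sketched in code. The following is a minimal illustration, assuming a hypothetical `Authority` class and example names; it paraphrases the discussion above rather than reproducing the formal definitions of the model.

```python
from dataclasses import dataclass, field


def equivalent(id_a, id_b) -> bool:
    """Equivalence: two ((context, label), thing) identifiers currently share a thing."""
    return id_a[1] == id_b[1]


@dataclass
class Authority:
    """Hypothetical authority that asserts synonyms and manages aliases."""
    things: dict = field(default_factory=dict)   # name -> thing, for names it manages
    synonyms: set = field(default_factory=set)   # equivalences it has asserted
    aliases: dict = field(default_factory=dict)  # alias name -> target name

    def assert_synonym(self, name_a, name_b):
        self.synonyms.add(frozenset((name_a, name_b)))

    def make_alias(self, alias_name, target_name):
        self.aliases[alias_name] = target_name

    def thing_for(self, name):
        # An alias is looked up through its target, so the pair cannot drift apart.
        return self.things[self.aliases.get(name, name)]


authority = Authority()
old_name = ("catalogue.example.org", "rec-1")
new_name = ("catalogue.example.org", "record/1")
authority.things[old_name] = "first edition"
authority.make_alias(new_name, old_name)
authority.things[old_name] = "second edition"  # reassigning the target...
followed = authority.thing_for(new_name)       # ...is followed by the alias
```

Equivalence is only a comparison at one point in time; the alias, by contrast, is kept in step by the authority that manages it.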
The following additional definitions apply to relations:
Changing the encoding scheme of a label does not change the label itself; so different encodings of an identifier are not considered distinct identifiers—so long as we know what the encoding scheme is. So the IRI http://en.wikipedia.org/wiki/Pelé and the URI http://en.wikipedia.org/wiki/Pel%e9 are not considered to be distinct identifiers, but different encodings of the same identifier. Allowing different representations of labels lets us treat labels as Platonic ideals, which can be realised in several ways, for example: a spoken URL, a handwritten URL, and a URL transmitted in an HTTP request are the same identifier. The alternative of treating each as a distinct identifier is untenable.
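The point that encodings are only comparable once the encoding scheme is known can be illustrated with the Pelé example. This is a minimal sketch using Python's standard `urllib.parse.unquote`; it assumes the `%e9` octet is to be read as ISO-8859-1 (Latin-1), in which it stands for "é". The helper name `decode_label` is hypothetical.

```python
from urllib.parse import unquote


def decode_label(encoded: str, encoding: str) -> str:
    """Undo percent-encoding, interpreting the octets in the given character encoding."""
    return unquote(encoded, encoding=encoding, errors="replace")


iri_label = "Pelé"
uri_label = "Pel%e9"

# With the right encoding scheme, both forms yield the same label.
same_label = decode_label(uri_label, "latin-1") == iri_label

# With the wrong scheme (UTF-8), the lone octet cannot be decoded to "é".
wrong_scheme = decode_label(uri_label, "utf-8")
```

Comparing the decoded labels, rather than the raw strings, is what lets the two forms be treated as one identifier.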
Contexts are seldom made explicit in digital identifiers. Contexts have identifiers of their own, but they are seldom included when citing an identifier. However, scheme prefixes in URIs identify contexts at least partly: http://www.example.com is a distinct identifier from ftp://www.example.com, because the named protocol provides a distinct system context for the label proper, www.example.com. Identifiers and identifier contexts are not defined by the services provided for them, or the protocols enabling those services: the contexts exist independently of them. For example, the RFC 3986 definition of HTTP URI specifies that an HTTP URI is not constrained to be processed through HTTP.
This seems to contradict our preceding claim - that http://www.example.com and ftp://www.example.com are distinct identifiers. Our claim does hold up though: the two identifiers share the same DNS domain but are managed separately. http://www.example.com/a.pdf can be a different document from ftp://www.example.com/a.pdf (because the respective server roots are different). That immediately makes them distinct identifiers. That said, http://www.example.com/a.pdf could be accessed through the FTP protocol instead of HTTP, without becoming identical to ftp://www.example.com/a.pdf. In general, a digital identifier can be acted on through several services and several protocols, but remain the same digital object, managed in the same identifier management system. The critical distinction is the management system for the identifier, not the service protocol for accessing it.
It can be useful to point out that the nominated context for an identifier is a subcontext of another, whose policies have already been specified. Because the larger context determines policies for the identifier which the subcontext follows, the larger context is of more interest to users: its policies are more generally applicable. And if the larger context's policies have already been specified, there will be much less policy to specify for the subcontext. The nesting of contexts is also useful to point out if the two contexts will end up managed through the same identifier management system.
For instance, we could argue that in the PURL http://purl.foo.com/net/jdoe/bar, the label is bar, and the context name is http://purl.foo.com/net/jdoe/. We could instead argue that the label is net/jdoe/bar, and the context name is http://purl.foo.com/. Both segmentations are legitimate; but http://purl.foo.com/ defines a larger context for identifiers than does http://purl.foo.com/net/jdoe/ (all PURLs vs. all PURLs in the net/jdoe subdomain); and any constraints set by the enclosing context (e.g. "is resolved by purl.foo.com") also apply to the net/jdoe subdomain. So we take the enclosing context as the starting point for understanding the PURL, rather than the subdomain.
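The alternative segmentations of the PURL into context and label can be enumerated mechanically. The sketch below is illustrative only: the function name `segmentations` is hypothetical, and splitting on path separators is an assumption about where context boundaries may fall, not a rule of the model.

```python
from urllib.parse import urlsplit


def segmentations(url: str):
    """List every (context, label) split of a URL along its path segments,
    from the largest enclosing context to the smallest subcontext."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    segments = parts.path.lstrip("/").split("/")
    result = []
    for i in range(len(segments)):
        context = root + "".join(segment + "/" for segment in segments[:i])
        label = "/".join(segments[i:])
        result.append((context, label))
    return result


splits = segmentations("http://purl.foo.com/net/jdoe/bar")
enclosing_context, full_label = splits[0]
```

The first entry is the enclosing context, whose policies also apply to every subcontext listed after it.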
Identifiers can exist in the abstract, as mental constructs; but they can only be managed and acted on in the physical world, through identifier management systems, as digital objects. Systems allow interaction with identifiers digitally, which enables actions on the identifiers; but they also allow the identifier administrators to take responsibility for the identifiers, and to maintain the identifiers as digital objects. An abstract identifier and a concrete identifier are not the same thing: the identifier management system can place no constraints on the abstract identifier. Instead, the identifier model has the concrete identifier realise a corresponding abstract identifier.
This means that one abstract identifier can be realised by more than one concrete identifier. This happens when two different identifiers, in two different identifier schemes, are managed to be equivalent (e.g. both a Handle and an ARK, or URLs in two different domains). This equivalence makes sense only if the two identifiers are understood to be fulfilling the same underlying purpose (and not merely contingently). So the identifier model accounts for the statement that http://www.example.com/pdf/a.pdf is migrated to http://cms.example.com/repository/a.pdf (an ostensible change in identifier), by claiming that both concrete identifiers realise the abstract identifier ("example.com's PDF repository", "a.pdf"). The concrete identifiers are defined by particular servers and systems; the abstract identifier is defined by the management and intention common to both.
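The "realises" relation between concrete and abstract identifiers can be sketched as a simple mapping. This is a minimal illustration using the migration example from the text; the dictionary and the helper `realise_same_abstract` are hypothetical names.

```python
# Abstract identifier, in (context, label) notation.
abstract = ("example.com's PDF repository", "a.pdf")

# Each concrete identifier (context, label) records the abstract identifier it realises.
realises = {
    ("http://www.example.com/pdf/", "a.pdf"): abstract,
}

# Migration: a new concrete identifier is introduced, realising the same abstract one.
realises[("http://cms.example.com/repository/", "a.pdf")] = abstract


def realise_same_abstract(concrete_a, concrete_b) -> bool:
    """True when two concrete identifiers realise one abstract identifier."""
    return realises[concrete_a] == realises[concrete_b]


old = ("http://www.example.com/pdf/", "a.pdf")
new = ("http://cms.example.com/repository/", "a.pdf")
migrated = realise_same_abstract(old, new)
```

The concrete identifiers remain distinct entries; only the abstract identifier they point to is shared.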
If the label in the two concrete identifiers is the same, then the same label, in different contexts, is used to identify the same thing. The contexts are still different, so the two are not guaranteed to remain synonymous. Because the two concrete identifiers can nonetheless be confused as being the same, the identifier model gives them a distinct name (homologues). Retaining the same label across contexts is useful in managing multiple contexts.
The identifier model's approach of distinguishing abstract from concrete identifiers is not common: usually if two concrete identifiers are distinct, no attempt is made to model any underlying identity between them. In some approaches, this differentiation even extends to differences of encoding, or of names of contexts. (For example, the URIs doi:10.1000/182 and info:doi:10.1000/182 are distinct strings, and some systems will process them as distinct identifiers for that reason - even though they are in fact different encodings of the same identifier.) However, the notion of an abstract identifier allows us to capture the intent behind associating a label with a thing, which ultimately resides in an authority rather than a specific system - let alone a particular encoding or representation. This allows identifiers to be considered not merely equivalent, but synonymous (deliberately and reliably equivalent), because the same authority intends them to mean the same thing. Because identifiers are signs used meaningfully, this intent is important to capture.
The following are definitions of qualities of entities:
It is critical to this model that identifiers identify things uniquely; what that means is determined by the information model used for the things identified. An identifier can identify an aggregation (which is a single thing, but has multiple components); it can also identify an abstraction, which may encompass multiple concrete things (e.g. different versions of a digital object can be identified by the same identifier, because the identifier does not identify a single version.)
Uniqueness is only meaningful relative to a scope. For example, "Perth" is not unique in the scope of city names on Earth (let alone in the scope of the universe); but it is unique in the scope of city names in Western Australia. The scope of uniqueness of a name motivates the definition of a context for the name. So defining the context of the name "Perth" as "city names in Western Australia" means "Perth" can still be used as an identifier unambiguously, in the given context.
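The dependence of uniqueness on scope can be shown with a few lines of code. The sketch below is illustrative: the place list is a small hand-picked sample (not a gazetteer), and `is_unique` is a hypothetical helper.

```python
# Illustrative sample of (city name, region) pairs.
places = [
    ("Perth", "Western Australia"),
    ("Perth", "Scotland"),
    ("Perth", "Tasmania"),
    ("Albany", "Western Australia"),
]


def is_unique(label, scope=None) -> bool:
    """A label is unique if exactly one place in the given scope carries it.
    With scope=None, the whole sample is the scope."""
    matches = [p for p in places if p[0] == label and (scope is None or p[1] == scope)]
    return len(matches) == 1


unique_globally = is_unique("Perth")
unique_in_wa = is_unique("Perth", scope="Western Australia")
```

Narrowing the scope is what turns an ambiguous label into a usable name, which is why the scope of uniqueness motivates the choice of context.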
Universality is useful for discovery: if only one identifier exists for an object in a registry, then searching for all instances of that identifier in the registry will discover all references to the object. If the object is known to have multiple identifiers, on the other hand, then discovery requires a separate search for each identifier.
If the context for an identifier is "all known naming systems" (the global context), universality is not a realistic expectation. There cannot be only one possible identifier in the world for a given thing, so long as any authority can set up its own identifier management system. However, various alternate strategies emulate universality - particularly preferred identifiers: an authority can advocate that one identifier should be preferred over its synonyms in its specific sphere of influence. This allows the search space for discovery to be constrained. Establishing preferred identifiers is the motivation for normalising names in catalogues and databases.
If the context for an identifier is "a single identifier management system", by contrast, universality is often realised: a particular identifier management system will often have only one identifier for a given thing.
Both "persistent" and "accountable" are second-order qualities when applied to identifiers. Identifiers are not persistent or accountable in themselves, but persistent or accountable with regard to other qualities, such as resolvability, citability, registration, association, and so on.
Persistence is not defined through a timeframe alone. It is defined through an assertion that the given quality of the identifier will be maintained throughout a nominated timeframe. Because persistence is an assertion, it needs to gain users' trust through demonstrating that appropriate management is taking place.
Making an identifier persistent is a matter of policy and not technology. The ability to redirect actions on an identifier to another management system (as done under DNS, Handle, and "Cool" HTTP URIs) makes it easier to implement identifier persistence policy; but it does not automatically make the identifiers persistent.
Qualities need to be associated with digital objects, if they will be acted on in the digital realm. Accountability, for instance, is realised through accountability data; persistence, as shown below, is realised through maintaining association data.
The following are definitions of actions applied to entities:
As digital objects, identifiers can be registered and deregistered. That is distinct from creating and destroying identifiers: if someone has made the connection between a name and a thing in their head, they have created an identifier, and only erasing their memory will destroy that association. Though concrete identifiers exist only by virtue of their management systems, identifiers can be recorded outside those management systems. (This is important for archival purposes: we can still use deregistered identifiers to identify things in a historical sense.)
Actions on concrete digital identifiers are realised through services on identifier management systems.
The usual target of verification is the resolvability of an identifier: verification confirms that the identifier resolves to something, and moreover that it resolves to the correct thing.
Actionability on an identifier requires the use of an identifier management system. Citing an identifier, for instance, does not depend on the existence of an identifier management system; so citing an identifier (e.g. writing the identifier name down on a piece of paper) does not make the identifier actionable. When an identifier is referred to as Actionable, what is usually meant is that the identifier is Resolvable. The distinction between Resolve and Retrieve is discussed further below.
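The distinction between registering an identifier in a system and merely recording or citing it can be sketched as follows. This is a minimal illustration; the `Registry` class, its method names, and the example name are hypothetical.

```python
class Registry:
    """Hypothetical identifier management system holding concrete identifiers."""

    def __init__(self):
        self._records = {}

    def register(self, name, thing):
        self._records[name] = thing

    def deregister(self, name):
        self._records.pop(name, None)

    def resolve(self, name):
        # Actionability: only the system can act on the identifier.
        return self._records.get(name)


registry = Registry()
name = ("registry.example.org", "item-42")
registry.register(name, "annual report 2008")

# A citation recorded outside the system, e.g. written down on paper.
citation = (name, registry.resolve(name))

registry.deregister(name)
after_deregistration = registry.resolve(name)
```

After deregistration the identifier is no longer actionable through the registry, but the citation kept outside the system still records what the name identified.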
The following concept definitions relate to the action of publishing entities:
The notion of a curation boundary helps us distinguish between administrators and end-users, even if the community of administrators is distributed and sizeable. Making an identifier accessible to an administrator remotely does not count as publishing it, any more than making it available locally for editing does. An identifier is only published when a user who cannot update the identifier is newly given the ability to act on the identifier in some other way (typically, as we will see, to resolve it).
Publishing an identifier depends on who is allowed to act on it, as well as on how they can act on it. An identifier may be resolved by administrators, while it is being prepared for release. But it is only considered published once end-users are also allowed to act on it.
Querying and verifying identifiers are actions typically undertaken in order to curate the identifier, though they are not write operations, and might be accessible outside the curation boundary.
This definition of publishing centres on authorising Read actions through a system. An alternate definition of publishing depends on who has knowledge of the entity published: if an end-user becomes aware of an identifier, and can, for example, cite it, we could speak of the identifier being published. But the definition adopted here requires the end-users to perform actions through the identifier management system: if they can write the identifier down, but they cannot yet resolve it, this definition does not consider it as published yet.
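The definition of publishing adopted here can be expressed as a check over permissions. The sketch below is an assumption-laden illustration: the action names, the `WRITE_ACTIONS` set, and the `is_published` helper are hypothetical, and a user is treated as outside the curation boundary when they hold no write action.

```python
WRITE_ACTIONS = {"update"}


def is_published(permissions: dict) -> bool:
    """permissions maps each user to the set of actions the system lets them perform.
    The identifier counts as published once some user who cannot update it
    can still act on it in another way (typically, resolve it)."""
    for actions in permissions.values():
        outside_curation_boundary = not (actions & WRITE_ACTIONS)
        if outside_curation_boundary and actions:
            return True
    return False


in_preparation = is_published({"admin": {"update", "resolve"}})
known_but_not_actionable = is_published({"admin": {"update", "resolve"}, "reader": set()})
released = is_published({"admin": {"update", "resolve"}, "reader": {"resolve"}})
```

The middle case corresponds to an end-user who can write the identifier down but cannot yet resolve it, which this definition does not count as published.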
The following concept definitions relate to the actions of resolving and retrieving on identifiers:
Association data captures the association in an identifier management system of an identifier's name and the thing identified. Maintaining this data is the primary responsibility of an identifier management system. However, an identifier record, as a digital object, may also contain other information.
Resolving an identifier is different from querying it. Querying a Handle identifier is done by viewing the entire Handle digital record - including not only any URLs registered (as association data), but also timestamps, permissions, and other metadata. Resolving a Handle identifier, on the other hand, typically involves mapping the Handle to one of the registered URLs.
Resolution and retrieval are often conflated. Resolution distinguishes what the identifier identifies from what it does not; it does not necessarily involve accessing what is identified. In contemporary digital identifier systems, some sort of resolution to a locator is a prerequisite for retrieval. However, metadata describing a resource is an acceptable way of resolving an identifier - so long as that metadata uniquely discriminates the thing identified from all other candidates. In the HTTP protocol, resolution and retrieval can be distinguished as HEAD vs. GET.
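The three actions discussed here (query, resolve, retrieve) can be contrasted on a single in-memory record. This is a sketch only: the record layout and function names are hypothetical and do not reproduce the Handle System's actual data model or API.

```python
# Hypothetical identifier record: association data plus other metadata.
record = {
    "name": ("hdl", "1234/abc"),
    "urls": ["http://repo.example.org/abc.pdf"],  # association data
    "timestamp": "2009-01-01T00:00:00Z",
    "permissions": {"admin": "rw"},
}

# Hypothetical content store standing in for the network.
store = {"http://repo.example.org/abc.pdf": b"PDF bytes"}


def query(record):
    """Query: view the entire identifier record, metadata included."""
    return record


def resolve(record):
    """Resolve: work out what is identified, here by mapping to a registered URL."""
    return record["urls"][0]


def retrieve(record, store):
    """Retrieve: access a representation of what is identified (resolution comes first)."""
    return store[resolve(record)]


queried = query(record)
locator = resolve(record)
content = retrieve(record, store)
```

Resolution returns only enough to discriminate the thing identified, whereas retrieval goes on to fetch a representation of it.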
Multiple resolution is intrinsic to the functioning of appropriate copy protocols like OpenURL, which assume that a single abstract resource can have multiple concrete instances, each with its own locator. Multiple resolution is also commonplace in the operation of large-scale, mirrored Web sites. Usually the selection of one of the multiple instances is a process hidden from the user.
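Multiple resolution, with the selection of one instance hidden from the user, can be sketched as follows. The record layout, the region-based selection rule, and the function names are all assumptions made for illustration; real appropriate-copy services use their own selection criteria.

```python
# Hypothetical record for one abstract resource with two mirrored instances.
record = {
    "name": ("mirror.example.org", "dataset-7"),
    "locators": [
        {"region": "eu", "url": "http://eu.mirror.example.org/dataset-7"},
        {"region": "au", "url": "http://au.mirror.example.org/dataset-7"},
    ],
}


def resolve_all(record):
    """Multiple resolution: every locator registered for the identifier."""
    return [locator["url"] for locator in record["locators"]]


def resolve_one(record, region):
    """Pick the instance for the user's region, falling back to the first locator."""
    for locator in record["locators"]:
        if locator["region"] == region:
            return locator["url"]
    return record["locators"][0]["url"]


all_copies = resolve_all(record)
local_copy = resolve_one(record, "au")
fallback_copy = resolve_one(record, "us")
```

The caller of `resolve_one` sees a single locator and need not know that several instances exist.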
It bears repeating that digital identifiers do not apply exclusively to the digital realm. Not all things identified by digital identifiers are online digital objects, so they cannot all meaningfully be retrieved (e.g. a vocabulary item or an organisation - although the description of the organisation may well be a digital object, such as a Web page). Not all identifiers are associated with services to resolve or retrieve the identifiers digitally (e.g. a name roster in Excel); in fact, digital identifiers need not provide retrieval as an option at all. However, there is a strong expectation that online identifiers should at least be resolvable: a user should be able to determine, through some service, what is being identified.
For example, a request to retrieve a resource by its URI identifier (HTTP GET on the URI) can be distinct from a request for the most appropriate copy of the resource, or metadata concerning the resource (HTTP GET on the URI embedded in an OpenURL request), or an archived version of the resource (HTTP GET on the URI embedded in a Wayback Machine request). Under the Semantic Web, HTTP URIs identifying abstractions may not be intended for dereferencing at all - even if they hyperlink to descriptions of the thing identified (see e.g. XML namespaces, or the Semantic Web use of HTTP Status Code 303 See Other).
For persistent identification of digital resources, identifier management systems should maintain association data independently of the locator used to retrieve the resource - e.g. as a prose description identifying the resource. Even if the network location of the resource is compromised or no longer maintained, administrators should be able to recover what was supposed to be identified.
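Keeping association data independent of the locator can be sketched with a record that stores a description alongside the URL. This is a minimal illustration; the `IdentifierRecord` class, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdentifierRecord:
    name: tuple
    description: str                # association data, independent of any locator
    locator: Optional[str] = None   # network location used for retrieval, if any


def what_is_identified(record: IdentifierRecord) -> str:
    """Resolution by description: works whether or not the locator survives."""
    return record.description


record = IdentifierRecord(
    name=("registry.example.org", "item-42"),
    description="Annual report 2008, final PDF edition",
    locator="http://repo.example.org/reports/2008.pdf",
)

record.locator = None  # the network location is no longer maintained
recovered = what_is_identified(record)
```

Even with the locator gone, an administrator can recover what the identifier was supposed to identify and re-establish a locator later.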
The HTTP protocol is currently close to universal for interacting with resources on the Internet; this has proven of great benefit in expanding the reach of the Internet and guaranteeing its integrity. Any digital identifier scheme used online realistically needs to provide at least some services through the HTTP protocol. This amounts to exposing those identifiers as HTTP URIs - as is already commonplace, e.g. with Handles, XRIs and ARKs, through resolution and retrieval services.
It is also clear from the foregoing, and from the current definition of HTTP URIs, that HTTP URIs qualify as identifiers (and are no longer bound to be locators, as URLs). Provided they are appropriately managed, nothing prevents them being used as persistent identifiers. There is of course a long history of HTTP URLs not being managed appropriately; but persistence has always been a policy matter. There is no technical barrier to HTTP URIs being persistent, as indeed Tim Berners-Lee pointed out in 1998.
That said, we take issue with the following common assumptions, which do not follow:
Different identifier schemes address different business requirements, by presenting users with different services and policies. The HTTP protocol has a deliberately restricted repertoire of services, consistent with a resource-oriented rather than a service-oriented view of architecture; and it does not natively support a rich environment for managing identifiers, such as we believe is necessary to support identifier persistence properly. Other identifier schemes, more explicitly oriented towards persistence, provide users with different levels of support and management.
It is important for the Web that all digital identifiers behave as HTTP URIs for dereferencing - resolution and/or retrieval. This has made the modern Web architecture possible. But this does not mean all digital identifiers have to be HTTP URIs, and in particular managed as HTTP URIs, in order to achieve interoperability with other identifiers. HTTP as a service protocol for identifiers does not address all purposes equally well, and there is a place in the Web for other identifier schemes to continue in use, so long as they are exposed through HTTP.
Under the PILIN Project, we have sketched a model for identifiers and identifier services. This model has allowed us to compare different identifier schemes, and identifiers used in different domains, without losing sight of their underlying commonalities. One of the major problems in debates on persistent identifiers has been the different understanding of terminology between proponents of different identifier schemes: this has led to misunderstanding, or inordinate focus on incidental details. The ontology allows us to analyse identifier systems in terms of their base functionality and the requirements they fulfil, rather than being distracted by implementation specifics. To give an example, debate over how identifiers are actionable is simplified by a comparison of how identifier systems dereference identifiers, and by the recognition that all contemporary digital identifier systems provide resolution, but only some provide retrieval.
This more abstract layer of comparison brings clarity to the identifier debates; it enabled us to articulate identifier policy guidelines in a much more focussed manner. Identifier systems can then be mapped back to the business requirements they satisfy more accurately.
The model as presented here is not novel: it represents a convergence of views in various identifier communities, even though communication between those communities has often been difficult. The basic notions underlying the model are drawn from semiotics, and are much older. However, making such a model explicit helps establish which differences between identifier schemes are essential, and which are incidental. It does so especially by foregrounding the requirements users have for identifiers, as desirable identifier qualities.
We hope that our model can help others in the identifier community likewise approach the recurring debates over identifier systems with more clarity and less risk of confusion - and in that way, can focus discussion on issues which truly make a difference to identifier managers and users.
This article reports on work done under the PILIN project and the PILIN ANDS Transition Project. PILIN was funded by the Australian Commonwealth Department of Education, Science and Training (DEST) under the Systemic Infrastructure Initiative (SII) as part of the Commonwealth Government's Backing Australia's Ability - An Innovation Action Plan for the Future (BAA). The PILIN ANDS Transition Project was funded by the Australian Government as part of the National Collaborative Research Infrastructure Strategy (NCRIS), as part of the transition to the Australian National Data Service (ANDS).
The authors wish to acknowledge the support and feedback of the rest of the PILIN team. We also thank Dan Rehak for his feedback.