DCC Workshop on Persistent Identifiers
A Digital Curation Centre (DCC) Meeting on Persistent Identifiers was held over 30 June - 1 July 2005 at the Wolfson Building at the University of Glasgow. This is a new construction (2002) just opposite the 1970s Boyd-Orr building, mentioned before in Ariadne's pages. The architecture of this building is quite unlike the Boyd-Orr building however, being light and airy, with more imaginative use of space: the lecture theatre in which the meeting took place is in the shape of an eye, situated at the edge of the main open space. Walking around the outside of the wood-panelled theatre it seems very small, but it can accommodate around sixty comfortably.
I've focused on a selection of speakers to illustrate the range of the meeting. There were other important contributions to the discussion, not reflected here. All of the presentations are available on the DCC Web site.
Chris Rusbridge introduced the meeting by suggesting that the business of identifiers seems easy, but that in fact the world of Persistent Identifiers (PIs) is not at all easy to understand, and there are some 'seriously hard' questions. Identifiers are not just URIs (Uniform Resource Identifiers): there are also ISBNs, ISSNs, credit card numbers, star names, proteins, standards, keys, etc. Identifiers relate to different things (with different requirements), and also to quite different classes of things, such as services, collections, series, etc. Identifiers can be used to refer to the abstract concept of a work or to particular versions (editions) of it. The use of identifiers for both the abstract work and its manifestations and representations has an impact on the requirement for uniqueness. The 'persistent' bit isn't really a technical issue, but is rather bound up with social and economic factors. The binding between identifier and resource is what needs to be persistent. Tim Berners-Lee once said that 'Cool URIs don't change'.
Chris summed up his introduction by saying that the aim of the workshop was to understand the requirements pertaining to this particular space from the preservation and curation angle.
The speakers for the event included Rick Rogers of the National Library of Medicine in the US, Charles Duncan of Intrallect Ltd, Stefan Gradmann of the University of Hannover, John Kunze of the California Digital Library, Stuart Weibel, OCLC, Henry S. Thompson from the School of Informatics at the University of Edinburgh (also representing the W3C Technical Architecture Group), Andy Powell (UKOLN), Sean Reilly of CNRI, Peter Buneman of the School of Informatics at the University of Edinburgh, and Norman Paskin of the International DOI Foundation. In addition there were many contributions to the discussions from members of the DCC and others in the audience. Altogether there were around thirty-five participants for this important two-day event.
The Ancient History of Identifiers
To kick off the meeting, Rick Rogers of the National Library of Medicine and the NISO standards organisation briefly explored the history of identifiers. This was in the nature of a whirlwind tour, from the Ancient Near East up to the current use of the Dewey system, via the Greeks, and the identifier system of Robert Cotton (1571-1631), which persists to this day in the way his collection of manuscripts is referenced by the British Library. Rogers made the point that even very quirky identifiers can be persistent, and also that any identifier is better than none. He also drew attention to the fact that (as in the case of Robert Cotton's library) identifiers rely heavily on their context.
The eScience Approach
Peter Buneman of the DCC and the School of Informatics at the University of Edinburgh contributed a presentation which reflected views about identifiers among the community for whom databases and data mining are principal concerns. These differ significantly from the views of other communities, not least when it comes to discussing whether or not useful metadata about resources ought to be contained in file paths, or whether or not we ought to represent mined information in the form of file paths.
Digital objects such as databases have internal structures and are subject to change over time. This dynamic information is obviously important in the preservation and curation of the data. The issues are difficult to define precisely: are we talking about parsing (digital object) identifiers, or digital (object identifiers)? Some things can easily be converted to XML (Swissprot was cited as an example). He suggested that ad hoc annotation or citation actually requires some notion of location to make the annotation or citation 'stick'. The idea of a file path supplies this sense of location. XML-wrapped data contains information which hierarchically specifies the location of objects. And if you are generating data which is to be cited in some way, you need to have a stable hierarchy and keys to the information. This information needs to be published in a standardised interoperable way, plus you need to keep all previous versions of your data, and to give these versions numbers (again requiring interoperable standards). Citing by time and date, he suggested, is unreliable; it is better to push the time down into the data via version numbers.
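The combination Buneman described (a stable hierarchy, keys, and numbered versions) can be illustrated with a minimal Python sketch. It is not taken from his presentation: the archive contents, the entry key 'P12345', the citation syntax and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive: every published version of the database is kept,
# indexed by version number rather than by date.
ARCHIVE = {
    1: "<db><entry id='P12345'><name>Example protein</name></entry></db>",
    2: "<db><entry id='P12345'><name>Example protein (revised)</name></entry></db>",
}


def cite(version, entry_key, field):
    """Build a citation from a version number and a keyed hierarchical path."""
    return f"v{version}:/db/entry[@id='{entry_key}']/{field}"


def resolve(citation):
    """Return the text the citation points at in the cited version, or None."""
    version_part, path = citation.split(":", 1)
    root = ET.fromstring(ARCHIVE[int(version_part[1:])])
    # ElementTree paths are relative to the root element, so drop '/db/'.
    node = root.find(path[len("/db/"):])
    return None if node is None else node.text


citation = cite(1, "P12345", "name")
print(citation, "->", resolve(citation))
```

Because the citation names a version and a key rather than a date or a position, it keeps pointing at the same content after the database has moved on to version 2.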
The Learning Object Approach
Charles Duncan of Intrallect looked at use cases of persistent identifiers, particularly in the context of learning objects. He claimed not to know much about these identifiers (as many speakers said at this meeting), but to know what he could do with them. He suggested that there were two communities - those who want to define the functional specifications of PIs and their resolvers, and those who don't care about that at all, but want to do things that are only possible if identifiers are working in the background. Communities need identifiers and things need to work together, whatever decisions are made about identifiers. Some degree of local autonomy is also needed. We also need to know about decision-making processes for the creation of identifiers, and these to some extent need to be standardised. He suggested that we might have some lessons to learn from the experience of ISBN (International Standard Book Number) - a large number of different communities and people are now accustomed to one kind of identifier, though (perhaps) the process of becoming acclimatised to ISBNs involved heartache over a number of years.
He cited exchanges on the CETIS Metadata list where questions are asked such as 'do I need identifiers for both objects and metadata in my repository?' One of the answers was: 'it depends on what you want to use them for', which in practice is not that helpful an answer. Someone else on the list said: 'You shouldn't need to worry too much, appropriate identifiers should be generated automatically by your repository'. We need to separate what the technology supports from what in fact can be done with identifiers.
Discussing some use cases originally presented two years earlier at a CETIS meeting in London, he said that he thought that 'we in the e-learning world are no further forward now than we were then about making decisions about identifiers'. 'Repository Z' for example has some metadata (both entity identifiers and metadata identifiers). Someone creates a new version of an object, and new metadata for that new version. Obviously there is a need to relate one to the other - i.e., something like: '+relation ident (isVersionOf)'. It is important to be able to modify original metadata, giving the location of the new version. Say for example an object is duplicated in another repository - should the identifier for that object be added to that object? Essentially we want identifiers to be persistent. So in this case the object retains its identifier, and gets new metadata.
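The 'Repository Z' use case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record structure, the function names and the identifiers are hypothetical, and the 'hasVersion' back-link is borrowed from Dublin Core as one possible way of modifying the original metadata; it is not something Duncan specified.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    metadata_id: str   # identifier of the metadata record itself
    object_id: str     # identifier of the object being described
    location: str      # where this copy of the object currently lives
    relations: list = field(default_factory=list)  # (relation, object id) pairs


def add_version(records, original_id, new_object_id, new_metadata_id, location):
    """Register a new version and relate it to the original (isVersionOf)."""
    originals = [r for r in records if r.object_id == original_id]
    if not originals:
        raise KeyError(original_id)
    new_record = MetadataRecord(new_metadata_id, new_object_id, location,
                                [("isVersionOf", original_id)])
    # Modify the original metadata so that it points at the new version.
    for record in originals:
        record.relations.append(("hasVersion", new_object_id))
    records.append(new_record)
    return new_record


def duplicate(records, object_id, new_metadata_id, location):
    """Copy an object to another repository: same object id, new metadata."""
    copy = MetadataRecord(new_metadata_id, object_id, location)
    records.append(copy)
    return copy


records = [MetadataRecord("md-1", "obj-1", "repository-z/obj-1")]
add_version(records, "obj-1", "obj-2", "md-2", "repository-z/obj-2")
duplicate(records, "obj-1", "md-3", "repository-y/obj-1")
for record in records:
    print(record)
```

The point of the sketch is the asymmetry: a new version gets a new object identifier plus a relation, while a duplicate keeps the object identifier and only gains a new metadata record.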
Duncan suggested that we shouldn't define the solution until we have defined the problem. Defining the processes for using identifiers is as important as, or more important than, choosing identifiers. We need to think about the cost at the point of issue of identifiers. We need to know what needs an identifier. We need to focus on the properties of identifiers, the services which can be built on identifiers, and whether it matters which identifiers are being used.
OA Publishing and Identifiers
Stefan Gradmann looked at the identifiers question from the point of view of the specific requirements of the Open Access (OA) publishing and repository management communities, and made some general remarks regarding the conceptual framework. He looked at the functional context - German Academic Publishers (GAP), and at questions arising from this context relating to versioning, integrity, rights/restrictions management, social aspects, granularity, citation and scientometrics. Questions he thought of importance included: what is persistent - object, metadata, or the relationship between the two? And who is entitled to alter document integrity statements? Other questions include rights/restrictions management (who may access documents, and under what conditions). We need to determine what is protected - the object alone, or object plus access method? What frameworks do we use? Can identifiers and DRM (Digital Rights Management) technical frameworks be separated effectively and neatly over time? If not, the consequences in terms of persistency might well be catastrophic, especially in highly interwoven scientific contexts.
Another question, addressing the social dimension to the assignment of identifiers: Is the only guarantee of persistence the commitment of the organisations? It boils down to whom we trust. Authors? Institutions? Companies? Stefan also addressed questions of granularity of citation and references, which used to be very simple: i.e., author, title, ISBN, 'page 231, paragraph 3', etc. But this notion of linear documents will decompose, and will need different schemes for referencing. So, what level of identifier granularity is required for citing scientific works and for citations to work over time? Do we assign identifiers at work, expression or manifestation level? What level of identifier granularity is required for current and future scientometric methods to work as intended?
Gradmann took the most philosophical approach of the speakers at this event, but was careful to indicate that though he was prepared to move away from use cases, he wasn't going to address the question of identity! We have to decide whether or not we identify files, bits and bytes, documents, data, locations, text, images, concepts, information or signs. We also have to decide on the nature of the relation between identifier and object (likeness, signification, descriptive representation, etc.). He took a short excursion into semiology and suggested that some of the confusion we are facing could be clarified if we look at the language model, which is not a pointer and object model (i.e., word-thing); hence it is worth looking at Saussure's model of the linguistic sign. This is not (he assured us) to provide a new methodological framework, but it enables us to understand more precisely some of the questions already present in various presentations given at the preceding workshop in Cork, such as: are identifiers referential surrogates, or do they function as signs, pointers, locators, names, etc.? Should they have a meaning? How do we guarantee the persistency of the link between identifiers and objects, as well as the one between identifiers and context? We need to understand e-research scenarios (with identifiers as a core constituent). One member of the audience suggested that some of the more universal and abstract questions were a waste of time. Stefan replied that if we didn't ask these questions, we'd create identifiers which at some point we would have to throw away.
The Digital Library and Identifiers
John Kunze reflected on the California Digital Library (CDL) as a case study - it has no books, students, or faculty. It has affiliations with state libraries in California. It does content hosting, electronic texts etc., and came up with a digital preservation programme about two years ago; it is still getting its bearings.
What is a persistent identifier? Kunze suggested that 'an identifier is valid for long enough'. Though we'd like to say forever, we know we can't say that (there are dumpsters at the back of libraries). An identifier is a 'relation' between a string and a thing. An ID is not a string (very important). An ID is a matter of opinion, not fact: there will be at least one other provider, serial if not in parallel, or otherwise your objects die with you (not pretty and not convenient, but we have to live with that). Multiple copies will have divergent metadata, but it is better to have multiple copies. It helps to accept a certain amount of disorder. Long-term preservation won't happen unless objects can change residence and diverge. It is better if an object lives in several places at once - the alternative (loss) is worse. We need to agree to disagree. What we say but shouldn't say is that we should not reassign persistent IDs to something else. Or that we shouldn't replace a persistent object with another. But we do. We honestly provide a real kind of persistence, but with very different replacement policies.
CDL identifiers must identify the object, whether or not it is to hand. Metadata is needed. The identifiers must convey different flavours of permanence, and must lead to access. They must be valid for some longish period, and be carried on, in or with the object.
As the two-day workshop progressed, it was possible to see that there were two broad schools of thought in the field currently, pointing in quite different directions. The conversations which occurred both during and after the presentations reflected the difficulty of bringing together these opposing notions about how identifiers ought to work. Some speakers argued for identifiers essentially as unique strings to reference any kind of digital item; others emphasised the need for metadata, either in the unique string, or associated with the string in some way. Andy Powell expressed the view that all identifiers should be of the HTTP variety, but for the purpose of identification, not location. This view produced a good deal of discussion in the course of the event, especially on the implications of using the HTTP format as a persistent identifier, since it looks like an address. Identifying something is not the same as granting access.
Henry Thompson published a document on Architectural Guidelines for Naming on the first day of the workshop, section 4 of which argues that 'A URI owner should provide representations of the resource it identifies'. A representation of a resource is not the resource itself, however. So a URI identifying, say, someone's medical records might, when dereferenced, return something other than the records themselves (possibly a description of the contents of the records, or possibly just a confirmation of the name of the person whose records have that URI). This raises the question of what the identifier actually identifies - the resource or the representation. And we ought to ask, does the representation have a separate identifier?
An answer provided by the W3C Architecture document to the above is that responsibility for the uniqueness of the identifiers belongs (not exclusively) to the owner of the URI. A feature of the discussion was that a number of speakers were sticking closely to the W3C Architecture document on the subject of identifiers, and others were taking a more abstract view of the questions coming up. Those sticking close to the W3C view tended to fall back on the actual wording of the Architecture document, which I think was the source of some of the periodic cross-purposes evident in the discussion.
Broadly the workshop appeared to revolve around two key areas which are problematic in some respects, either for the practical application of identifiers to the real world, or for those who might wish to work within the scope of the W3C architecture.
The first key area has already been mentioned: the question of whether identifiers ought to be treated as random strings (even if they look like HTTP URLs), whether what we want is something which has metadata either contained within the identifier or associated with it, or indeed whether metadata about the resource is completely irrelevant to the matter of identifiers.
The second concerns a theoretical distinction contained in the W3C Architecture, which seems to generate a good deal of misunderstanding, annoyance and difficulty in the discussion of identifiers. This distinction is between the URI identifying a resource, and the actual location of a resource. In other words, if a resource is identified by (say) http://www.dcc.ac.uk/dcc/workshops/pi-dcc/, it does not follow that using (for example) the HTTP GET request will return the resource, a representation of that resource, or indeed anything at all. There is no absolute obligation within the W3C model for the owner of an identified resource to return anything.
As long as the two concepts are understood as separate (as they are supposed to be), W3C would argue that there is no problem. Yet the expectation of the user community not involved in theoretical discussion is that an HTTP identifier is an address (since that is what it looks like), and thus it ought to return something when the URI is 'dereferenced' (W3C technical language for 'used as a link'). For W3C, the HTTP identifier format (they don't exclude others) is perhaps just a convenient way of creating a unique string. However the distinction between a URI and a URL is not signified in any way in the formal appearance of the HTTP identifier, and it cannot be determined by a user without at least sending a request for an HTTP header. Thus the concepts are not separately indicated at the identifier level.
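The practical consequence can be shown with a short Python sketch, offered as an illustration rather than anything presented at the workshop. It starts a throwaway local server (the paths '/resolvable' and '/name-only' are hypothetical) and sends an HTTP HEAD request for each identifier: the two strings look alike, and only the response shows which one is backed by a representation.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    """Hypothetical owner: only one of its identifiers has a representation."""

    def do_HEAD(self):
        if self.path == "/resolvable":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
        else:
            # The identifier is still a valid name; nothing is served for it.
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demonstration quiet


def head_status(host, port, path):
    """Send a HEAD request and return only the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def demo():
    # Port 0 asks the operating system for any free local port.
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        return {path: head_status(host, port, path)
                for path in ("/resolvable", "/name-only")}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Nothing in the form of either identifier predicts the outcome, which is the difficulty described above: whether an HTTP identifier also behaves as a locator can only be discovered by asking.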
For anyone interested in the future of identifiers for digital objects this was an extremely interesting event, covering a huge amount of territory in two days. There were a couple of sessions where much of the discussion came from the floor, and one session was devoted entirely to questions of interest to the community at large, rather than the more arcane theoretical issues. However it was clear that there are many more questions around than there are answers, and that there is still a great deal to do before identifiers can be subsumed into applications, and, to the user of services, become an invisible aspect of the technology.
- The DCC Web Site contains all presentations given at this event, plus mp3 audio of the sessions: