Web Magazine for Information Professionals

Metadata: I Am a Name and a Number

Paul Miller on Digital Object Identifiers.

People, places, and things are identified in any number of different ways.

I, for example, have a National Insurance Number, a staff payroll number, several bank account numbers, and assorted frequent flyer programme membership numbers, all of which are the handles that certain groups of people use to identify me. I also have a name and, associated with me, three telephone numbers and at least two e-mail addresses. Neither a telephone number nor an e-mail address truly identifies me, the person, of course, but they might well be seen as just as useful a means of 'retrieving' me as my name or any other associated identifier.

Each of these is used by someone at some time to identify one particular facet of Paul Miller, whether in my guise as the taxpayer, the member of staff, the bank customer, or whatever. None of them is particularly good for identifying Paul Miller the person over any extended period of time, however. I might take a job somewhere nice, sunny and far away from the London Underground, thus rendering National Insurance and payroll numbers effectively useless. I might change bank account or airline, and I'll almost certainly change phone numbers and e-mail addresses.

Even my name isn't a sure way of identifying me on its own. Registered as Andrew Paul, yet called Paul from birth so far as I'm aware, I face a constant tide of confusion from officialdom the world over as they try to associate Paul the flesh and blood with A.P. the faceless statistic in their databases... And then there's the long-held feeling that maybe I should change my surname to Aardvark-Miller, just to appear first in bibliographies...

The digital resources we make use of in our learning, work, or research can also be identified in a plethora of different ways. This paper looks at a few of them, and considers ways in which consistent and reliable identification will help to make JISC's Distributed National Electronic Resource [1] (DNER) a reality.

What are you talking about?

There is a lot of quite theoretical debate around the use of identifiers, much of which is over-specific for Ariadne. Elements of it, though, are worth repeating here to make it clearer why some of the debates are even worth having.

Perhaps the biggest issue is that of what is actually being identified by one of these identifiers, and this is most usefully illustrated with an example...

Umberto Eco's excellent The Name of the Rose is a work of fiction. Using terminology from IFLA's Functional Requirements for Bibliographic Records (FRBR) [2], it exists as a 'work'; an act of intellectual creation on the part of Umberto Eco. Identifying this work unambiguously is, perhaps, more difficult than it first appears, especially as the title by which most Ariadne readers know this work is not that which its Italian author would use.

Quickly scanning Amazon [3], it is a simple task to find four different paperback 'manifestations' of the book in English (ISBNs 0749397055, 0156003708, 0156001314, 0436204223), one 'expression' in audio (ISBN 0753104865), and a German translation (ISBN 3423105518). There's also the Sean Connery film, of course ([publisher's?] Catalogue Number 0842303), the original publication in Italian, and presumably many other versions in hardback or in other languages.

For publishers, distributors and resellers, the differentiation between these products is of vital importance. For someone wanting to read the book in English, however, the four ISBNs are a positive hindrance, and for the Renaissance Man (should such a beast still exist), differentiations between form, language and medium are doubtless an irrelevance. In this case, the commonly cited identifier refers to a particular imprint of the work, which itself remains essentially unchanged across most of these different identifiers. The identifier hasn't actually helped the end user much, and may even mislead in a number of situations. Imagine asking for ISBN 0749397055 at your local Waterstone's and being told they haven't got it. You leave, never knowing that your branch actually stocked the Harcourt version (ISBN 0156001314).
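One way round the Waterstone's problem is to record which manifestations belong to the same work, so that a request for one ISBN can be answered with another. What follows is only a minimal sketch in Python, assuming a hand-built table. The names `ENGLISH_PAPERBACKS`, `ISBN_TO_WORK` and `alternatives` are illustrative, and the shop's stock is invented; only the ISBNs themselves come from the Amazon search above.

```python
# The four English paperback ISBNs quoted above, all manifestations of one work.
WORK = "The Name of the Rose"
ENGLISH_PAPERBACKS = ("0749397055", "0156003708", "0156001314", "0436204223")

# Hypothetical lookup table from manifestation (ISBN) to work.
ISBN_TO_WORK = {isbn: WORK for isbn in ENGLISH_PAPERBACKS}


def alternatives(isbn, in_stock):
    """Return other in-stock ISBNs that carry the same work as `isbn`."""
    work = ISBN_TO_WORK.get(isbn)
    if work is None:
        return []  # unknown ISBN: nothing sensible to suggest
    return sorted(
        other
        for other, other_work in ISBN_TO_WORK.items()
        if other_work == work and other != isbn and other in in_stock
    )


# An invented branch that stocks only the Harcourt edition.
branch_stock = {"0156001314"}
print(alternatives("0749397055", branch_stock))
```

The point is not the code but the extra layer: the ISBN still identifies the manifestation, and something else has to say which work it belongs to.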

Like the ISBN, other common identifiers often fail to identify a particular work or expression. The well-known URL, for example, which is now a common feature of academic citation lists, cereal packets, and advertising hoardings, doesn't actually identify any content at all. Rather, it points to an individual computer file, on a specific computer, within a named domain. The contents of that file today may be very different from the contents of the same file on the day I cite it to you. News [4] and 'Portal' [5] sites are particularly bad for this, but even the home page of a highly respected organization [6] is unlikely to remain static for very long these days. Increasingly, therefore, passing on a URL to friends and colleagues is the equivalent of saying "You're likely to find something interesting here" rather than the possibly intended "Take a look here at this interesting fact/article/whatever".

These, and other, examples illustrate that identifiers may be applied in the identification of many different classes of object or resource. To be most useful, it's arguable that identifiers should be consistently applied (so that an ISBN always identifies a manifestation, and never a work, for example).

If I only had a brain...

A further important consideration in defining identifiers is whether or not the numbering scheme used should be 'intelligent' or 'unintelligent' [7]. Put simply, an unintelligent identifier is just a number, and is totally meaningless without reference to some central database or list. An intelligent identifier has some meaning, and can be unpacked to a certain degree. A UK telephone area code for example has some intelligence. If it starts '01' or '02', you know it's a normal telephone line. If it starts '07', you know it's a mobile phone, etc. Extending the intelligence somewhat, if the code is '01904' you know it's in York, and if it's '020 7' you know you're calling Central London.
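The unpacking of a telephone code can be sketched in a few lines of Python. This is an illustration only: it knows just the prefixes mentioned above, the function name `describe_uk_number` is my own, and the numbers passed to it are placeholders rather than real subscribers.

```python
def describe_uk_number(number):
    """Unpack the 'intelligence' in the leading digits of a UK phone number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Most specific prefixes first, so '01904' wins over the generic '01'.
    rules = [
        ("01904", "normal line, York"),
        ("0207", "normal line, Central London"),
        ("01", "normal line"),
        ("02", "normal line"),
        ("07", "mobile"),
    ]
    for prefix, meaning in rules:
        if digits.startswith(prefix):
            return meaning
    return "unknown"


print(describe_uk_number("01904 000000"))
print(describe_uk_number("020 7000 0000"))
print(describe_uk_number("07700 900123"))
```

Note that no central database is consulted; everything the function reports is read off the number itself, which is exactly what makes the identifier 'intelligent'.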

ISBNs, despite appearing pretty opaque to most of us, are actually intelligent too. The first part of an ISBN identifies the country, language or geographic area in which the book was published. The second part identifies the publisher, and the third is the number of the title itself. A final check digit completes the number.
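The first part is the easiest to unpack. The Python sketch below reads the group from the ten-digit ISBNs quoted earlier, assuming a deliberately tiny table: it lists only the single-digit groups those ISBNs need ('0' and '1' for the English-speaking area, '3' for the German-speaking area), whereas the real list is far longer and includes multi-digit groups. The names `GROUPS` and `isbn_group` are illustrative. Separating the publisher from the title number needs the agencies' range tables, so the sketch does not attempt it.

```python
# Only the groups needed for the ISBNs quoted in this article; the real
# list is much longer, and some group identifiers have several digits.
GROUPS = {
    "0": "English-speaking area",
    "1": "English-speaking area",
    "3": "German-speaking area",
}


def isbn_group(isbn):
    """Return the group named by the first digit of a ten-digit ISBN."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected a ten-character ISBN")
    return GROUPS.get(digits[0], "group not covered by this sketch")


print(isbn_group("0749397055"))  # one of the English paperbacks
print(isbn_group("3423105518"))  # the German translation
```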

On the surface, intelligent identifiers appear quite useful, as they allow the user to work things out for themselves to some extent. There are, however, problems. Firstly, things change. If a book is published by HarperCollins, its ISBN will contain an explicit reference to Collins as publisher. If the publisher sells rights for this book to a second organization, or is taken over by that organization, the ISBN should really change to reflect this. However, such a change may take several years to occur.

Secondly, intelligent identifiers generally reflect a single world view. UK telephone numbers, for example, are broken down by geography. Might it not be more useful to have all business numbers start '01' and all domestic numbers '02'? Extending this, might all banks not have a number starting '011' and plumbers '015'? Imagine if double glazing companies all had numbers starting '013', and you could program your phone to refuse their calls all by itself? That benefit alone surely has to be worth the inconvenience of restructuring the entire phone numbering system!

With ISBNs, the group identifier used to specify country, language, or geographical area isn't as useful as it sounds. The prefix '0' or '1' is given to books published in Australia, the English-speaking parts of Canada, Ireland, New Zealand, South Africa, the UK, the United States and Zimbabwe!

The use of any intelligent scheme of identification forces the classification needs of a particular community upon all those who make use of the identifier. For most users, this doesn't matter particularly (how many of you knew what you've just learned about ISBNs? I certainly didn't until I started reading for this article, and have managed to use ISBNs for years in a state of blissful ignorance). For some users, the classification is a positive boon (whilst it remains current and relevant, at least), for some a minor inconvenience, and it probably only has an adverse impact upon a relatively small proportion of users. The problems begin in earnest, though, when anyone places too much faith in what the numbers say... ("The publisher code in the ISBN tells me this book is published by HarperCollins, so it must be...").

Despite these, and other, issues with their use, it remains important to be able to identify objects in a whole range of different ways, and for a multitude of purposes. In an on-line environment such as that proposed for the UK's Distributed National Electronic Resource [1], unambiguous identification of resources from a variety of providers for a plethora of users and uses is essential. Without it, the structure of the DNER is unsustainable. The Digital Object Identifier [8], or DOI, is a clear mainstay of this strategy, certainly with respect to digitally offered bibliographic material and possibly more widely. It is for this reason that JISC has joined the International DOI Foundation [9], and looks forward to exploring the ways in which DOIs might be deployed across the range of JISC content.

The Digital Object Identifier

The DOI is both a persistent identifier, and a system which processes that identifier on the Internet to deliver services. The DOI identifies Creations (products of human imagination or endeavor in which rights may exist; intellectual property). It is not an identifier of all Internet "resources", as defined in URL, URI, etc. [10]

Formed in 1998, the International DOI Foundation (IDF) is a membership organization comprising many of the world's major publishers as well as those, such as JISC, with a significant interest in this area. The IDF works to develop "a common and well understood approach to referencing objects [and recognises that this] is essential to the evolution of services" [11].

A DOI is a wholly unique number, used to unambiguously label a piece of intellectual property. In principle, a single DOI can be resolved to multiple physical manifestations of that intellectual property, overcoming the difficulty of identifying and describing the existence of a resource at more than one location. It is also possible to associate metadata with a DOI, providing additional contextual information of use both to human users and to automated tools acting on the user's behalf. The DOI metadata structure is currently quite simple, and based upon the work of the European <indecs> project [12], but significantly increases the value of the DOI itself.

The DOI itself follows a structure similar to that of the common web URL and other Uniform Resource Identifiers (URI) [13].

 

Scheme   Prefix                                Suffix
         Directory Code   Registrant's Code
doi://   10.              1045/                december99-miller

Complete DOI: doi://10.1045/december99-miller

The Scheme, 'doi', is used to inform the browser as to how the subsequent identifier should be resolved. Standard web browsers do not understand the doi scheme by default, and need to be supplemented with the Handle System Resolver plug-in from CNRI in the United States [14]. This simple addition to the web browser makes it possible to follow DOIs in the same way as the more common URL.

The DOI Prefix is specified by the International DOI Foundation and comprises two components: a directory code and a registrant's code. The directory code identifies the naming authority, and is currently '10', denoting the DOI Foundation. The registrant's code is a number assigned by the DOI Foundation to the publisher or other body assigning identifiers. In the example, the registrant's code of '1045' identifies D-Lib Magazine, and JISC could conceivably have its own registrant's code in this way.

The final component of a DOI is the suffix, and this part is outside the direct control of the International DOI Foundation. The number making up the DOI suffix may be either intelligent or unintelligent in form. DOIs on the web site of the International DOI Foundation [9], for example, are unintelligent; their home page simply being doi://10.1000/1. Others, such as the D-Lib Magazine example above, make use of intelligent numbers to build the DOI suffix. D-Lib Magazine builds the suffix based upon the month and year of publication, combined with the surname of the first author. This suffix can also be built from formal numbering systems such as an ISBN or, as the rather daunting doi://10.1002/(SICI)1097-4571(199806)49:8<693::aid-asi4>3.0.CO;2-O from Wiley demonstrates, a SICI [15].
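The structure described above can be pulled apart mechanically. The Python sketch below assumes the `doi://` form used in this article; the names `Doi` and `parse_doi` are my own, and the splitting rules are inferred from the examples here rather than taken from the draft standard [8], so treat it as an illustration and not as a reference parser.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    scheme: str
    directory_code: str
    registrant_code: str
    suffix: str


def parse_doi(text):
    """Split a doi:// identifier into scheme, prefix components and suffix."""
    scheme, sep, rest = text.partition("://")
    if not sep or scheme != "doi":
        raise ValueError("not a doi:// identifier: %r" % text)
    # The suffix belongs to the registrant and may contain almost anything,
    # so split only on the first '/' and leave the remainder untouched.
    prefix, sep, suffix = rest.partition("/")
    if not sep or not suffix:
        raise ValueError("missing suffix: %r" % text)
    directory_code, dot, registrant_code = prefix.partition(".")
    if not dot or not registrant_code:
        raise ValueError("prefix needs a directory code and a registrant's code")
    return Doi(scheme, directory_code, registrant_code, suffix)


print(parse_doi("doi://10.1045/december99-miller"))
print(parse_doi("doi://10.1002/(SICI)1097-4571(199806)49:8<693::aid-asi4>3.0.CO;2-O"))
```

The sketch never looks inside the suffix. Whether it is 'intelligent' (a date and a surname, a SICI) or 'unintelligent' (a bare '1') is a matter for the registrant, and a parser that tried to guess would be placing exactly the kind of faith in the numbers warned against earlier.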

Despite extensive deployment of DOIs in publishers' back-end database systems, I was unable to find any web-visible examples of an ISBN being embedded in a DOI as with this SICI example, but the parallel is obvious. If Wiley had published one of the editions of The Name of the Rose discussed earlier, their DOI might simply be doi://10.1002/(ISBN)0749397055.

Resolution

Whatever their structure, DOIs are always 'resolved' with reference to an external service. When a document is assigned a DOI by a publisher, it is the responsibility of the publisher to lodge the DOI, the URL to which it points, and associated metadata with the resolution service. A user entering a DOI will have it automatically resolved, and will be pointed to the URL at which the document can be found. If the URL changes for some reason, the entry in the resolution database is simply changed and the DOI remains unaltered, ensuring a degree of permanence to the underlying intellectual content which the user is looking for.
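The indirection at the heart of resolution can be shown with a toy registry. The Python sketch below is emphatically not the Handle System: the class `ToyResolver`, its method names, the metadata fields and the example.org URLs are all invented for illustration. It shows only the principle that the publisher updates the stored URL while the DOI that users cite never changes.

```python
class ToyResolver:
    """An in-memory stand-in for a resolution service (illustrative only)."""

    def __init__(self):
        self._records = {}

    def register(self, doi, url, metadata=None):
        # The publisher lodges the DOI, its current URL and any metadata.
        self._records[doi] = {"url": url, "metadata": dict(metadata or {})}

    def update_url(self, doi, new_url):
        # The document has moved: only the resolver's entry changes.
        self._records[doi]["url"] = new_url

    def resolve(self, doi):
        # A user following the DOI is pointed at the current location.
        return self._records[doi]["url"]

    def metadata(self, doi):
        return dict(self._records[doi]["metadata"])


resolver = ToyResolver()
doi = "doi://10.1045/december99-miller"
resolver.register(doi, "http://example.org/old/miller.html", {"creator": "Paul Miller"})
print(resolver.resolve(doi))

resolver.update_url(doi, "http://example.org/new/miller.html")
print(resolver.resolve(doi))  # same DOI, new location
```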

References

  1. A document describing the Distributed National Electronic Resource (DNER) is at: http://www.jisc.ac.uk/pub99/dner_desc.html
  2. The report, Functional Requirements of Bibliographic Records, is at: http://www.ifla.org/VII/s13/frbr/frbr.htm
  3. Amazon.co.uk is at: http://www.amazon.co.uk/
  4. The BBC News site is at: http://news.bbc.co.uk/
  5. Yahoo's Portal for the UK and Ireland is at: http://uk.yahoo.com/
  6. UKOLN is at: http://www.ukoln.ac.uk/
  7. The paper, Unique Identifiers: a brief introduction by Brian Green and Mark Bide, is at: http://www.bic.org.uk/uniquid.html
  8. Draft standard Z39.84, Syntax for the Digital Object Identifier, is at: http://www.niso.org/Z3984.html, and is due for formal publication during 2000
  9. The International DOI Foundation is at: http://www.doi.org/.
    The DOI for this page is doi://10.1000/1 [note].
  10. The Digital Object Identifier is introduced at: http://www.doi.org/about_the_doi.html.
    The DOI for this page is doi://10.1000/7 [note].
  11. The paper, Digital Object Identifier: implementing a standard digital identifier as the key to effective digital rights management by Norman Paskin, is at: http://www.doi.org/doi_presentations/aprilpaper.pdf.
    The DOI for this page is doi://10.1000/174 [note].
  12. The <indecs> project is at: http://www.indecs.org/
  13. Request for Comment (RFC) 2396, Uniform Resource Identifiers (URI): Generic Syntax, is at: http://www.cis.ohio-state.edu/htbin/rfc/rfc2396.html
  14. CNRI's Handle System Resolver plug-in is at: http://www.handle.net/resolver/index.html
  15. The Serial Item and Contribution Identifier (SICI) standard is at: http://sunsite.berkeley.edu/SICI/

Note: To resolve DOIs cited in this article, you will need to download and install the Handle System Resolver plug-in for your web browser. The plug-in is currently only available for various flavours of Windows, with MacOS and Solaris versions due later in the year. Download the plug-in from http://www.handle.net/resolver/index.html.

A number of examples in this article made use of a LinkBaton to look up ISBNs for books at your favourite book provider. Visit http://www.linkbaton.com/ for more information on LinkBatons and the ways that you might soon be able to use them.

Author Details

 Paul Miller
Interoperability Focus
UKOLN
c/o Academic Services: Libraries
University of Hull
HULL
HU6 7RX
United Kingdom

Email: P.Miller@ukoln.ac.uk
Web site: www.ukoln.ac.uk/interop-focus/