Extending metadata for digital preservation
Michael Day suggests how the concept of metadata could be extended to provide
information in the specific field of digital preservation.
This article appears in the Web, and not the print, version of Ariadne.
Metadata for resource discovery and access
When the library and information community discuss metadata, the most common analogy given is the
library catalogue record. Priscilla Caplan, for example, has defined metadata as a neutral term for
cataloguing without the "excess baggage" of the Anglo-American Cataloguing Rules or the MARC formats
[1]. The best-known metadata initiative, the Dublin Core Metadata Element
Set, has the specific aim of supporting resource discovery in a network environment.
The Internet has many virtues, but it - and in particular the World Wide Web - was not designed
specifically for information retrieval. The search technologies that evolved - Archie for anonymous ftp
sites, Veronica for gopher and search engines for the Web - were developed as a response to perceived
needs rather than as an integral part of the original concept. Search engines like Digital Equipment
Corporation's Alta Vista construct indexes by means of robots which 'crawl' through the Web collecting
indexing information. This is a less than perfect solution, as Web sites typically do not contain all the
relevant information for automated indexing and, if they do, it is unlikely to be in a form identifiable to most
robots. It is possible that if Dublin Core metadata is attached to or embedded in a large number of Web
documents, a new generation of Web robots or harvesters could collect this metadata and make it available
through improved search services.
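As a rough illustration of the harvesting idea in the paragraph above, the following Python sketch pulls Dublin Core elements out of a Web page. It assumes that the metadata is embedded as HTML meta elements whose names begin with "DC." (a common convention, but not one specified in this article); the class name, function name and sample page are invented for the example.

```python
from html.parser import HTMLParser


class DublinCoreHarvester(HTMLParser):
    """Collect <meta> elements whose name starts with 'DC.' (assumed convention)."""

    def __init__(self):
        super().__init__()
        self.record = {}  # element name -> list of values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("dc.") and content:
            element = name[len("dc."):]
            self.record.setdefault(element, []).append(content)


def harvest(html_text):
    """Return a dict of Dublin Core element names to lists of values."""
    parser = DublinCoreHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.record


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<title>Extending metadata for digital preservation</title>
<meta name="DC.title" content="Extending metadata for digital preservation">
<meta name="DC.creator" content="Day, Michael">
<meta name="keywords" content="not Dublin Core, so ignored">
</head><body><p>Article text.</p></body></html>"""

record = harvest(SAMPLE_PAGE)
print(record)
```

A real harvester would also need to fetch pages, follow links and cope with malformed markup; the sketch covers only the extraction step.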
Search engines, however, are not the only solution to the Internet resource discovery problem. An
alternative approach has been to get people to select resources according to agreed criteria and then make
them available via a Web site, where they can be browsed or searched. Familiar examples of this approach
are NISS, BUBL and the Electronic Libraries (eLib) access to network resources services like SOSIG and
OMNI. Typically, these services produce descriptive metadata in some format which can then be used as a
basis for information retrieval.
Metadata formats come in a variety of shapes and sizes [2] [3]. Dublin Core is intended to be a simple structured record which can, if required, be
enhanced or mapped to more complex records [4]. Other formats in use include
the ROADS/IAFA templates used by the ROADS-based eLib access to networked resources services and
USMARC, which is used by the Intercat project led by OCLC. Dublin Core may have a role as a minimal
metadata set in order to allow for interoperability between other, more complex, metadata formats.
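The idea of mapping a simple record to a more complex one can be sketched in a few lines of Python. The table below is a simplified, partial and purely illustrative mapping from Dublin Core element names to MARC tag and subfield pairs; it is an assumption made for this example, not an authoritative crosswalk, and the function name is invented.

```python
# Illustrative, partial mapping from Dublin Core element names to MARC
# (tag, subfield) pairs. Simplified for the example; not an authoritative crosswalk.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "identifier": ("856", "u"),
}


def map_record(dc_record):
    """Return (marc_fields, unmapped) for a dict of Dublin Core element -> value."""
    marc_fields = []
    unmapped = {}
    for element, value in dc_record.items():
        target = DC_TO_MARC.get(element.lower())
        if target is None:
            unmapped[element] = value  # keep rather than silently drop
        else:
            tag, subfield = target
            marc_fields.append((tag, subfield, value))
    return sorted(marc_fields), unmapped


# Hypothetical input record.
dc_record = {
    "title": "Extending metadata for digital preservation",
    "creator": "Day, Michael",
    "rights": "Copyright Ariadne/original authors",
}
marc_fields, unmapped = map_record(dc_record)
print(marc_fields)
print(unmapped)
```

Returning the unmapped elements separately makes any loss of information visible, which matters when a minimal set is used as a bridge between richer formats.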
Library and information professionals with an interest in the Internet should be interested in the
development of metadata for resource discovery. Clifford Lynch points out that if the Internet is to continue
to thrive as a new means of communication, "something very much like traditional library services will be
needed to organize, access and preserve networked information" [5]. Metadata
does have an acknowledged role in the organisation of and access to networked information, but it could
additionally be important in the general area of digital preservation.
Digital preservation
The preservation of digital information can be seen as one of the greatest challenges for the library and
information professions at the end of the twentieth century [6] [7].
It is easy to get caught up in the general enthusiasm for digital libraries but
more consideration needs to be given to the problem of making this information available to future
generations. Useful work has been published by the US Commission on Preservation and Access (CPA),
including an important report by a Task Force on the Archiving of Digital Information (TFADI) jointly
commissioned with the Research Libraries Group [8].
Libraries have traditionally understood at least one of their roles as the preservation of information for
future use. This has been especially true of national libraries and selected institutions in the research library
sector. For example, in early 1996 the British Library proposed that legal deposit should be extended to
include non-print materials, even allowing for the possibility of collecting networked (on-line) publications
if this became technically and economically feasible [9]. Most countries in Europe,
North America and Australasia have either extended legal deposit to electronic publications or are
considering doing so. This implies a commitment, by national libraries at least, to the long-term
preservation of digital publications.
The main problem with digital preservation is that digital technology, in comparison to print, is an
extremely fragile medium for the cultural memory of the world [8]. The most
commonly given example of this fragility is the 1960 United States Census, where raw data stored on
magnetic tapes apparently became obsolete and, to all intents and purposes, unreadable by the late
nineteen-seventies [10]. Digital information has two main weaknesses:
- The storage medium - digital storage media, whether magnetic or optical, are subject to relatively
rapid decay, especially when compared with print.
- The hardware and software - digital information is machine-dependent, and to be 'read' accurately it
needs specific computer hardware and software. Unfortunately, hardware and software quickly become
obsolescent or otherwise unusable [11].
Proposed solutions to these problems usually involve periodic 'refreshing' or recopying of the digital
information onto new media and the occasional 'migration' of data into new formats. Assuming that some
answer can be found to these problems, there remains the important issue of intellectual preservation. Even
when digital information has migrated into new formats, there will remain a need for users to be sure that
the 'document' they are looking at is the one that they were looking for [12].
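One way of checking that a refreshed copy is still the same 'document' is to record a digest of the data before copying and compare it afterwards. The Python sketch below assumes a SHA-256 digest and simulates the new medium with an in-memory copy; neither the technique nor the function names come from the sources cited here.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a hexadecimal SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()


def refresh(data: bytes, recorded_digest: str) -> bytes:
    """Copy data to 'new media' (here, a new bytes object) and verify the copy."""
    copy = bytes(bytearray(data))
    if digest(copy) != recorded_digest:
        raise ValueError("copy does not match the recorded digest")
    return copy


# Hypothetical content standing in for a stored file.
original = b"raw data read from an ageing magnetic tape"
recorded = digest(original)

copy = refresh(original, recorded)
print(copy == original)

try:
    refresh(original + b" (altered)", recorded)
except ValueError as error:
    print("rejected:", error)
```

A digest of this kind only confirms that the bits are unchanged, so it suits refreshing. Migration to a new format changes the bits by design, so a fresh digest would have to be recorded after each migration, and the digest says nothing about whether the content survived the conversion.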
The archives community - especially in the United States - has been addressing these issues for some
time now. This has partly been led by the need for electronic records to be accepted as legal evidence
[13], but the fact that electronic records are increasingly created in a variety of
important situations (government, health care, business transactions, etc.) has also resulted in a renewed interest
in authentication and validation issues. A research project at the University of Pittsburgh School of Library
and Information Science has been investigating "Functional Requirements for Recordkeeping" and has
attempted to identify and specify the fundamental properties of records [14]. The
functional requirements identified by the project emphasised that records should be comprehensive,
identifiable, complete, authorised, preserved, removable, exportable, accessible and redactable (Ibid., Table
1). The project suggested that records (or 'Business Acceptable Communications') should carry a six-layer
structure of metadata which would contain not only a 'Handle Layer' (including a unique identifier and
resource discovery metadata) but also very detailed information on terms & conditions of use, data
structure, provenance, content and the use of the record after its creation. The metadata is intended to carry
all the necessary information that would allow the record to be used - even when the "individuals, computer
systems and even information standards under which it was created have ceased to be" (Ibid.).
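The six-layer structure described above could be modelled, very loosely, as a record with one slot per layer. In the Python sketch below the field names paraphrase the wording of this article rather than the project's own specification, and the class, the helper function and the sample values are all invented for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RecordMetadata:
    """One slot per layer; field names follow this article's wording (assumed)."""

    handle: dict = field(default_factory=dict)  # unique identifier, discovery metadata
    terms_and_conditions: dict = field(default_factory=dict)
    data_structure: dict = field(default_factory=dict)
    provenance: dict = field(default_factory=dict)
    content: dict = field(default_factory=dict)
    use_history: dict = field(default_factory=dict)  # use after creation


def missing_layers(metadata):
    """List the layers that carry no information at all."""
    return [f.name for f in fields(metadata) if not getattr(metadata, f.name)]


# Hypothetical record with only two layers filled in.
metadata = RecordMetadata(
    handle={"identifier": "example-0001"},
    content={"title": "Minutes of a meeting"},
)
print(missing_layers(metadata))
```

A check of this kind shows at a glance which layers are still empty, although whether a record is adequately described is a matter for the functional requirements themselves.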
Extending metadata for digital preservation
As the archives community are seriously considering using metadata to ensure the integrity and longevity of
records, it is worth investigating whether a similar approach would be useful for digital preservation
in a library context - and in particular for networked documents. Resource discovery metadata like Dublin
Core already contain relevant elements like FORM or RIGHTS which can be used to give basic details
about the technical or legal context of a document, but these would need to be extended so that future systems
would know exactly how to accurately interpret the document itself, or to migrate the data to a non-obsolete
format. Where complex documents or publications are concerned, there may be some future in
investigating Jeff Rothenberg's concept of 'encapsulating' data together with all application and system
software required to access it and a description of the original hardware environment
[15]. Text-only 'bootstrap standard' metadata would then be attached to the data,
providing contextual information and an explanation of how to decode the record itself.
Rothenberg envisages that future computer systems could use this information to emulate the software so
that a document can be seen as closely as possible in its original context. This sort of approach, if
technically feasible, might be useful for the preservation of multi-media publications or for hyper-textual
documents with all links maintained.
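The general shape of such an 'encapsulated' object can be suggested with a toy example in Python. The envelope format below (plain-text "key: value" lines, a marker line, then the data in base64) is entirely hypothetical and is not Rothenberg's design. It covers only the text-only description, not the bundled software or the description of the hardware environment.

```python
import base64

# Hypothetical marker separating the readable description from the payload.
HEADER_END = "---END-OF-DESCRIPTION---"


def encapsulate(payload: bytes, description: dict) -> str:
    """Wrap binary data in a plain-text envelope that explains how to decode it."""
    lines = [f"{key}: {value}" for key, value in description.items()]
    lines.append("payload-encoding: base64")
    lines.append(HEADER_END)
    lines.append(base64.b64encode(payload).decode("ascii"))
    return "\n".join(lines)


def open_capsule(capsule: str):
    """Read the description first, then use it to decode the payload."""
    header_text, _, body = capsule.partition("\n" + HEADER_END + "\n")
    description = {}
    for line in header_text.splitlines():
        key, _, value = line.partition(": ")
        description[key] = value
    if description.get("payload-encoding") != "base64":
        raise ValueError("unknown payload encoding")
    return description, base64.b64decode(body)


# Hypothetical document and description.
capsule = encapsulate(
    b"\x00\x01binary document",
    {"format": "hypothetical word-processor format",
     "decode-with": "hypothetical viewer, version 1"},
)
description, payload = open_capsule(capsule)
print(description["format"])
print(payload)
```

The point of the design is that the description can be read by a person or by a very simple program before anything is known about the payload. The sketch assumes single-line description values.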
Conclusion
Preservation metadata may, therefore, have a useful role in helping ensure that digital information will be
available to future generations. Several important questions remain:
- Who will define what preservation metadata are needed?
- Who will decide what needs to be preserved?
- Who will archive the preserved information?
- Who will create the metadata?
- Who will pay for it?
If a metadata approach to digital preservation is an appropriate way to proceed, these - and other - questions
will have to be seriously considered.
References
[1] Caplan, P., 1995, You call it corn, we call it syntax-independent metadata for
document-like objects. The Public-Access Computer Systems Review, 6 (4), 19-23,
http://info.lib.uh.edu/pr/v6/n4/capl6n4.html
[2] Heery, R., 1996, Review of metadata formats. Program, 30 (4), 345-373,
http://www.ukoln.ac.uk/metadata/review.html
[3] Dempsey, L., 1996, ROADS to Desire: some UK and other European metadata and
resource discovery projects. D-Lib Magazine, July/August,
http://hosted.ukoln.ac.uk/mirrored/dlib/dlib/dlib/july96/07dempsey.html
[4] Weibel, S., et al., 1995, OCLC/NCSA Metadata Workshop Report,
http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html
[5] Lynch, C., 1997, Searching the Internet. Scientific American, 276 (3), March, 44-48.
[6] Rothenberg, J., 1995, Ensuring the longevity of digital documents. Scientific
American, 272 (1), January, 24-29.
[7] Day, M.W., 1997, Preservation of electronic information: a bibliography,
http://www.ukoln.ac.uk/~lismd/preservation.html
[8] Task Force on the Archiving of Digital Information, 1996, Preserving digital
information: report of the Task Force on Archiving of Digital Information commissioned by the
Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.:
Commission on Preservation and Access,
http://www.rlg.org/ArchTF/
[9] British Library Research and Innovation Centre, 1996, Proposal for the legal
deposit of non-print publications. London: British Library,
http://portico.bl.uk/ric/legal/legalpro.html
[10] Weinberg, G.L., 1995, The end of Ranke's history? Reflections on the fate of
history in the twentieth century. In: Weinberg, G.L., Germany, Hitler, and World War II: essays in modern
German and World history. Cambridge: Cambridge University Press, 325-336.
[11] Mallinson, J.C., 1988, On the preservation of human- and machine-readable
records. Information Technology and Libraries, 7, 19-23.
[12] Graham, P.S., 1994, Intellectual preservation: electronic preservation of the
third kind. Washington, D.C.: Commission on Preservation and Access,
http://www-cpa.stanford.edu/cpa/reports/graham/intpres.html
[13] Piasecki, S.J., 1995, Legal admissibility of electronic records as evidence and
implications for records management. American Archivist, 58 (1), 54-64.
[14] Bearman, D. and Sochats, K., 1996, Metadata requirements for evidence.
Pittsburgh, Penn.: Archives and Museum Informatics,
http://www.lis.pitt.edu/~nhprc/BACartic.html
[15] Rothenberg, J., 1996, Metadata to support data quality and longevity,
http://www.computer.org/conferen/meta96/rothenberg_paper/ieee.data-quality.html
Author Details
Michael Day,
Metadata Research Officer,
UKOLN
Email: M.Day@ukoln.ac.uk
Tel: 01225 323923
Fax: 01225 826838
Address: UKOLN, University of Bath, Bath, BA2 7AY

Material on this page is copyright
Ariadne/original authors.
This article last updated/links checked on 18-May-1997