Web Magazine for Information Professionals

Extending Metadata for Digital Preservation

Michael Day suggests how the concept of metadata could be extended to provide information in the specific field of digital preservation.

Metadata for resource discovery and access

When the library and information community discusses metadata, the most common analogy given is the library catalogue record. Priscilla Caplan, for example, has defined metadata as a neutral term for cataloguing without the “excess baggage” of the Anglo-American Cataloguing Rules or the MARC formats [1]. The best-known metadata initiative, the Dublin Core Metadata Element Set, has the specific aim of supporting resource discovery in a network environment.

The Internet has many virtues, but it - and in particular the World Wide Web - was not designed specifically for information retrieval. The search technologies that evolved - Archie for anonymous ftp sites, Veronica for gopher and search engines for the Web - were developed as a response to perceived needs rather than as an integral part of the original concept. Search engines like Digital Equipment Corporation’s Alta Vista construct indexes by means of robots which ‘crawl’ through the Web collecting indexing information. This is a less than perfect solution, as Web sites typically do not contain all the information relevant to automated indexing and, if they do, it is unlikely to be in a form identifiable to most robots. It is possible that if Dublin Core metadata is attached to or embedded in a large number of Web documents, a new generation of Web robots or harvesters could collect this metadata and make it available through improved search services.
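The harvesting idea described above can be illustrated with a minimal sketch. It assumes one possible embedding convention, in which each Dublin Core element appears in an HTML meta tag whose name begins with "DC."; the class name, the function name and the sample page are hypothetical and are given only for illustration.

```python
from html.parser import HTMLParser


class DublinCoreHarvester(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        # Assumed convention: Dublin Core elements carry a "DC." prefix.
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            # An element may be repeated, so values are kept in a list.
            self.elements.setdefault(element, []).append(content)


def harvest(html_text):
    """Return a dict mapping element names to lists of values."""
    parser = DublinCoreHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.elements


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<title>Extending Metadata for Digital Preservation</title>
<meta name="DC.title" content="Extending Metadata for Digital Preservation">
<meta name="DC.creator" content="Michael Day">
<meta name="keywords" content="not Dublin Core, so ignored">
</head><body></body></html>"""

record = harvest(SAMPLE_PAGE)
```

A real harvester would also have to fetch pages, respect robot exclusion rules and cope with badly formed markup; none of that is attempted here.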

Search engines, however, are not the only solution to the Internet resource discovery problem. An alternative approach has been to get people to select resources according to agreed criteria and then make them available via a Web site, where they can be browsed or searched. Familiar examples of this approach are NISS, BUBL and the Electronic Libraries (eLib) access to networked resources services like SOSIG and OMNI. Typically, these services produce descriptive metadata in some format which can then be used as a basis for information retrieval.

Metadata formats come in a variety of shapes and sizes [2] [3]. Dublin Core is intended to be a simple structured record which can, if required, be enhanced or mapped to more complex records [4]. Other formats in use include the ROADS/IAFA templates used by the ROADS-based eLib access to networked resources services and USMARC, which is used by the Intercat project led by OCLC. Dublin Core may have a role as a minimal metadata set in order to allow for interoperability between other, more complex, metadata formats.

Library and information professionals with an interest in the Internet should be interested in the development of metadata for resource discovery. Clifford Lynch points out that if the Internet is to continue to thrive as a new means of communication, “something very much like traditional library services will be needed to organize, access and preserve networked information” [5]. Metadata does have an acknowledged role in the organisation of and access to networked information, but it could additionally be important in the general area of digital preservation.

Digital preservation

The preservation of digital information can be seen as one of the greatest challenges for the library and information professions at the end of the twentieth century [6] [7]. It is easy to get caught up in the general enthusiasm for digital libraries, but more consideration needs to be given to the problem of making this information available to future generations. Useful work has been published by the US Commission on Preservation and Access (CPA), including an important report by a Task Force on the Archiving of Digital Information (TFADI) jointly commissioned with the Research Libraries Group [8].

Libraries have traditionally understood at least one of their roles as the preservation of information for future use. This has been especially true of national libraries and selected institutions in the research library sector. For example, in early 1996 the British Library proposed that legal deposit should be extended to include non-print materials, even allowing for the possibility of collecting networked (on-line) publications if this became technically and economically feasible [9]. Most countries in Europe, North America and Australasia have either extended legal deposit to electronic publications or are considering doing so. This implies a commitment, by national libraries at least, to the long-term preservation of digital publications.

The main problem with digital preservation is that digital technology, in comparison to print, is an extremely fragile medium for the cultural memory of the world [8]. The most commonly given example of this fragility is the 1960 United States Census, where raw data stored on magnetic tapes apparently became obsolete and, to all intents and purposes, unreadable by the late nineteen-seventies [10]. Digital information has two main weaknesses:

  1. The storage medium - digital storage media, whether magnetic or optical, are subject to relatively rapid decay, especially when compared with print.
  2. The hardware and software - digital information is machine-dependent, and to be ‘read’ accurately it needs specific computer hardware and software. Unfortunately, hardware and software quickly become obsolescent or otherwise unusable [11].

Proposed solutions to these problems usually involve periodic ‘refreshing’ or recopying of the digital information onto new media and the occasional ‘migration’ of data into new formats. Assuming that some answer can be found to these problems, there remains the important issue of intellectual preservation. Even when digital information has migrated into new formats, there will remain a need for users to be sure that the ‘document’ they are looking at is the one that they were looking for [12].
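One way to picture the checking that intellectual preservation calls for is a recorded message digest. The sketch below is an assumption made for illustration rather than a method proposed in the works cited: it stores a SHA-256 digest alongside a document and compares it after the document has been ‘refreshed’ onto new media. The function names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a document as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def verify_copy(original_digest: str, copy: bytes) -> bool:
    """Check that a copy is bit-for-bit identical to the fingerprinted original."""
    return fingerprint(copy) == original_digest


document = b"Preserving digital information"
recorded = fingerprint(document)      # stored as part of the metadata

refreshed_copy = bytes(document)      # a faithful copy onto new media
damaged_copy = document.replace(b"digital", b"digitel")

refreshed_ok = verify_copy(recorded, refreshed_copy)
damaged_ok = verify_copy(recorded, damaged_copy)
```

A digest of this kind only confirms bit-level identity, so it suits refreshing. Migration changes the bytes by design, and would require a new digest to be recorded together with some account of the migration itself.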

The archives community - especially in the United States - has been addressing these issues for some time. This has partly been driven by the need for electronic records to be accepted as legal evidence [13], but the fact that electronic records are increasingly created in a variety of important contexts (government, health care, business transactions, etc.) has also resulted in a renewed interest in authentication and validation issues. A research project at the University of Pittsburgh School of Library and Information Science has been investigating “Functional Requirements for Recordkeeping” and has attempted to identify and specify the fundamental properties of records [14]. The functional requirements identified by the project emphasised that records should be comprehensive, identifiable, complete, authorised, preserved, removable, exportable, accessible and redactable (Ibid. Table 1). The project suggested that records (or ‘Business Acceptable Communications’) should carry a six-layer structure of metadata which would contain not only a ‘Handle Layer’ (including a unique identifier and resource discovery metadata) but also very detailed information on terms and conditions of use, data structure, provenance, content and the use of the record after its creation. The metadata is intended to carry all the necessary information that would allow the record to be used - even when the “individuals, computer systems and even information standards under which it was created have ceased to be” (Ibid.).
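The six layers described above can be pictured as a simple data structure. The following is a sketch only: apart from the ‘Handle Layer’, the field names are hypothetical labels derived from the description in the text and are not taken from the Pittsburgh specification, and the sample values are invented.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RecordMetadata:
    """Six metadata layers carried with a record (field names are illustrative)."""

    handle: dict = field(default_factory=dict)                # identifier and discovery metadata
    terms_and_conditions: dict = field(default_factory=dict)  # conditions of use
    structure: dict = field(default_factory=dict)             # data structure
    provenance: dict = field(default_factory=dict)            # origin of the record
    content: dict = field(default_factory=dict)               # the content layer
    use_history: list = field(default_factory=list)           # use after creation

    def empty_layers(self):
        """Name the layers that carry no information yet, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical record with only some layers filled in.
record = RecordMetadata(
    handle={"identifier": "example-0001", "title": "Minutes of meeting"},
    structure={"format": "plain text"},
)
record.use_history.append({"event": "viewed"})

incomplete = record.empty_layers()
```

A check such as `empty_layers` hints at how a system might refuse to accept a record whose metadata is incomplete, although the actual requirements are considerably more detailed.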

Extending metadata for digital preservation

As the archives community are seriously considering using metadata to ensure the integrity and longevity of records, it might be useful to investigate whether a similar approach would be useful for digital preservation in a library context - and in particular for networked documents. Resource discovery metadata like Dublin Core already contain relevant elements like FORM or RIGHTS which can be used to give basic details about the technical or legal context of a document, but this would need to be extended so that future systems would know exactly how to interpret the document itself accurately, or how to migrate the data to a non-obsolete format. Where complex documents or publications are concerned, there may be some future in investigating Jeff Rothenberg’s concept of ‘encapsulating’ data together with all application and system software required to access it and a description of the original hardware environment [15]. Text-only ‘bootstrap standard’ metadata would then be attached to the data, providing contextual information and an explanation of how to decode the record itself. Rothenberg envisages that future computer systems could use this information to emulate the software so that a document can be seen in as close as possible to its original context. This sort of approach, if technically feasible, might be useful for the preservation of multi-media publications or for hyper-textual documents with all links maintained.
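The idea of attaching text-only metadata that explains how to decode the data can be sketched very roughly as follows. This is not Rothenberg’s design: the layout (plain "Key: value" lines, a blank line, then a base64 payload) and the function names are assumptions made for illustration, and the sketch ignores the encapsulated software and hardware description altogether.

```python
import base64


def encapsulate(payload: bytes, description: dict) -> str:
    """Wrap a payload in a human-readable header followed by base64 text."""
    header = [f"{key}: {value}" for key, value in description.items()]
    # The header states how the payload itself has been encoded.
    header.append("Encoding: base64")
    body = base64.b64encode(payload).decode("ascii")
    return "\n".join(header) + "\n\n" + body + "\n"


def decapsulate(package: str):
    """Recover the header fields and the original payload bytes."""
    header_text, body = package.split("\n\n", 1)
    description = dict(line.split(": ", 1) for line in header_text.splitlines())
    return description, base64.b64decode(body)


# Hypothetical description of the original environment.
package = encapsulate(
    b"\x00\x01 binary document data",
    {"Title": "Annual report", "Format": "Example word processor 2.0"},
)
description, payload = decapsulate(package)
```

Because the header is plain text, a future reader without the original software could at least read the description and learn what would be needed to interpret the payload.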

Conclusion

Preservation metadata may, therefore, have a useful role in helping ensure that digital information will be available to future generations. Several important questions remain:

If a metadata approach to digital preservation is an appropriate way to proceed, these - and other - questions will have to be seriously considered.

References

  1. Caplan, P., 1995, You call it corn, we call it syntax-independent metadata for document-like objects. The Public-Access Computer Systems Review, 6 (4), 19-23,
    http://info.lib.uh.edu/pr/v6/n4/capl6n4.html
  2. Heery, R., 1996, Review of metadata formats. Program, 30 (4), 345-373,
    http://www.ukoln.ac.uk/metadata/review.html
  3. Dempsey, L., 1996, ROADS to Desire: some UK and other European metadata and resource discovery projects. D-Lib Magazine, July/August,
    http://www.dlib.org/dlib/july96/07dempsey.html
  4. Weibel, S., et al., 1995, OCLC/NCSA Metadata Workshop Report,
    http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html
  5. Lynch, C., 1997, Searching the Internet. Scientific American, 276 (3), March, 44-48.
  6. Rothenberg, J., 1995, Ensuring the longevity of digital documents. Scientific American, 272 (1), January, 24-29.
  7. Day, M.W., 1997, Preservation of electronic information: a bibliography,
    http://www.ukoln.ac.uk/~lismd/preservation.html
  8. Task Force on the Archiving of Digital Information, 1996, Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access,
    http://www.rlg.org/ArchTF/
  9. British Library Research and Innovation Centre, 1996, Proposal for the legal deposit of non-print publications. London: British Library,
    http://portico.bl.uk/ric/legal/legalpro.html
  10. Weinberg, G.L., 1995, The end of Ranke’s history? Reflections on the fate of history in the twentieth century. In: Weinberg, G.L., Germany, Hitler, and World War II: essays in modern German and World history. Cambridge: Cambridge University Press, 325-336.
  11. Mallinson, J.C., 1988, On the preservation of human- and machine-readable records. Information Technology and Libraries, 7, 19-23.
  12. Graham, P.S., 1994, Intellectual preservation: electronic preservation of the third kind. Washington D.C.: Commission on Preservation and Access,
    http://www-cpa.stanford.edu/cpa/reports/graham/intpres.html
  13. Piasecki, S.J., 1995, Legal admissibility of electronic records as evidence and implications for records management. American Archivist, 58 (1), 54-64.
  14. Bearman, D. and Sochats, K., 1996, Metadata requirements for evidence. Pittsburgh, Penn.: Archives and Museum Informatics,
    http://www.lis.pitt.edu/~nhprc/BACartic.html
  15. Rothenberg, J., 1996, Metadata to support data quality and longevity,
    http://www.computer.org/conferen/meta96/rothenberg_paper/ieee.data-quality.html

Author Details

Michael Day,
Metadata Research Officer,
UKOLN
Email: M.Day@ukoln.ac.uk
Tel: 01225 323923
Fax: 01225 826838
Address: UKOLN, University of Bath, Bath, BA2 7AY