Metadata for Digital Preservation: An Update
In May 1997, the present author produced a short article for this column entitled "Extending metadata for digital preservation". The article introduced the idea of using metadata-based methods as a means of helping to manage the process of preserving digital information objects. At the time the article was first published, the term 'metadata' was just beginning to be used by the library and information community (and others) to describe 'data about data' that could be used for resource discovery. For example, the best-known metadata initiative was (and remains) the Dublin Core Metadata Initiative, initially concerned with defining a core metadata element set for Internet resource discovery. It is now widely accepted that identifying and recording appropriate metadata is a key part of any strategy for preserving digital information objects.
This brief update reports on a number of more recent initiatives relevant to preservation metadata, looking specifically at currently proposed digital preservation strategies and the development of recordkeeping metadata schemes. It will also introduce the Open Archival Information System (OAIS) reference model that is beginning to influence a number of digital preservation projects. This review of activities is partly based on a review of preservation metadata initiatives carried out for the Cedars project in the summer of 1998, but it has been updated to include reference to additional projects and standards.
Digital preservation strategies and metadata
If one ignores the technology preservation option, there are currently two main proposed strategies for long-term digital preservation: first the emulation of original hardware, operating systems and software; and secondly the periodic migration of digital information from one generation of computer technology to a subsequent one.
Emulation strategies are based on the premise that the best way to preserve the functionality and 'look-and-feel' of digital information objects is to preserve them together with their original software so that they can be run on emulators that can mimic the behaviour of obsolete hardware and operating systems. Emulation strategies would involve encapsulating a data object together with the application software used to create or interpret it and a description of the required hardware environment - i.e., a specification for an emulator. It is suggested that these emulator specification formalisms will require human readable annotations and explanations (metadata). Jeff Rothenberg says, for example, that the emulation approach requires "the development of an annotation scheme that can save ... explanations [of how to open an encapsulation] in a form that will remain human-readable, along with metadata which provide the historical, evidential and administrative context for preserving digital documents".
Migration - the periodic migration of digital information from one generation of computer technology to a subsequent one - is currently the most tried-and-tested preservation strategy. However, as Seamus Ross points out, data migration inevitably leads to some losses in functionality, accuracy, integrity and usability. In some contexts, this is likely to be important. David Bearman, for example, has pointed out that if electronic records are migrated to new software environments, "content, structure and context information must be linked to software functionality that preserves their executable connections". If this, however, cannot be done, he suggests that "representations of their relations must enable humans to reconstruct the relations that pertained in the original software environment". Successful migration strategies will, therefore, depend upon metadata being created to record the migration history of a digital object and to record contextual information, so that future users can either reconstruct or - at the very least - begin to understand the technological environment in which a particular digital object was created.
There is currently a debate about the relative merits of emulation and migration strategies. Rothenberg, for example, claims that migration has little to recommend it and calls it "an approach based on wishful thinking". He criticises the approach because he feels that it is impossible to predict exactly what will happen and because the approach is labour-intensive and expensive.
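The kind of migration history described above can be illustrated with a minimal sketch. It is not drawn from any of the schemes discussed in this column: the class names, field names and example formats are all hypothetical, and a real scheme would need far richer contextual information.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class MigrationEvent:
    """One migration step. All field names are hypothetical."""
    performed_on: date
    source_format: str
    target_format: str
    known_losses: List[str] = field(default_factory=list)


@dataclass
class PreservedObject:
    """A digital object together with its contextual and migration metadata."""
    identifier: str
    original_format: str
    original_environment: str  # free-text note on the creating hardware/software
    history: List[MigrationEvent] = field(default_factory=list)

    def record_migration(self, event: MigrationEvent) -> None:
        # Events are only ever appended, so the full history stays available.
        self.history.append(event)

    def current_format(self) -> str:
        # The format after the latest migration, or the original if never migrated.
        return self.history[-1].target_format if self.history else self.original_format

    def all_known_losses(self) -> List[str]:
        # Every loss recorded across all migrations, in chronological order.
        return [loss for event in self.history for loss in event.known_losses]


# Illustrative data only.
obj = PreservedObject("report-001", "WordPerfect 5.1", "MS-DOS 5.0 on an Intel 386 PC")
obj.record_migration(
    MigrationEvent(date(1995, 3, 1), "WordPerfect 5.1", "RTF", ["footnote numbering style"])
)
obj.record_migration(MigrationEvent(date(1999, 6, 1), "RTF", "PDF", ["editable text structure"]))
```

Because each event records what was lost as well as what was changed, a future user can at least see which aspects of the original object can no longer be recovered from the current copy.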
In the absence of any alternative, a migration strategy may be better than no strategy at all; however, to the extent that it provides merely the illusion of a solution, it may in some cases actually be worse than nothing. In the long run, migration promises to be expensive, error-prone, at most partially successful, and ultimately infeasible.
Bearman questions the basis of this opinion and opines that Rothenberg is mistaken because he assumes that what needs to be preserved is the information system itself, rather than that which the system produces. By this he means capturing "all transactions entering and leaving the system when they are created, ensuring that the original context of their creation and content is documented, and that the requirements of evidence are preserved over time". In any case, Bearman argues that the emulation approach is itself extremely complicated.
Rothenberg's proposal does not even try to define the elements of metadata specifications that would be required for the almost unimaginably complex task of emulating proprietary application software of another era, running on, and in conjunction with, application interface programs from numerous sources, on operating systems that are obsolete, and in hardware environments that are proprietary and obsolete.
The debate is likely to continue for at least as long as it takes to test the emulation approach. For example, some work is being carried out into the use of emulation techniques for preservation as part of the JISC/NSF funded Cedars 2 project and as part of the European Union-funded NEDLIB project.
Regardless of whether emulation-based or migration-based preservation strategies are adopted - and it is likely that both will have some role - the long-term preservation of digital information will involve the creation and maintenance of metadata. Clifford Lynch describes the function of some of this as follows:
Within an archive, metadata accompanies and makes reference to each digital object and provides associated descriptive, structural, administrative, rights management, and other kinds of information. This metadata will also be maintained and will be migrated from format to format and standard to standard, independently of the base object it describes.
Preservation metadata has, therefore, become a popular area for research and development in the archive and library communities. Archivists and records managers have concentrated on the development of recordkeeping metadata, while other groups have dealt with defining metadata specifications for particular needs. For example, the library and information community has initiated some important work:
- The Research Libraries Group - a Working Group on Preservation Issues of Metadata commissioned by the Research Libraries Group (RLG) defined the semantics of sixteen metadata elements that would be able to serve the preservation requirements of digital images.
- The National Library of Australia - a team based at the National Library of Australia (NLA) developed a logical data model (based on entity-relationship modelling) to help identify the particular entities (and their associated metadata) that needed to be supported within its PANDORA (Preserving and Accessing Networked DOcumentary Resources of Australia) proof-of-concept archive. This model has been revised recently for use within the NLA's Digital Services Project.
Archivists and "Recordkeeping metadata"
As has been mentioned, the archives and records management communities form another group with a keen interest in long-term digital preservation. Traditional approaches to the archival management of records and archives tended to be based on physical records being transferred into the physical custody of an archival repository at the end of their active life-cycle. The growing existence of digital records, however, has resulted in a widespread reassessment of archival theory and practice. For example, in the digital environment, it is no longer sufficient for archivists to make decisions about the retention or disposal of records at the end of their active life. By that time it may be too late to ensure their preservation in any useful form. Greg O'Shea of the National Archives of Australia has commented that the ideal time for archivists' attention to be given to digital records "is as part of the systems development process at the point systems are being established or upgraded, i.e. even before the records are created". Some archivists, particularly Australian ones, have begun to shift attention from the traditional 'life-cycle' approach to records and have started to develop archival management practices based on the concept of a 'records continuum'.
A continuum approach to records means that a major change in the understanding of archival description (or metadata) is required. Sue McKemmish and Dagmar Parer have summarised what this means:
If archival description is defined as the post-transfer process of establishing intellectual control over archival holdings by preparing descriptions of the records, then those descriptions essentially function as cataloguing records, surrogates whose primary purpose is to help researchers find relevant records. In the continuum, archival description is instead envisaged as part of a complex series of recordkeeping processes involving the attribution of authoritative metadata from the time of records creation.
This metadata is commonly known as 'recordkeeping metadata', or "any type of data that helps us manage records and make sense of their data content". McKemmish and Parer have definitively expressed the concept as being "standardised information about the identity, authenticity, content, structure, context and essential management requirements of records".
A variety of research projects and practically-based initiatives have been concerned with the development of recordkeeping metadata schemes and standards:
- The Pittsburgh Project - one of the earliest archive-based research projects that introduced the concept of metadata for recordkeeping was the University of Pittsburgh's Functional Requirements for Evidence in Recordkeeping project, a project funded by the US National Historical Publications and Records Commission. As part of this project, the project team developed what they called a metadata specification for evidence based on a model known as the Reference Model for Business Acceptable Communications (BAC). The project proposed that digital records should carry a six layer structure of metadata which would contain a 'Handle Layer' that would include a unique identifier and resource discovery metadata, but also very detailed information on terms and conditions of use, data structure, provenance, content and on the use of the record after its creation. This metadata is intended to carry all the necessary information that would allow the record to be used - even when the "individuals, computer systems and even information standards under which it was created have ceased to be".
- The UBC Project - at approximately the same time as the Pittsburgh Project was developing its functional requirements for recordkeeping, another North American project based at the University of British Columbia (UBC) was investigating "The Preservation of the Integrity of Electronic Records". The methodological approach of the UBC project was to determine whether the general premises about the nature of records in diplomatics and traditional archival science were relevant and useful in an electronic environment. The UBC project researchers were primarily interested in preserving the integrity of records, as defined through the concepts of completeness, reliability (i.e. the authority and trustworthiness of a record as evidence) and authenticity (i.e. ensuring that the document is what it claims to be). The project also developed a set of eight templates that were intended to help identify the necessary and sufficient components of records in all recordkeeping environments. These templates were a form of metadata scheme. In a further attempt to identify a core set of recordkeeping metadata elements, Barbara Reed of Monash University has produced a mapping between the Pittsburgh BAC model, the UBC templates and the 'registration' section of part 4 of AS 4390 Australian Standard on Records Management.
- The National Archives of Australia - a detailed specification for recordkeeping metadata was published by the National Archives of Australia as the Recordkeeping Metadata Standard for Commonwealth Agencies in May 1999.
- The SPIRT Recordkeeping Metadata Project - this was an Australian project concerned with developing a framework for standardising and defining recordkeeping metadata. The project, amongst other things, attempted to specify and standardise the range of recordkeeping metadata that would be required to manage records in digital environments. It was also concerned with supporting interoperability with generic metadata standards like the Dublin Core and relevant information locator schemes like the Australian Government Locator Service (AGLS). The project developed a framework for standardising and defining recordkeeping metadata, a metadata specification known as the SPIRT Recordkeeping Metadata Scheme (RKMS) and a conceptual mapping against the AGLS standard and other schemes.
- The Netherlands State Archives - Jeff Rothenberg and Tora Bikson have recently produced a report for the National Archives and Ministry of the Interior of the Netherlands entitled Carrying authentic, understandable and usable digital records through time. This document defines a strategy and framework for dealing with digital records, and (in Annex C) contains a model and metadata framework for use within a proposed experimental testbed.
The OAIS Model
Most of the initiatives mentioned so far originated in the library and archives communities. Another important recent development with preservation metadata implications has been the development of a draft Reference Model for an Open Archival Information System (OAIS). The development of this model is being co-ordinated by the Consultative Committee for Space Data Systems (CCSDS) at the request of the International Organization for Standardization (ISO). The CCSDS is an organisation established by member space agencies to co-ordinate members' information requirements. ISO requested that the CCSDS should co-ordinate the development of standards in support of the long-term preservation of digital information obtained from observations of the terrestrial and space environments. The result, the Reference Model for an Open Archival Information System, is currently being reviewed as an ISO Draft International Standard.
The document defines a high-level reference model for an Open Archival Information System or OAIS, which is defined as an organisation of people and systems that have "accepted the responsibility to preserve information and make it available for a designated community". Although development of the model originated in and has been led by the space data community, it is intended that the model can be adopted for use by other communities.
The OAIS model is not just concerned with metadata. It defines and provides a framework for a range of functions that are applicable to any archive - whether digital or not. These functions include those described within the OAIS documentation as ingest, archival storage, data management, administration and access. Amongst other things, the OAIS model aims to provide a common framework that can be used to help understand archival challenges and especially those that relate to digital information.
As part of this framework, the OAIS model identifies and distinguishes between the different types of information (or metadata) that will need to be exchanged and managed within an OAIS. Within the draft recommendation, the types of metadata that will be needed are defined as part of what is called a Taxonomy of Information Object Classes. Within this taxonomy, an Archival Information Package (AIP) is perceived as encapsulating two different types of information: some Content Information and any associated Preservation Description Information (PDI) that will allow the understanding of the Content Information over an indefinite period of time. The Content Information is itself divided into a Data Object - which would typically be a sequence of bits - and some Representation Information that is able to give meaning to this sequence. Descriptive Information that can form the basis of finding aids (and other services) can be based on the information that is stored as part of the PDI, but is logically distinct.
The OAIS Taxonomy of Information Object Classes further sub-divides the PDI into four different groups. These are based on some concepts described in the 1996 report on Preserving Digital Information that was produced by the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access (CPA) and the Research Libraries Group (RLG). The task force wrote that "in the digital environment, the features that determine information integrity and deserve special attention for archival purposes include the following: content, fixity, reference, provenance and context". Accordingly, the OAIS taxonomy divides PDI into Reference Information, Context Information, Provenance Information and Fixity Information.
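The relationships between these information objects can be sketched as a simple data structure. This is only an illustration of the structure described above, not anything defined by the OAIS model itself: the class and field names, the use of a plain dictionary for the PDI, and the helper function are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ContentInformation:
    data_object: bytes                # typically a sequence of bits
    representation_information: str   # whatever gives meaning to that sequence


@dataclass(frozen=True)
class ArchivalInformationPackage:
    content_information: ContentInformation
    # PDI is modelled here as an opaque dictionary (an assumption of this sketch).
    preservation_description_information: Dict[str, str]


def derive_descriptive_information(
    aip: ArchivalInformationPackage, keys: Iterable[str]
) -> Dict[str, str]:
    """Build Descriptive Information from selected PDI entries.

    A new dictionary is returned, so the result is based on the PDI
    but remains logically distinct from it.
    """
    pdi = aip.preservation_description_information
    return {key: pdi[key] for key in keys if key in pdi}


# Illustrative data only.
aip = ArchivalInformationPackage(
    ContentInformation(b"\x48\x69", "ASCII text"),
    {"identifier": "obj-1", "custodian": "Example Archive"},
)
finding_aid_entry = derive_descriptive_information(aip, ["identifier", "title"])
```

Keeping the derived description as a separate copy means that a finding aid can be edited or re-indexed without touching the information held inside the package.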
- Reference Information: the OAIS model defines this as the information that "identifies, and if necessary describes, one or more mechanisms used to provide assigned identifiers for the Content Information". This (rather clumsy) definition indicates that there is a need for the Content Information to be identified and described in some way. The CPA/RLG report suggests that for "an object to maintain its integrity, its wholeness and singularity, one must be able to locate it definitively and reliably over time among other objects". Reference Information would be a logical place to record, for example, unique identifiers assigned both by the OAIS itself and by external agencies, e.g. a Digital Object Identifier (DOI) or an ISBN. It could also be used to store basic descriptive-type information that could be used as the basis for resource discovery, although that would not be its main purpose within the PDI.
- Context Information: this is defined as information that "documents the relationships of the Content Information to its environment". Again, this rather unhelpful definition can be supplemented by the wider discussion that is provided by the CPA/RLG report. This suggests that 'context' should include information on the technical context of a digital object, e.g. to specify its hardware and software dependencies and to record things like hypertext links in a Web document. Context could also include information relating to the mode of distribution of a particular Digital Object (e.g. whether it is networked or provided on a particular storage device) and its wider social context.
- Provenance Information: within the OAIS taxonomy, Provenance Information refers generally to that information that "documents the history of the Content Information". Provenance is a key concept in the archives profession. The CPA/RLG report says that the "assumption underlying the principle of provenance is that the integrity of an information object is partly embodied in tracing from where it came. To preserve the integrity of an information object, digital archives must preserve a record of its origin and chain of custody". There is a certain overlap with the OAIS taxonomy's definition of Context Information, so Provenance Information is described in the OAIS taxonomy as a "special type of Context Information". While Provenance Information is primarily concerned with supporting the integrity of a Data Object, the information that is recorded could also provide information that could be used to help the management and use of Digital Objects stored within a repository (e.g. administrative metadata). It could also store information about the ownership of intellectual property rights that could be used to manage access to the Content Information of which it forms a part.
- Fixity Information: this - in OAIS terms - refers to any information that documents the particular authentication mechanisms in use within a particular repository. This is information that, like Provenance Information, helps support the integrity and authenticity of the Digital Object. Like 'provenance', terms like 'integrity' and 'authenticity' have been understood and used by archivists (and other groups) for a long time [33, 34]. The CPA/RLG report comments that if the content of an object is "subject to change or withdrawal without notice, then its integrity may be compromised and its value as a cultural record would be severely diminished". Changes can either be deliberate or unintentional, but both will adversely affect the integrity of a Digital Object. Most current technical solutions to fixity problems are based on the computation of checksums that can be used to tell when a resource has been changed or through the production of digital signatures. Presumably, these types of information could be recorded as Fixity Information.
There is no clear understanding, as yet, of how the Taxonomy of Information Object Classes defined in the OAIS model is meant to be implemented. It is possible, for example, that it could itself provide a basis for the development of a metadata schema.
Several European library-based projects have expressed an interest in implementing parts of the OAIS model, including its Taxonomy of Information Object Classes:
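The four PDI groups, and the checksum-based approach to fixity mentioned above, can be illustrated with a short sketch. The OAIS model does not prescribe any of this: the class and function names are hypothetical, and the choice of SHA-256 (via Python's standard hashlib module) is simply an assumption made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PreservationDescriptionInformation:
    reference: Dict[str, str]   # assigned identifiers, e.g. a DOI or an ISBN
    context: Dict[str, str]     # e.g. hardware and software dependencies
    provenance: List[str]       # origin and chain of custody, oldest first
    fixity: Dict[str, str]      # algorithm name -> hexadecimal digest


def compute_fixity(data_object: bytes, algorithm: str = "sha256") -> Dict[str, str]:
    """Return a checksum for the bit sequence, keyed by algorithm name."""
    return {algorithm: hashlib.new(algorithm, data_object).hexdigest()}


def is_unchanged(data_object: bytes, fixity: Dict[str, str]) -> bool:
    """True only if at least one checksum is recorded and all of them still match."""
    return bool(fixity) and all(
        hashlib.new(algorithm, data_object).hexdigest() == expected
        for algorithm, expected in fixity.items()
    )


# Illustrative data only.
data = b"example bit sequence"
pdi = PreservationDescriptionInformation(
    reference={"local-id": "obj-1"},
    context={"software": "any plain-text viewer"},
    provenance=["deposited by the creating agency"],
    fixity=compute_fixity(data),
)
```

A checksum of this kind can show that a Data Object has changed, whether deliberately or unintentionally, but not who changed it or why; that is where digital signatures and Provenance Information would have to take over.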
- The NEDLIB project - the NEDLIB (Networked European Deposit Library) project is funded by the European Commission's Telematics Applications Programme and is led by the Koninklijke Bibliotheek, the National Library of the Netherlands. The project has been developing an architectural framework for what it calls a deposit system for electronic publications (DSEP) that is based on the OAIS model. The project will also be testing emulation-based preservation strategies and building a demonstrator system to test the handling of electronic publications from acquisition to access.
- The Cedars project - the Cedars (CURL Exemplars in Digital Archives) project is a three-year project, funded under Phase III of eLib and managed by the Consortium of University Research Libraries (CURL). The lead sites in Cedars are the universities of Cambridge, Leeds and Oxford, with expertise being drawn from both computing services and libraries within the three organisations. The project's aim is to address some of the strategic, methodological and practical issues relating to digital preservation. These issues are being addressed in three main project strands: one looking at digital preservation strategies and techniques (including emulation); another concerned with collection development and rights management issues; and a third interested in the metadata required to adequately preserve digital information objects. Cedars has adopted a distributed archive architecture based on an implementation of the OAIS model. In addition, the project is attempting to develop a preservation metadata schema that can be tested within the project demonstrators. The development of this schema has been informed by the OAIS model and the taxonomy of information object classes that it identifies.
For a variety of reasons, this column has concentrated on identifying relevant projects and initiatives rather than on describing any of them in detail. It is suggested that those with a deeper interest in the subject would profit by following up some of the hypertext links listed in the References section below.
References
- Day, M., 'Extending metadata for digital preservation.' Ariadne (Web version), 9, May 1997.
- Dublin Core Metadata Initiative.
- Day, M., Metadata for preservation. CEDARS project document AIW01. Bath: UKOLN, UK Office for Library and Information Networking, 1998.
- Ross, S., 'Consensus, communication and collaboration: fostering multidisciplinary co-operation in electronic records.' In: Proceedings of the DLM-Forum on Electronic Records, Brussels, 18-20 December 1996. INSAR: European Archives News, Supplement II. Luxembourg: Office for Official Publications of the European Communities, 1997, pp. 330-336; here p. 331.
- Rothenberg, J., Avoiding technological quicksand: finding a viable technical foundation for digital preservation. Washington, D.C.: Council on Library and Information Resources, 1999, p. 27.
- Ross, S., 'Consensus, communication and collaboration,' op. cit. p. 331.
- Bearman, D., Electronic evidence: strategies for managing records in contemporary organizations. Pittsburgh, Pa.: Archives and Museum Informatics, 1994, p. 302.
- Rothenberg, J., Avoiding technological quicksand, op. cit., pp. 13-16.
- Bearman, D., 'Reality and chimeras in the preservation of electronic records.' D-Lib Magazine, 5 (4), April 1999.
- Cedars 2.
- Lynch, C., 'Canonicalization: a fundamental tool to facilitate preservation and management of digital information.' D-Lib Magazine, 5 (9), September 1999.
- RLG Working Group on Preservation Issues of Metadata, Final report. Mountain View, Calif.: Research Libraries Group, May 1998.
- Cameron, J. and Pearce, J., 'PANDORA at the crossroads: issues and future directions.' In: Sixth DELOS Workshop: Preservation of Digital Information, Tomar, Portugal, 17-19 June 1998. Le Chesnay: ERCIM, 1998, pp. 23-30.
- National Library of Australia, Request for Tender for the provision of a Digital Collection Management System. Attachment 2 - Logical data model. RFT 99/11. Canberra: National Library of Australia, 23 August 1999.
- Dollar, C.M., Archival theory and information technologies: the impact of information technologies on archival principles and methods. Macerata: University of Macerata Press, 1992.
- O'Shea, G., 'Keeping electronic records: issues and strategies.' Provenance, 1 (2), March 1996.
- McKemmish, S. and Parer, D., 'Towards frameworks for standardising recordkeeping metadata.' Archives and Manuscripts, 26, 1998, pp. 24-45; here, p. 39.
- McKemmish, S., Cunningham, A. and Parer, D., Metadata mania. Paper given at: Place, Interface and Cyberspace: Archives at the Edge, the 1998 Annual Conference of the Australian Society of Archivists, Fremantle, Western Australia, 6-8 August 1998.
- McKemmish and Parer, op. cit., p. 38.
- Duff, W., 'Ensuring the preservation of reliable evidence: a research project funded by the NHPRC.' Archivaria, 42, 1996, pp. 28-45. See also the project's Web site at:
- Bearman, D. and Sochats, K., Metadata requirements for evidence. Pittsburgh, Pa.: University of Pittsburgh, School of Information Science, 1996.
- Duranti, L. and MacNeil, H., 'The protection of the integrity of electronic records: an overview of the UBC-MAS research project.' Archivaria, 42, 1996, pp. 46-67. See also the project's Web site at:
- Duranti, L., 'Reliability and authenticity: the concepts and their implications.' Archivaria, 39, 1995, pp. 5-10.
- Reed, B., 'Metadata: core record or core business.' Archives and Manuscripts, 25 (2), 1997, pp. 218-241.
- National Archives of Australia, Recordkeeping metadata standard for commonwealth agencies, version 1.0. Canberra: National Archives of Australia, May 1999.
- Acland, G., Cumming, K. and McKemmish, S., The end of the beginning: the SPIRT Recordkeeping Metadata Project. Paper given at: Archives at Risk: Accountability, Vulnerability and Credibility, the 1999 Annual Conference of the Australian Society of Archivists, Brisbane, Queensland, 29-31 July 1999.
- Ibid. See also: McKemmish, S. and Acland, G., Accessing essential evidence on the Web: towards an Australian recordkeeping metadata standard. Paper given at: AusWeb99, the Fifth Australian World Wide Web Conference, Ballina, New South Wales, 17-20 April 1999.
- Rothenberg, J. and Bikson, T., Carrying authentic, understandable and usable digital records through time: report to the Dutch National Archives and Ministry of the Interior. The Hague: Rijksarchiefdienst, 6 August 1999.
- Consultative Committee for Space Data Systems, Reference model for an Open Archival Information System (OAIS), Red Book, Issue 1. CCSDS 650.0-R-1. Washington, D.C.: National Aeronautics and Space Administration, p. 1-11.
- Ibid. pp. 4-21 - 4-27.
- Garrett, J. and Waters, D., (eds.), Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access, 1996.
- Bearman, D. and Lytle, R.H., 'The power of the principle of provenance.' Archivaria, 21, 1985-86, pp. 14-27.
- Duranti, L., Eastwood, T. and MacNeil, H., The preservation of the integrity of electronic records. Vancouver: University of British Columbia, School of Library, Archival & Information Studies, 1996.
- Lynch, C.A., 'Integrity issues in electronic publishing.' In: Peek, R.P. and Newby, G.B., (eds.), Scholarly publishing: the electronic frontier. Cambridge, Mass.: MIT Press, 1996, pp. 133-145.
- Werf-Davelaar, T. van der, 'Long-term preservation of electronic publications: the NEDLIB project.' D-Lib Magazine, 5 (9), September 1999.
- Russell, K., 'The JISC Electronic Libraries Programme.' Computers and the Humanities, 32, 1998, pp. 353-375.
- Russell, K., 'CEDARS: long-term access and usability of digital resources: the digital preservation conundrum.' Ariadne, 18, December 1998.
- Russell, K. and Sergeant, D., 'The Cedars project: implementing a model for distributed digital archives.' RLG DigiNews, 3 (3), 15 June 1999.
Cedars is a Consortium of University Research Libraries (CURL) project funded by the Joint Information Systems Committee (JISC) of the UK higher education funding councils through its Electronic Libraries Programme (eLib).
UKOLN is funded by the Library and Information Commission, the JISC, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.
Author details
Michael Day
UKOLN: the UK Office for Library and Information Networking
University of Bath
Bath BA2 7AY, UK