Metadata Corner

Michael Day reports from Tomar, Portugal, on the DELOS6 Workshop.

Since 1996 the DELOS Working Group [1] has organised a series of workshops with the intention of promoting research into the further development of digital library technologies.

Castelo dos Templários, Tomar

The sixth workshop in the DELOS series was held in the Hotel dos Templários, Tomar (Portugal) on 17-19 June 1998 [2]. Tomar is a small town about 140 km north of Lisbon and is famous for its Templar castle and the magnificent Convento de Christo, a UNESCO World Heritage Site [3]. The workshop was jointly organised by the DELOS Working Group and NEDLIB (Networked European Deposit Library), a project funded by the European Commission’s Telematics Applications Programme [4]. It had about 40 attendees, mostly from Europe but including some speakers from the United States and Australia. Local organisation of the workshop was undertaken by INESC (Instituto de Engenharia de Sistemas e Computadores) [5] and the Portuguese National Library [6], with the support of the Central Library of the Instituto Superior Técnico (IST) in Lisbon [7].

Previous DELOS workshops had covered issues like metadata, multilingual information retrieval and image indexing. The sixth workshop concerned the preservation of digital information and the workshop presentations provided a good overview of current initiatives and projects in this area.

Strategic issues

Two introductory papers described strategic approaches to preservation being developed in the US and UK. Hans Rütimann, International Program Officer of the Council on Library and Information Resources (CLIR) [8], introduced work carried out under the auspices of the Commission on Preservation and Access (CPA) and the Digital Library Federation. His paper summed up the main conclusions and recommendations of the 1996 report of the Task Force on Archiving of Digital Information, commissioned jointly by the CPA and the Research Libraries Group (RLG), a report which has done much to set the agenda for current work on digital preservation [9]. There are three main strategies for digital preservation: migration across changing software and hardware platforms, emulation, and preserving obsolete hardware and software. Currently migration is the strategy most commonly proposed. Rütimann reported, however, that Jeff Rothenberg had argued that emulation was likely to be more effective over longer time periods, and that CLIR had funded him to develop and test his ideas on emulation as a preservation strategy. Rothenberg’s report should be available from the CLIR website later this year. There was also a viewing of the important CLIR videotape "Into the future: on the preservation of knowledge in the electronic age", which clearly outlines the scope and importance of the digital preservation problem [10]. "Into the Future" has been widely shown on public service TV channels in the US and has led to US press coverage of digital archiving issues, for example in The New York Times [11]. CLIR is also the administrative home of the Digital Library Federation (DLF), whose primary goal is "to establish the necessary conditions for creating, maintaining, expanding, and preserving a distributed collection of digital materials accessible to scholars and a wider public" [12].

Neil Beagrie introduced a study produced by the Arts and Humanities Data Service (AHDS) on developing a strategic policy framework for the creation and preservation of digital resources [13]. This had been funded by the Joint Information Systems Committee (JISC) of the UK higher education funding councils as part of a series of studies commissioned by the Digital Archiving Working Group (DAWG). The study noted, amongst other things, that different stakeholders become involved at different stages of the life-cycle of digital resources and that this has a significant effect upon the potential for (and the cost of) preserving these resources. Data creators, for example, very rarely consider how the resources that they create should be managed in the long term. Organisations concerned with digital preservation, on the other hand, often have virtually no influence over how resources are created. It is important, therefore, for all stakeholders to be aware of how their own activities will affect the life-cycle of a particular resource, and to understand the interests and involvement of other stakeholders. It is particularly important to raise awareness of preservation issues at the data creation stage of the life-cycle. The study also emphasised the importance of co-operative activity in the field of digital preservation, noting that "no single agency is likely to be able to undertake the role of preserving all digital materials within its purview …".

Metadata issues

There is currently a growing awareness of the importance of metadata to digital preservation [14]. This demonstrates that the library and information community is beginning to see the usefulness of metadata not just for resource discovery, important though that is, but also as an aid to the ongoing management of digital (specifically networked) resources, including their long-term preservation. For example, it is important to identify what metadata would be needed to enable the emulation of digital information created on obsolete software and hardware platforms. Collecting metadata would also be an important part of migration strategies. Evidence of the growing interest in metadata and digital preservation is the recent publication of the final report of an RLG Working Group on the Preservation Issues of Metadata [15].
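
As a rough illustration of what such metadata might look like in practice, the following Python sketch records the technical environment that a digital object depends on, together with a log of any migrations. The class names, field names and example values are hypothetical; they are not drawn from the RLG report or from any published schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MigrationEvent:
    """One format migration applied to a preserved object (hypothetical fields)."""
    date: str          # ISO 8601 date of the migration
    from_format: str
    to_format: str


@dataclass
class PreservationRecord:
    """Illustrative preservation metadata; not a published schema."""
    identifier: str
    file_format: str          # format the object is currently stored in
    creating_software: str    # would be needed if the object is to be emulated
    operating_system: str
    hardware_platform: str
    migrations: List[MigrationEvent] = field(default_factory=list)

    def migrate(self, date: str, to_format: str) -> None:
        """Record a migration and update the current format."""
        self.migrations.append(MigrationEvent(date, self.file_format, to_format))
        self.file_format = to_format

    def original_format(self) -> str:
        """Return the format in which the object was first deposited."""
        return self.migrations[0].from_format if self.migrations else self.file_format


# Invented example values, for illustration only.
record = PreservationRecord(
    identifier="example-0001",
    file_format="WordPerfect 5.1",
    creating_software="WordPerfect 5.1",
    operating_system="MS-DOS 5.0",
    hardware_platform="IBM PC compatible",
)
record.migrate("1998-06-19", "SGML")
```

The environment fields would serve an emulation strategy, while the migration log would serve a migration strategy by keeping the original format traceable after conversion.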

Several papers at the workshop covered these issues. Michael Day (UKOLN) introduced the UK CEDARS (CURL Exemplars in Digital ARchiveS) project, which is funded by JISC under the Electronic Libraries (eLib) Programme [16]. His paper gave a brief outline of the project’s aims and described in more detail the work that CEDARS proposes with regard to preservation metadata.

Alan R. Heminger of the US Air Force Institute of Technology proposed the adoption of a Digital Rosetta Stone (DRS) model. He proposed the creation of a meta-knowledge archive (or archives) that would maintain sufficient knowledge about how data had been stored and used to enable the future recovery of data from obsolete storage devices and file formats. Heminger attempted to demonstrate the theory behind the model with an example based on obsolete 8-track punched paper tape. The DRS approach, if applied consistently to all new technologies, would be one way of reconstructing information from obsolete storage devices and file formats, but Heminger himself admitted that the development of a meta-knowledge archive would be a time-intensive and expensive task.
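
To make the idea concrete, the Python sketch below treats a small piece of meta-knowledge (which tracks carry data, and how the parity track is used) as the key needed to recover characters from an 8-track paper tape. The tape layout, the even-parity rule and all names are assumptions made for illustration; they are not taken from Heminger's paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeFormat:
    """Hypothetical meta-knowledge describing an 8-track paper tape encoding.

    Assumed layout: the first `data_tracks` positions hold a character code,
    most significant bit first, and the final track is an even-parity bit.
    """
    hole: str = "O"        # symbol for a punched hole
    blank: str = "."       # symbol for an unpunched position
    data_tracks: int = 7


def decode_row(row: str, fmt: TapeFormat) -> str:
    """Decode one tape row into a character, checking even parity."""
    if len(row) != fmt.data_tracks + 1 or set(row) - {fmt.hole, fmt.blank}:
        raise ValueError(f"malformed row: {row!r}")
    if row.count(fmt.hole) % 2 != 0:
        raise ValueError(f"parity error in row: {row!r}")
    bits = "".join("1" if c == fmt.hole else "0" for c in row[: fmt.data_tracks])
    return chr(int(bits, 2))


def decode_tape(rows, fmt: TapeFormat) -> str:
    """Decode a sequence of rows into text."""
    return "".join(decode_row(r, fmt) for r in rows)


def encode_char(ch: str, fmt: TapeFormat) -> str:
    """Encode a 7-bit character as a tape row (used here to build sample data)."""
    if ord(ch) >= 2 ** fmt.data_tracks:
        raise ValueError(f"character does not fit in {fmt.data_tracks} bits: {ch!r}")
    bits = format(ord(ch), f"0{fmt.data_tracks}b")
    parity = "1" if bits.count("1") % 2 else "0"
    return "".join(fmt.hole if b == "1" else fmt.blank for b in bits + parity)


fmt = TapeFormat()
tape = [encode_char(c, fmt) for c in "DRS"]
recovered = decode_tape(tape, fmt)
```

Without the description held in `TapeFormat`, the rows are only patterns of holes; the point of the sketch is that a small, explicit record of the encoding is what makes the data recoverable.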

Dave MacCarn (WGBH Educational Foundation) outlined the concept of a Universal Preservation Format (UPF) for digital video and film, which proposes the use of a platform-independent format that will help make a wide range of data types accessible [17]. Interestingly, in his presentation, MacCarn advocated storing digital data on an extremely compact hybrid analogue medium (like the HD-ROSETTA disks [18] developed by Norsam Technologies) for long-term preservation. The stored information would include metadata describing how to recover the data stored on the medium and how to construct reading devices.

Collection management issues

The workshop also raised issues about collection management policies with regard to digital preservation. In the digital information environment it is far from clear who should be responsible for implementing collection management policies or what precise criteria these policies should include. Some Internet subject services or gateways, for example, use defined quality criteria for selecting what is included in their databases [19], but it is unlikely that these criteria would be suitable for (say) a national library’s preservation policy. With current Web technologies, it is possible to bypass the selection issue altogether. Brewster Kahle, for example, has founded the Internet Archive Foundation, which takes periodic snapshots of all parts of the Web that are freely and technically accessible [20] [21]. It is a relatively easy task to develop (or adapt) software robots which are able to collect the entire Web or at least particular domains of it.

At Tomar, a paper by Inkeri Salonharju (Helsinki University Library) and Kirsti Lounamaa (Centre for Scientific Computing) revealed that the Finnish EVA project [22] is using technology developed for the Nordic Web Index (NWI) [23] to harvest and index all Web documents located in the Finnish domain. Copyright considerations mean that there is currently no public access to the archive produced by EVA, although it is hoped that in the longer term legal deposit legislation for digital publications might provide for this. The Swedish Royal Library’s Kulturarw3 project has a similar approach [24]. The project’s basic idea is to automate as much as possible and build robots to download ‘everything’, following Kahle’s Internet Archive model. Kulturarw3 has chosen this approach partly because of the difficulty of devising suitable ‘futureproof’ selection criteria and partly because, like EVA, it is aware that implementing selection policies is currently expensive in terms of time and personnel.
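
The scoping step that a domain-wide harvest of this kind relies on can be sketched very simply. The following Python fragment only decides whether a URL falls within a national top-level domain; it performs no network access, and the function names and example URLs are hypothetical rather than taken from the EVA, NWI or Kulturarw3 software.

```python
from urllib.parse import urlsplit


def in_national_domain(url: str, tld: str = "fi") -> bool:
    """Return True if the URL is a Web address whose host lies in the given TLD."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").rstrip(".")
    return host == tld or host.endswith("." + tld)


def select_for_harvest(urls, tld: str = "fi"):
    """Filter candidate URLs, dropping duplicates while keeping their order."""
    seen = set()
    selected = []
    for url in urls:
        if url not in seen and in_national_domain(url, tld):
            seen.add(url)
            selected.append(url)
    return selected


# Invented candidate list, for illustration only.
candidates = [
    "http://www.example.fi/index.html",
    "http://www.example.se/index.html",
    "http://www.example.fi/index.html",   # duplicate of the first entry
    "ftp://ftp.example.fi/pub/file.txt",  # not an HTTP address
    "http://www.example.com/fi/",         # "fi" appears only in the path
]
harvest = select_for_harvest(candidates)
```

A test of this kind involves no judgement about the content of a document, which is what makes the comprehensive approach cheap to automate.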

By contrast, the National Library of Australia’s PANDORA project [25] started with the principle of selectivity. The workshop paper delivered by Judith Pearce (NLA) noted that no attempt was being made to capture the entire Australian domain. PANDORA instead uses selection guidelines developed by a Selection Committee on Online Australian Publications (SCOAP) [26]. These guidelines clearly state that networked publications that are not authoritative or do not have reasonable research value would not usually be selected for preservation. PANDORA is currently collecting only electronic publications that are in the public domain. The project is aware, however, that the NLA needs to consider the issues surrounding the management of commercially published digital information. Currently the library would view commercial publications held in the PANDORA archive as ‘secondary resources’, i.e. archived publications that should be used only when they are no longer available from the publisher. The NLA is aware that this policy may have to be modified once major Australian publishers move into the Internet publishing market.

There is, therefore, some conflict between the automated gathering models proposed in the Nordic digital preservation projects and the NLA’s development of appropriate collection management policies. The comprehensive Nordic approach may work in relatively small domains but other projects, including CEDARS, will work on developing collection management strategies for preservation. Developing and applying appropriate collection management policies may be time-consuming and expensive in the short-term but attempting to collect all digital documents may not be sustainable (or even desirable) in the longer-term.

Digitisation issues

Some of the other workshop papers described projects or programmes related to digitisation. These projects are primarily interested in giving access to digitised objects rather than in digital preservation itself. For example, Milena Dobreva of the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences outlined the current situation with regard to the creation of digitised collections of cultural heritage resources in Bulgaria. Kostas Chandrinos (Institute of Computer Science, Foundation for Research and Technology Hellas (ICS-FORTH) [27], Greece) described the Web-based architectures being used in the ARHON (A Multimedia System for Archiving, Annotation and Retrieval of Historical Documents) project. Other presentations were concerned with the digitisation of audio-visual resources.

Digital "salvage archiving" and e-mail records

The most entertaining paper was by David Wallace of the University of Michigan School of Information on the US PROFS-related litigation concerning the archival preservation of electronic mail records emanating from the US National Security Council and the Executive Office of the President. This case followed on from the investigation of the "Iran-Contra Affair" and the discovery of important e-mail communications between Oliver North and National Security Adviser John Poindexter [28]. One of the interesting outcomes of the long legal battle (which is still continuing) was that in 1993 the US National Archives and Records Administration (NARA) [29] was told to take immediate action to preserve the electronic records that had been the subject of the case. The material consisted of around 5,700 backup tapes (in various formats) and over 150 hard disk drives from personal computers. These materials eventually found their way to NARA’s Center for Electronic Records (CER) where, despite many problems, most of them were successfully copied. Wallace pointed out, however, that this success came at the cost of the rest of the work of the CER. The time and resources spent on PROFS "salvage" work meant that CER had to temporarily stop accessioning electronic records from the rest of the government. Wallace concluded that digital "salvage archiving" would probably require resources, both technical and economic, that would not be available in most institutions concerned with preservation, and that it would not be an effective way of proceeding unless substantial additional resources could be concentrated on the salvage effort.

Another interesting outcome of this litigation is that it is now accepted in the US that electronic mail software can produce official government records. It has been shown that computer systems will need to accommodate electronic record-keeping functionality at the systems design stage if later archival processing and preservation are to be accomplished in a cost-effective and timely manner. The litigation has also demonstrated that policies that rely on printing selected e-mail records to paper can omit significant systems metadata and could violate US public records legislation if the printout is then treated as the official record and the electronic versions are deleted.
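
The loss of systems metadata in a printout can be illustrated with a short Python sketch using the standard library's email package. The message, the addresses and the choice of which headers count as 'printed' are all invented for illustration; they do not represent the PROFS system or any actual record.

```python
from email import message_from_string

# An invented message; the header values are placeholders.
RAW_MESSAGE = """\
Received: from mailhub.example.gov by archive.example.gov; Mon, 1 Jun 1998 09:15:02 -0400
Message-ID: <19980601.0915.0001@mailhub.example.gov>
Date: Mon, 1 Jun 1998 09:15:00 -0400
From: sender@example.gov
To: recipient@example.gov
Cc: observer@example.gov
Subject: Meeting notes

Notes attached as discussed.
"""

# Headers that a printout might show (an assumption made for this sketch).
PRINTED_HEADERS = ("Date", "From", "To", "Subject")


def printed_view(raw: str) -> str:
    """Render only the headers and body that would appear on paper."""
    msg = message_from_string(raw)
    lines = [f"{name}: {msg[name]}" for name in PRINTED_HEADERS if msg[name]]
    return "\n".join(lines) + "\n\n" + msg.get_payload()


def omitted_headers(raw: str):
    """List header names present in the electronic record but not printed."""
    msg = message_from_string(raw)
    printed = {name.lower() for name in PRINTED_HEADERS}
    return [name for name in msg.keys() if name.lower() not in printed]


paper_copy = printed_view(RAW_MESSAGE)
lost = omitted_headers(RAW_MESSAGE)
```

In this invented case the paper copy would lose the routing information, the unique message identifier and the list of copied recipients, all of which survive only in the electronic version.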

Conclusions

Inevitably other issues were raised over the three days of the workshop. There was a widespread acceptance that the preservation of digital information does not just involve solving technical problems but requires scalable solutions to organisational, economic and political problems as well. Such solutions will require a strategic approach and collaboration between institutions at a national and international level.

Convento de Christo, Tomar

The workshop ended after lunch on 19 June. Transfers were arranged to Lisbon, where some of the workshop attendees had the opportunity to visit Expo 98 [30] or the capital city itself.

Acknowledgements

The authors would like to thank Kelly Russell (CEDARS Project Manager) and the CEDARS project for helping to fund Michael Day’s participation in the DELOS workshop. The authors have also produced a shorter review of the same workshop in the July/August 1998 issue of D-Lib Magazine [31].

References

  1. DELOS Working Group
    URL: <http://www.iei.pi.cnr.it/DELOS/>
  2. Sixth DELOS Workshop: Preservation of Digital Information
    URL: <http://crack.inesc.pt/events/ercim/delos6/>
  3. Luís Maria Pedrosa dos Santos Graça, Convento de Christo. Lisboa-Mafra: Edição ELO, 1994.
  4. Project NEDLIB
    URL: <http://www.konbib.nl/nedlib/>
  5. INESC: Instituto de Engenharia de Sistemas e Computadores
    URL: <http://www.inesc.pt/>
  6. Biblioteca Nacional
    URL: <http://www.biblioteca-nacional.pt/>
  7. Instituto Superior Técnico (IST)
    URL: <http://www.ist.utl.pt/>
  8. Council on Library and Information Resources (CLIR)
    URL: <http://www.clir.org/>
  9. Preserving digital information. Report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access, May 1996.
    URL: <http://www.rlg.org/ArchTF/>
  10. Into the future: on the preservation of knowledge in the electronic age. A film by Terry Sanders, produced in association with the Commission on Preservation and Access (a program of the Council on Library and Information Resources) and the American Council of Learned Societies.
    URL: <http://www.clir.org/programs/otheractiv/intro.html>
  11. Stephen Manes, Time and technology threaten digital archives. Science Desk. New York Times, 7 April 1998.
    URL: <http://archives.nytimes.com/archives/>
  12. Digital Library Federation
    URL: <http://www.clir.org/programs/diglib/diglib.html>
  13. Neil Beagrie and Daniel Greenstein, A Strategic policy framework for creating and preserving digital collections. Final draft. London: Arts and Humanities Data Service, July 1998.
    URL: <http://ahds.ac.uk/manage/framework.htm>
  14. Michael Day, Extending metadata for digital preservation. Ariadne, No. 9, May 1997.
    URL: <http://www.ariadne.ac.uk/issue9/metadata/>
  15. RLG Working Group on Preservation Issues of Metadata, Final report. Mountain View, Calif.: Research Libraries Group, May 1998.
    URL: <http://www.rlg.org/preserv/presmeta.html>
  16. CURL Exemplars in Digital ARchiveS (CEDARS)
    URL: <http://www.leeds.ac.uk/cedars/>
  17. Universal Preservation Format
    URL: <http://info.wgbh.org/upf/>
  18. Norsam Technologies, HD-ROSETTA Archival Storage System.
    URL: <http://www.norsam.com/rosetta.htm>
  19. Paul Hofman, Emma Worsfold, Debra Hiom, Michael Day and Angela Oehler, Specification for resource description methods: 1. Selection criteria for quality controlled information gateways. DESIRE Deliverable 3.2 (2). Bath: UKOLN, May 1997.
    URL: <http://www.ukoln.ac.uk/metadata/desire/quality/>
  20. Brewster Kahle, Preserving the Internet. Scientific American, Vol. 276, No. 3, March 1997, pp. 72-73.
  21. Peter Lyman and Brewster Kahle, Archiving digital cultural artifacts: organizing an agenda for action, D-Lib Magazine, July/August 1998.
    URL: <http://www.dlib.org/dlib/july98/07lyman.html>
  22. EVA - the acquisition and archiving of electronic network publications
    URL: <http://renki.lib.helsinki.fi/eva/english.html>
  23. Nordic Web Index (NWI)
    URL: <http://nwi.lub.lu.se/?lang=en>
  24. Kulturarw3
    URL: <http://kulturarw3.kb.se/html/projectdescription.html>
  25. Preserving and Accessing Networked DOcumentary Resources of Australia (PANDORA)
    URL: <http://www.nla.gov.au/pandora/>
  26. Selection Committee on Online Australian Publications (SCOAP), Guidelines for the selection of online Australian publications intended for preservation by the National Library. Canberra: National Library of Australia, January 1997.
    URL: <http://www.nla.gov.au/scoap/scoapgui.html>
  27. Institute of Computer Science, Foundation for Research and Technology - Hellas (ICS-FORTH)
    URL: <http://www.ics.forth.gr/ICS_home/>
  28. David Bearman, The implications of Armstrong v. Executive Office of the President for the archival management of electronic records. American Archivist, Vol. 56, 1993, pp. 674-689.
  29. National Archives and Records Administration (NARA)
    URL: <http://www.nara.gov/>
  30. Expo ‘98, Lisboa
    URL: <http://www.expo98.pt/en/homepage.html>
  31. Michael Day and Neil Beagrie, Sixth DELOS Workshop - Preservation of digital information, June 17-19, 1998, Tomar, Portugal. D-Lib Magazine, July/August 1998.
    URL: <http://www.dlib.org/dlib/july98/07clips.html#DELOS>

Author details

Michael Day
Research Officer
UKOLN: the UK Office for Library and Information Networking
University of Bath
Bath BA2 7AY, UK
E-mail: m.day@ukoln.ac.uk
Web page: http://www.ukoln.ac.uk/

Neil Beagrie
Collections and Standards Development Officer
Arts and Humanities Data Service Executive
King’s College London
London WC2R 2LS
E-mail: neil.beagrie@ahds.ac.uk
Web page: http://ahds.ac.uk/bkgd/exec.html