ECDL-2003 Web Archiving

Michael Day reports on the 3rd ECDL Workshop on Web Archives held in Trondheim, August 2003.

On 21 August 2003, the 3rd ECDL Workshop on Web Archives [1] [2] was held in Trondheim, Norway, in association with the 7th European Conference on Digital Libraries (ECDL) [3]. This event was the third in a series of annual workshops; the first two were held in association with the ECDL conferences in Darmstadt [4] and Rome [5]. These earlier workshops primarily focused on the activities of legal deposit libraries and the collection strategies and technologies being used by Web archiving initiatives [6]. At the start of the workshop, Julien Masanès of the Bibliothèque nationale de France (BnF) welcomed participants on behalf of the organising committee. He noted that, for the first time, the workshop had issued a call for papers and that, following the peer-review process, ten papers had been accepted for presentation at the workshop. These had been grouped into three broad themes: the design and use of persistent identifiers, the tools being developed and used to support Web archiving, and reports on some experiences.

Identifiers

The first three presentations dealt with the ever-important issue of identifiers. John Kunze of the California Digital Library began with an introduction to the ARK (Archival Resource Key) persistent identifier scheme [7]. Before introducing the scheme itself, Kunze noted some problems with existing approaches to persistent identification. These tend to be based on the concept of indirection, e.g. names that resolve to URLs at the point of access. In themselves, however, identification schemes based on indirection cannot guarantee persistent access, because this depends on the stewardship of the objects themselves. Kunze proposed a new approach to persistent identification based on statements of institutional commitment. The ARK identifier is a means of linking to three things: the object itself, its metadata, and a commitment statement from its current provider. The ARK scheme has four components: an (optional) name mapping authority hostport (NMAH), typically a hostname; the ARK label; a name assigning authority number (NAAN), a globally unique number identifying the name assigning authority; and the Name assigned by that name assigning authority. An example of an ARK URL (taken from Kunze’s full paper) would be:

http://ark.cdlib.org/ark:/13030/ft4w10060w

The NMAH (e.g. in this example: ark.cdlib.org/) is optional, so that it can be ignored or replaced when technologies and/or service providers change. Full details of the ARK scheme have been published as an Internet-Draft [8].
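The four components can be made concrete with a minimal sketch that splits the example URL above into its parts. The names `Ark`, `parse_ark` and `rehost` are illustrative assumptions, not part of the ARK specification, and the replacement hostname used at the end is hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class Ark(NamedTuple):
    """Components of an ARK as described in the text (illustrative type)."""
    nmah: Optional[str]  # name mapping authority hostport; optional
    naan: str            # name assigning authority number
    name: str            # Name assigned by the name assigning authority

    def core(self) -> str:
        """The host-independent part: label, NAAN and Name."""
        return f"ark:/{self.naan}/{self.name}"


def parse_ark(identifier: str) -> Ark:
    """Split an ARK, with or without a leading NMAH, into its components."""
    nmah = None
    rest = identifier
    if "://" in identifier:
        parts = urlsplit(identifier)
        nmah = parts.netloc
        rest = parts.path.lstrip("/")
    if not rest.startswith("ark:/"):
        raise ValueError(f"missing 'ark:/' label in {identifier!r}")
    naan, sep, name = rest[len("ark:/"):].partition("/")
    if not naan or not sep or not name:
        raise ValueError(f"missing NAAN or Name in {identifier!r}")
    return Ark(nmah, naan, name)


def rehost(ark: Ark, new_nmah: str) -> str:
    """Rebuild a URL for the same ARK under a different (hypothetical) host."""
    return f"http://{new_nmah}/{ark.core()}"


ark = parse_ark("http://ark.cdlib.org/ark:/13030/ft4w10060w")
print(ark.nmah, ark.naan, ark.name)
print(rehost(ark, "resolver.example.org"))  # hostname is made up
```

Because `core()` leaves out the hostname, the same identifier survives a change of service provider, which is the reason the NMAH is optional.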

This was followed by a presentation on the identification of network accessible documents by Carol van Nuys and Ketil Albertsen of the National Library of Norway [9]. The context was the National Library’s Paradigma (Preservation, Arrangement & Retrieval of Assorted DIGital MAterials) Project, which is investigating procedures for the selection, registration, and long-term preservation of digital objects [10][11]. Albertsen introduced some of the problems that were associated with identification at different levels of granularity (e.g. items in anthologies) and looked at ways of defining identifiable units at all four levels of IFLA’s Functional Requirements for Bibliographic Records (FRBR) model [12]. As part of this, it was proposed that objects needed identifiers to be assigned at higher abstract levels, i.e. the expression and work levels in FRBR.

In the next presentation, Eva Müller of the Electronic Publishing Centre of Uppsala University Library (Sweden) described how the URN:NBN (Uniform Resource Name : National Bibliography Number) had been implemented within the DiVA (Digitala vetenskapliga arkivet) system [13]. The DiVA (Digital Scientific Archive) system [14] was developed at Uppsala and is used to publish various types of scientific output, e.g. doctoral theses and working papers. The system is currently used by Uppsala University and four other co-operating Swedish institutions (Stockholm, Umeå, and Örebro universities, and Södertörns högskola) and is soon to be adopted by one Danish university (Århus University). The DiVA system is also an archival store, and the project has developed an XML schema (the DiVA Document Format) that contains metadata and those parts of the full text that can be converted to XML [15]. Müller’s presentation described the workflows that allow the DiVA system to reuse and enhance information provided by authors (through the use of document templates), assign URN:NBNs to documents, and ‘deposit’ the latter with the Royal Library, the National Library of Sweden. The assignment of URN:NBN [16] identifiers is done locally, and DiVA has been assigned sub-domains that can be used by project participants, e.g. URN:NBN:se:uu:diva for Uppsala University. These identifiers are a key part of the workflow that DiVA uses to deposit ‘archiving packages’ with the national library. These packages contain the object itself (e.g. a PDF or PostScript file), the DiVA Document Format (providing metadata and as much content as possible in XML) and its XML schema.
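Local assignment within a delegated sub-domain can be sketched as follows. This is a minimal illustration that assumes the hyphen-separated form in which a local number is appended to the sub-domain prefix; the function names and the local number `1234` are made up for the example and are not taken from the DiVA system.

```python
from typing import Tuple


def mint_urn_nbn(prefix: str, local_id: str) -> str:
    """Append a locally assigned number to a delegated URN:NBN prefix."""
    if not prefix.lower().startswith("urn:nbn:"):
        raise ValueError(f"not a URN:NBN prefix: {prefix!r}")
    if not local_id or "-" in local_id or ":" in local_id:
        raise ValueError(f"unsuitable local identifier: {local_id!r}")
    return f"{prefix}-{local_id}"


def split_urn_nbn(urn: str) -> Tuple[str, str, str]:
    """Return (country code, sub-domain, local identifier)."""
    head, sep, local_id = urn.partition("-")
    parts = head.split(":")
    if (len(parts) < 3 or not sep or not local_id
            or parts[0].lower() != "urn" or parts[1].lower() != "nbn"):
        raise ValueError(f"not a URN:NBN: {urn!r}")
    country = parts[2]
    sub_domain = ":".join(parts[3:])  # empty when no sub-domain is delegated
    return country, sub_domain, local_id


# The prefix comes from the text; the local number 1234 is hypothetical.
urn = mint_urn_nbn("URN:NBN:se:uu:diva", "1234")
print(urn)
print(split_urn_nbn(urn))
```

Keeping the prefix fixed per institution means that each participant can assign numbers without consulting the others, while the national library can still tell from the identifier which participant issued it.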

Tools

The following two presentations looked at tools being developed for Web archiving. The first was an outline of the Nordic Web Archive toolset by Thorsteinn Hallgrímsson of the National and University Library of Iceland and Sverre Bang of the National Library of Norway [17]. The Nordic Web Archive (NWA), a co-operative project of the national libraries of Denmark, Finland, Iceland, Norway, and Sweden, is developing a modular set of software tools known as the NWA toolset for improving the management of and access to harvested Web archives [18]. The toolset has three main components: a document retriever (the interface to the Web archive), an exporter (prepares objects for indexing), and an access module. The exporter outputs data in the XML-based NWA Document Format, which can then be sent to the indexer. The search engine currently supported by the NWA project is a commercial product developed by FAST [19], but it is hoped that this will be replaced by an open-source alternative. There remained concerns about usability, the increasing size of the Web, relevance, and scalability, but Hallgrímsson concluded by stressing that Web archiving and indexing should be seen as a tremendous - but challenging - opportunity, not as a problem or liability.

After this, Ketil Albertsen of the National Library of Norway returned to introduce the Paradigma Web Harvesting Environment [20]. The Paradigma Project has produced a workflow for the harvesting and management of a Web archive. One key issue is the balance between the automatic processing of objects and manual intervention. The project has estimated that, at most, only 1% of network documents could be described in the bibliographic system, so the challenge is to analyse, group and rank objects automatically so that human effort can be focused on the small proportion that requires manual intervention.
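The idea of ranking automatically and reserving human effort for a small proportion can be illustrated with a short sketch. The presentation summary does not say how Paradigma scores documents, so the score here is simply a number supplied by the caller; `select_for_manual_review` and the sample data are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def select_for_manual_review(
    scored_docs: Iterable[Tuple[str, float]],
    fraction: float = 0.01,
) -> List[str]:
    """Return the highest-scoring documents, capped at `fraction` of the total.

    `scored_docs` pairs a document identifier with a priority score produced
    by some automatic analysis (not specified in the source). The cap is
    rounded down so that the selection never exceeds the stated proportion.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    quota = math.floor(len(ranked) * fraction)
    return [doc_id for doc_id, _ in ranked[:quota]]


# Made-up data: 200 documents whose score equals their index.
docs = [(f"doc{i}", float(i)) for i in range(200)]
print(select_for_manual_review(docs))  # the two highest-scoring documents
```

Everything outside the returned list would be handled by automatic processing alone.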

Experiences

Before lunch there was time for short verbal updates from organisations not represented on the programme. These included the National Library of New Zealand (NLNZ), the National Archives of the UK, the National Diet Library of Japan, the National and University Library, Ljubljana (Slovenia), the Royal Library, Copenhagen (Denmark), Die Deutsche Bibliothek, and the National Library of Portugal. For example, Steve Knight of the NLNZ noted that New Zealand would soon have a revised mandate for legal deposit that would include online materials and that the library would be experimenting with both selective and harvesting approaches to collecting the New Zealand domain. The UK National Archives announced that a contract had recently been signed with the Internet Archive [21] to collect 50 UK Government Web sites at varying frequencies. The first results of this are now available from the National Archives Web pages [22].

After lunch, the opening presentation was an attempt to characterise the Portuguese Web space [23] by Daniel Gomes of the University of Lisbon. He outlined the results of a harvest of the Portuguese Web undertaken in April/May 2003. The researchers configured the Viúva Negra crawler [24] to gather information about the Portuguese Web - defined as the .pt domain, sites in .com, .org, etc. in Portuguese, and those with incoming links from the .pt domain - and ‘seeded’ it with 112,146 URLs. The crawler visited 131,864 sites, processed over 4 million URLs, and downloaded 78 GB of data. Analysis of the results concerned the length of URLs (between 5 and 1386 characters), last modified dates (53.5% unknown), MIME types (95% text/html), language distribution (73% of documents in the .pt domain are in Portuguese), the presence of meta-tags, and content replication.

The final three presentations looked at the experiences of particular Web archiving initiatives. First, Gina Jones of the Library of Congress (USA) talked about the challenges of building thematic Web collections [25] based on the experiences of developing the September 11 Web Archive [26] and the Election 2002 Web Archive [27]. Both of these collections were the result of collaborations between the Library of Congress, the Internet Archive and the research group webArchivist.org [28]. One group of challenges related to scoping the coverage of the collections, e.g. the problems of selecting the initial sets of ‘seed URLs,’ defining the number of links to be followed, both internally and externally, and the periodicity of collection. Others related to the processes that had to be followed after harvesting: the level of cataloguing, the maintenance of large amounts of data in ways that are scalable and accessible, and the provision of meaningful access. For example, each of the Library of Congress’s thematic collections will have a collection-level record, while all sites in the Election 2002 collection and around 10% of the September 11 collection will be described using MODS (Metadata Object Description Schema) [29].
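The scoping decisions mentioned above (seed URLs and separate limits for internal and external links) can be expressed as a simple rule. The sketch below is an illustration only: `CrawlScope`, its field names and the example hosts are assumptions, and the number of link hops from a seed is used as a simplified stand-in for "the number of links to be followed".

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


@dataclass(frozen=True)
class CrawlScope:
    seed_hosts: FrozenSet[str]  # hosts of the curated seed URLs
    max_internal_hops: int      # link depth allowed on seed hosts
    max_external_hops: int      # link depth allowed on any other host

    @classmethod
    def from_seeds(cls, seed_urls: Iterable[str],
                   max_internal_hops: int,
                   max_external_hops: int) -> "CrawlScope":
        hosts = frozenset(_host(url) for url in seed_urls)
        return cls(hosts, max_internal_hops, max_external_hops)

    def allows(self, url: str, hops_from_seed: int) -> bool:
        """True if `url`, found `hops_from_seed` links away, is in scope."""
        internal = _host(url) in self.seed_hosts
        limit = self.max_internal_hops if internal else self.max_external_hops
        return hops_from_seed <= limit


# Hypothetical collection: follow seed sites three links deep, others one.
scope = CrawlScope.from_seeds(["http://example.org/"], 3, 1)
print(scope.allows("http://example.org/news/item", 3))
print(scope.allows("http://other.example.net/page", 2))
```

The periodicity of collection is a separate decision and is not modelled here.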

At the other end of the scale, Jennifer Gross of the Institute of Chinese Studies at the University of Heidelberg (Germany) next introduced the Digital Archive for Chinese Studies (DACHS) [30]. Scholars of Chinese studies initiated this project because they were aware both of the importance of the Web in China, e.g. for communication between dissidents, and of its fragility. Because of its small size and relatively limited focus, DACHS can make use of the background knowledge of Chinese studies specialists, who know exactly what might be of interest to other scholars. After selection, sites are collected using non-specialist tools (e.g. Web browsers) and some metadata is created; access is then provided through the Institute’s library catalogue and through an index and full-text search available on the DACHS Web pages [31].

The final presentation concerned “Archiving the Czech Web” [32] and was given by Petr Žabicka of the Moravian Library in Brno (Czech Republic). He first described a project funded between 2000 and 2001 that decided on a ‘breadth first’ approach to harvesting the .cz domain, defined some collection criteria, and undertook a pilot crawl in September 2001 using the NEDLIB Harvester [33]. A second project called WebArchiv [34] initiated a second crawl in April 2002. This was more successful than the first crawl, but highlighted problems with the harvester, chiefly relating to the length of time it took to download pages.

The final workshop discussion centred on the continuing need for initiatives and organisations concerned with Web archiving to co-operate. Part of the discussion was based on the recent establishment of an International Internet Preservation Consortium (IIPC), a group that links some national libraries and the Internet Archive and that will act as a focus for the establishment of working groups and projects to deal with specific issues. In addition, a number of issues were raised that need to be considered in more detail. One of these concerns the development of better tools (new crawlers, software capable of indexing large-scale collections, etc.), which would need to be done collaboratively and with significant support from funding bodies. Another issue that will need to be faced soon is how users - both researchers and the general public - might in the future gain seamless access to Web collections fragmented by nationality, subject and resource type. Other important factors will be legal constraints on Web archiving, the (so-called) ‘deep Web,’ and the exact role of replication in preservation strategies.

One paper accepted for the workshop (on Political Communications Web Archiving) could not be delivered due to illness. However, the full texts of all papers are available from the workshop Web site [1].

Summing-up

A year appears to be a very long time in Web archiving terms. I felt that the Trondheim workshop had moved well beyond the previous workshops’ focus on collection strategies and crawler technologies. Other participants will undoubtedly have drawn other conclusions, but I felt that three points came out of the nine presentations and the workshop discussion.

Firstly, the focus of many presentations was not just on the collection of Web sites, but on what needs to be done after this to facilitate access and use. One strand of this concerned requirements for descriptive data (metadata), identifiers and indexing technologies. In many cases, initiatives are looking for a balance between what can be done in an automated way - essential when we are considering such large volumes of data - and what requires human intervention. The Paradigma Project, for example, is looking for ways of automatically identifying those resources that need manual input.

A second, if related, point is the need for Web archives to take account of granularity. Web objects are granular in a number of different ways and this influences the levels at which metadata needs to be captured or created. For example, the Library of Congress has decided to catalogue its thematic collections at collection level, with additional metadata (using MODS) at selected lower levels of granularity. The Paradigma Project has considered the granularity of objects as they relate to the FRBR model.

Finally, a more general point. The increased discussion of ‘deep Web’ sites has led me to think in more detail about what we mean when we talk about ‘Web archiving.’ I was struck when reading the BrightPlanet white paper on the deep Web [35] just how many of the largest sites listed were large databases that offer Web interfaces because this is the way that most of their current users want to gain access to them. Many of these (e.g. NASA’s Earth Observing System Data and Information System (EOSDIS), the US Census, Lexis-Nexis, INSPEC, etc.) existed long before the Web was developed, and presumably the interfaces to these will in due course migrate to whatever access technologies succeed it. Just because something can be accessed via the Web, does that mean that it should be within the scope of Web archiving initiatives? I’m not suggesting that libraries and other cultural heritage organisations should ignore the preservation needs of these key resources, just that Web archiving needs to be seen as just one part of more comprehensive digital preservation strategies.

References

  1. 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  2. Nuys, C. van. (2003). “ECDL 2003 workshop report: Web archives.” D-Lib Magazine, 9(9), September.
    Available at: http://www.dlib.org/dlib/september03/09inbrief.html#VAN_NUYS
  3. 7th European Conference on Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003.
    Available at: http://www.ecdl2003.org/
  4. What’s next for Digital Deposit Libraries? ECDL Workshop, Darmstadt, 8th September 2001.
    Available at: http://bibnum.bnf.fr/ecdl/2001/
  5. 2nd ECDL Workshop on Web Archiving, Rome, 19th September 2002.
    Available at: http://bibnum.bnf.fr/ecdl/2002/
  6. Day, M. (2002). “2nd ECDL Workshop on Web Archiving.” Cultivate Interactive, 8, November.
    Available at: http://www.cultivate-int.org/issue8/ecdlws2/
  7. Kunze, J.A. (2003). “Towards electronic persistence using ARK identifiers.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
    Also available at: http://ark.cdlib.org/arkcdl.pdf
  8. Kunze, J., & Rodgers, R.P.C. (2003). “The ARK Persistent Identifier Scheme.” IETF Internet-Draft, July.
    Available at: http://www.ietf.org/internet-drafts/draft-kunze-ark-06.txt
  9. Nuys, C. van, & Albertsen, K. (2003). “Identification of network accessible documents: problem areas and suggested solutions.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  10. National Library of Norway, Paradigma project.
    Available at: http://www.nb.no/paradigma/
  11. Nuys, C. van. (2003). “The Paradigma Project.” RLG DigiNews, 7(2), 15 April.
    Available at: http://www.rlg.org/preserv/diginews/v7_n2_feature2.html
  12. IFLA Study Group on the Functional Requirements for Bibliographic Records. (1998). Functional requirements for bibliographic records: final report. UBCIM Publications, New series, Vol. 19. Munich: K. G. Saur.
    Available at: http://www.ifla.org/VII/s13/frbr/frbr.htm
  13. Müller, E., Klosa, U., Hansson, P. & Andersson, S. (2003). “Archiving workflow between a local repository and the national archive: experiences from the DiVA project.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
    Also available at: http://publications.uu.se/epcentre/presentations.xsql
  14. DiVA portal.
    Available at: http://www.diva-portal.se/
  15. Hansson, P., Klosa, U., Müller, E., Siira, E., & Andersson, S. (2003). “Using XML for long-term preservation: experiences from the DiVA project.” 6th International Symposium on Electronic Theses and Dissertations (ETD 2003), Berlin, Germany, 20-24 May 2003.
    Available at: http://edoc.hu-berlin.de/etd2003/hansson-peter/
  16. Hakala, J. (2001). “Using National Bibliography Numbers as Uniform Resource Names.” RFC 3188, October.
    Available at: http://www.ietf.org/rfc/rfc3188.txt
  17. Hallgrímsson, Þ., & Bang, S. (2003). “Nordic Web Archive.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  18. Nordic Web Archive.
    Available at: http://nwa.nb.no/
  19. FAST.
    Available at: http://www.fastsearch.com/
  20. Albertsen, K. (2003). “The Paradigma Web harvesting environment.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  21. Internet Archive.
    Available at: http://www.archive.org/
  22. The National Archives, UK Central Government Web Archive.
    Available at: http://www.pro.gov.uk/webarchive/
  23. Gomes, D., & Silva, M.J. (2003). “A characterization of the Portuguese Web.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  24. Tumba, Viúva Negra crawler.
    Available at: http://www.tumba.pt/crawler.html
  25. Schneider, S. M., Foot, K., Kimpton, M., & Jones, G. (2003). “Building thematic web collections: challenges and experiences from the September 11 Web Archive and the Election 2002 Web Archive.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  26. September 11 Web Archive.
    Available at: http://september11.archive.org/
  27. Election 2002 Web Archive.
    Available at: http://www.loc.gov/minerva/collect/elec2002/
    Search available at: http://webarchivist.org/minerva/DrillSearch
  28. webArchivist.org.
    Available at: http://www.webarchivist.org/
  29. Metadata Object Description Schema (MODS).
    Available at: http://www.loc.gov/standards/mods/
  30. Gross, J. (2003). “Learning by doing: the Digital Archive for Chinese Studies (DACHS).” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  31. DACHS - Digital Archive for Chinese Studies.
    Available at: http://www.sino.uni-heidelberg.de/dachs/
  32. Žabicka, P. (2003). “Archiving the Czech Web: issues and challenges.” 3rd ECDL Workshop on Web Archives, Trondheim, Norway, 21 August 2003.
    Available at: http://bibnum.bnf.fr/ecdl/2003/
  33. Hakala, J. (2001). “The NEDLIB Harvester.” Zeitschrift für Bibliothekswesen und Bibliographie, 48, 211-216.
  34. WebArchiv.
    Available at: http://www.webarchiv.cz/
  35. Bergman, M.K. (2001). “The deep Web: surfacing hidden value.” Journal of Electronic Publishing, 7(1).
    Available at: http://www.press.umich.edu/jep/07-01/bergman.html

Author Details

Michael Day
UKOLN
University of Bath

Email: m.day@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/

Article Title: “3rd ECDL Workshop on Web Archiving” 
Author: Michael Day
Publication Date: 30-October-2003
Publication: Ariadne Issue 37
Originating URL: http://www.ariadne.ac.uk/issue37/ecdl-web-archiving-rpt/