ECDL2004: 4th International Web Archiving Workshop, September 2004

Michael Day

Michael Day reports on the 4th International Web Archiving Workshop held at the University of Bath in September as part of ECDL 2004.

An annual Web archiving workshop has been held in conjunction with the European Conference on Digital Libraries (ECDL) since the 5th conference, held in September 2001 [1]. The University of Bath, UK hosted the 4th workshop in the series - now renamed the International Web Archiving Workshop - on 16 September 2004 [2]. Julien Masanès of the Bibliothèque nationale de France (BnF) welcomed around 60 delegates to Bath to listen to ten presentations and hoped that these would prompt much useful discussion.

Technologies

Julien Masanès himself gave the first presentation, on the International Internet Preservation Consortium (IIPC) [3]. The objectives of the consortium were primarily practical, organised into research projects and working groups. The consortium was co-ordinated by the BnF; members currently included the Internet Archive and eleven national libraries from Europe, North America and Australia. There were six working groups. The 'framework' working group was investigating the technical basis of Web archiving, including architectural designs, API (application programming interface) specifications, exchange formats, etc. This group would be producing a general overview document, a new version of the Internet Archive's ARC storage format, and a metadata specification that could be used to document the context of Web resources. The 'metrics and testbeds' group had already published two reports [4] [5], both focused on the practical issue of developing a test bed for Web crawlers. The 'Deep Web' group was focusing on developing tools for deposit-driven Web archiving and the capture of Web content inaccessible to crawlers. The 'researchers requirements' working group was working with the potential users of Web archives to help define a common vision on, for example, selection criteria, update frequencies and associated documentation. Other working groups covered 'access tools' and 'content management.' In addition to the working groups, IIPC also included some projects, e.g. the development of the open-source Heritrix crawler and some 'smart modules' for the characterisation of Web sites or for citation linking.

Gordon Mohr of the Internet Archive then gave an introduction to the Heritrix crawler [6]. He explained that while the Internet Archive had extremely large collections (around 40 billion resources, 400+ terabytes), the vast majority of this data had come from Alexa Internet, and the archive had no control over the proprietary crawlers that were being used. With the archive itself doing some crawling - e.g. on behalf of the Library of Congress or the UK National Archives - it had a requirement for an open-source, extensible, Web scale, archival quality crawler. Use cases were developed for broad, focused, continuous and experimental crawling and a highly modular program was developed and released in 2004. From that year the program had been used for all crawling done by the Internet Archive. The crawler has a Web interface and is highly configurable, making it particularly good for focused or experimental crawls. The discussion raised the question of scalability and the need for crawlers to interface with other systems, e.g. the PANDAS archive management software.

Keeping to the technical theme, Younès Hafri of the Ecole Polytechnique de Nantes and the French Institut National de l'Audiovisuel (INA) then introduced a new distributed crawler system called Dominos. This system had been developed in the context of a research programme investigating the possibility of extending the legal deposit law in France to cover the Web. Within this, the INA was responsible for looking at the feasibility of collecting and managing multimedia resources. An essential requirement for this is the ability to crawl a large number of Web resources in the shortest possible time. Hafri argued that such large-scale crawlers needed to be flexible, have high performance, be tolerant of faults (e.g. in network and server behaviour), and be configurable and easy to maintain. The Dominos system was specifically designed to maintain very high download rates, e.g. by storing most data in memory rather than on disk and by delegating tasks to reduce the number of active processes in the system. A fuller technical description of the Dominos system is available elsewhere [7].

Identifying Web pages that have changed is an important way of avoiding the repeated harvesting of unchanged pages. Lars Clausen of the State and University Library in Århus introduced a survey of the use in the Danish Web domain of two indicators of change available in the HTTP header, the datestamp and the etag [8]. A survey of the front pages of the Danish domain was used to investigate the reliability and usefulness of the indicators, both individually and in combination. The study discovered that in this particular dataset, the etag, though often missing, was a more reliable indicator of change than the datestamp. Also, where the etag was missing, the datestamp was not to be trusted. The initial conclusion was that the indicators were perhaps more reliable than had previously been thought, although this may have been a reflection of the unrepresentative nature of the dataset. In any case, more research into this is required.
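As an illustration only, the decision rule suggested by these findings might be sketched as follows. This is a minimal sketch and not code from the study: the function names and the three verdict strings are assumptions, and the headers are passed in as plain dictionaries rather than fetched over the network. It trusts the etag when both visits supply one and otherwise returns "unknown", reflecting the observation that the datestamp was not to be trusted when the etag was missing.

```python
from typing import Mapping, Optional


def _header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up an HTTP header case-insensitively; return None if absent."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value.strip()
    return None


def change_verdict(previous: Mapping[str, str], current: Mapping[str, str]) -> str:
    """Compare the headers from two visits to the same URL.

    Returns "unchanged", "changed" or "unknown" (hypothetical labels).
    """
    old_etag = _header(previous, "ETag")
    new_etag = _header(current, "ETag")
    if old_etag and new_etag:
        # Both visits supplied an etag: treat it as the indicator of change.
        return "unchanged" if old_etag == new_etag else "changed"
    # No etag on one or both visits: the datestamp (Last-Modified) alone is
    # not relied upon here, so the page should simply be harvested again.
    return "unknown"
```

A crawler following this rule would skip a page only on an "unchanged" verdict and would re-harvest in the other two cases.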

Jose Coch of Lingway gave a presentation on a French Ministry of Research-funded project called WATSON (Web: analyse de textes, sélection et outils nouveaux). This used language-engineering techniques for supporting decisions on measuring the 'importance' of sites, which was especially significant for the deep Web problem. The approach was broadly similar to Google's PageRank, and correlated reasonably well with manual assessments made by librarians, with some specific exceptions. The aim was to develop tools that would be able to identify less important sites, to help the librarian select which ones should be collected, and ultimately to help the user of the archived site. Coch described the main components of the system, which included modules for site characterisation, the logical structuring of Web sites, the recognition of named entities (persons, organisations, places and dates), the semantic markup of sentences, and morphological tagging. The resulting summaries can be used to support the manual evaluation of sites and provide information to future researchers.

In the final presentation of the technologies session, Niels H. Christensen of the Royal Library of Denmark introduced the potential of format registries for Web archiving. He started by noting that Web archives are unable to control what exact formats they collect, and that access would depend on relating objects to the correct application. Format registries would need to include ways of finding out what a particular format was and give access to viewers and converters. Current initiatives included the Global Digital Format Registry [9] and the UK National Archives PRONOM service [10]. The lively discussion that followed this presentation demonstrated the current perceived importance of format registries, although concerns were raised about embedded formats and intellectual property rights.
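To make the idea concrete, a format registry lookup can be pictured as a table that maps identifying byte signatures to a format name and a pointer to a suitable viewer. The sketch below is a toy in-memory example and does not reflect the data model or interface of the Global Digital Format Registry or PRONOM; the class name, the registry contents and the viewer labels are all hypothetical. Only the leading byte signatures themselves are standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FormatEntry:
    name: str
    signatures: Tuple[bytes, ...]  # byte sequences found at the start of a file
    viewer: str                    # hypothetical label for a rendering application


# A tiny illustrative registry; a real one would hold far richer information.
REGISTRY = (
    FormatEntry("PDF", (b"%PDF-",), "pdf-viewer"),
    FormatEntry("PNG", (b"\x89PNG\r\n\x1a\n",), "image-viewer"),
    FormatEntry("GIF", (b"GIF87a", b"GIF89a"), "image-viewer"),
    FormatEntry("JPEG", (b"\xff\xd8\xff",), "image-viewer"),
)


def identify(data: bytes) -> Optional[FormatEntry]:
    """Return the first registry entry whose signature starts the data."""
    for entry in REGISTRY:
        if any(data.startswith(signature) for signature in entry.signatures):
            return entry
    return None
```

Signature matching of this kind only looks at the start of an object, so it does nothing to address the concern about embedded formats raised in the discussion.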

Experiences

After lunch, Yan Hongfei of Peking University, Beijing, gave the first presentation in the case studies session. He gave an overview of Web archiving developments in China and in particular the Web Infomall Project, which has been collecting Chinese Web pages on a regular basis since 2002. The project uses a configurable crawler program called Tianwang to collect pages on an incremental basis, only downloading pages that are new or modified. For storage, the Web Infomall uses the Tianwang storage format, which includes the content and some basic metadata, e.g. the URL, date of download and IP address. Additional metadata is generated at the time of capture, e.g. the last modified date, the content type of the page, character encoding and language, and an MD5 signature. Each downloaded page is assigned a unique identifier and is made freely available through the Web Infomall site [11].
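One way of picturing incremental collection is to compare a content signature with the one recorded at the previous visit. The following is a minimal sketch under stated assumptions, not a description of how Tianwang works: the presentation did not say that the MD5 signature is used for change detection, and the record layout, function names and status labels here are hypothetical. The stored fields simply mirror the basic metadata mentioned above (URL, date of download, IP address and MD5 signature).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class PageRecord:
    url: str
    downloaded_at: datetime
    ip_address: str
    md5: str


def md5_signature(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of the page content."""
    return hashlib.md5(content).hexdigest()


def classify_page(store: Dict[str, PageRecord], url: str,
                  content: bytes, ip_address: str) -> str:
    """Return "new", "modified" or "unchanged" and update the store if needed."""
    signature = md5_signature(content)
    previous = store.get(url)
    if previous is not None and previous.md5 == signature:
        return "unchanged"  # identical content: nothing new to store
    status = "new" if previous is None else "modified"
    store[url] = PageRecord(url, datetime.now(timezone.utc), ip_address, signature)
    return status
```

Note that this comparison needs the content to have been fetched already, so it saves storage rather than bandwidth; avoiding the download itself would depend on indicators such as the HTTP headers.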

Paul Koerbin of the National Library of Australia (NLA) then gave an introduction to the latest developments with the PANDORA Web archive [12]. He started by noting that PANDORA had always been a pragmatic and practical initiative and that it was now seen as part of the routine work of the national library and its partners. The initiative is selective in what it collects and negotiates permissions with the site owners. The collection process involves a great deal of manual quality assurance, which adds considerable value but is extremely labour intensive. The archive in August 2004 had around 6,500 titles, with 21 million files, comprising 680 Gigabytes for the display copies only. For management, the NLA had developed and continued to enhance the PANDAS (PANDORA Digital Archiving System) workflow management system [13]. This is used to collect and store administrative metadata, initiate the harvesting and quality assurance processes, and to prepare resources for display. The system currently works with the HTTrack crawler, but will be moving to include Heritrix in the future. Some key Web archiving processes still remain outside the scope of PANDAS, e.g. the Web site selection process and the generation of descriptive metadata. For preservation, PANDORA keeps a number of master copies - including a 'preservation master' which is not touched at all - which are stored on tape in the NLA's Digital Object Storage System. Koerbin noted that the NLA remained committed to the selective approach, although there was a realisation of the future need to scale up and to find better ways of identifying and selecting candidate sites. There was also a perceived need to develop PANDAS further, e.g. to cope with the automatic ingest of large volumes of Web data, to comply with standards like those being developed by the IIPC, and to operate with complete domain harvesting approaches. The current priority was to make the software more stable and open source, so that development effort could be shared with others.

Alenka Kavcic-Colic of the National and University Library of Slovenia introduced a project that was experimenting with collecting the Slovenian Web. The project started in 2002 and was a co-operation between the National and University Library and the Jožef Stefan Institute in Ljubljana. The project used crawler programs to harvest the Slovenian Web domain, those sites physically located in Slovenia, and those that used the Slovenian language or included topics relevant to Slovenia. Some classes of document, e.g. journals and maps, were enhanced by the creation of descriptive metadata and can be found through the library's catalogue. The Jožef Stefan Institute had focused on using data mining techniques like link analysis, named entity extraction, text categorisation, and context visualisation, to support end-user access. The discussion after the presentation concentrated on the difficulties of defining national domains for Web harvesting. This raised the important issue of co-ordination between national initiatives and the need for standardised mechanisms to support discovery and the exchange of information between archives.

Jared A. Lyle of the University of Michigan gave the final presentation on approaches to appraising the Web. Lyle was interested in whether sampling techniques developed by archivists in the paper era could be used for the selection of Web sites. After a brief introduction to archival appraisal and its use in the twentieth century, Lyle noted that in the digital age, with falling storage costs, archives were perhaps becoming less selective. In the 1970s and 1980s, archivists widely used sampling as a way of reducing the bulk of some classes of paper-based record while maintaining a representative amount for future use. The types of sampling technique used included purposive (i.e. those records judged to be of lasting value), systematic, random, that based on certain physical characteristics, or mixed mode. Some experiments had been done with sampling on 4 million pages captured from the University of Michigan's Web domain (umich.edu). The first approach tried to apply purposive sampling, through identifying pages with high levels of content, large file sizes or short URLs, but the results were unsatisfactory. Further approaches used pure random sampling, the results of which are not representative, although Lyle thought it might have uses in selecting some particular types of page, e.g. personal student pages. The conclusions were that sampling is cheaper than other approaches, can help reduce redundancy, may be less biased, but will definitely not solve all selection problems. The discussion afterwards revealed a strong scepticism about the use of sampling techniques, although it was acknowledged that Web archiving initiatives already do some ad hoc sampling, e.g. based on the technical limitations of crawler programs.
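For readers unfamiliar with the terminology, the difference between systematic and random sampling can be shown in a few lines. This is a generic sketch and not the procedure used in the Michigan experiments; the function names, the step size and the example URLs are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def systematic_sample(items: Sequence[str], step: int, start: int = 0) -> List[str]:
    """Keep every step-th item, beginning at position start."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(items[start::step])


def random_sample(items: Sequence[str], size: int,
                  seed: Optional[int] = None) -> List[str]:
    """Draw size distinct items uniformly at random; a seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(items), size)


# Hypothetical list of captured page URLs.
urls = [f"http://www.example.edu/page{i}.html" for i in range(10)]
every_fifth = systematic_sample(urls, step=5)
three_at_random = random_sample(urls, size=3, seed=2004)
```

Systematic sampling depends on the order in which the pages happen to be listed, whereas random sampling does not; neither takes any account of the content of a page, which is the job that purposive sampling attempts.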

The final discussion mostly concerned the feasibility (or wisdom) of keeping all human knowledge. Differing opinions were offered, but several delegates noted that collecting and preserving the Web represented a much smaller task - at least in storage terms - than, for example, television and radio broadcasts.

Summing Up

As in past years, the International Web Archiving Workshop was a useful forum for those involved or interested in Web archiving initiatives to gather together to hear about new technologies and projects, and to discuss some key issues. In a previous workshop report, I noted a move away from talk about collection strategies and crawler technologies towards a better consideration of user needs and access requirements [14]. The main 'theme' of the 2004 workshop seemed to be co-operation between Web archives, both within the International Internet Preservation Consortium and more widely. ECDL 2005 will be held next September in Vienna, Austria [15], and it is hoped that IWAW will return alongside it so that we can continue the discussion.

The full text of all presented papers will be available from the IWAW Web pages in late 2004.

References

[1] International Web Archiving Workshops. Retrieved 29 October 2004 from http://bibnum.bnf.fr/ecdl/
[2] 4th International Web Archiving Workshop (IWAW04). Retrieved 29 October 2004 from http://www.iwaw.net/04/
[3] International Internet Preservation Consortium. Retrieved 29 October 2004 from http://www.netpreserve.org/
[4] Boyko, A. (2004). Test bed taxonomy for crawler, v. 1.0. International Internet Preservation Consortium, 20 July. Retrieved 29 October 2004 from http://www.netpreserve.org/publications/iipc-r-002.pdf
[5] Marill, J., Boyko, A., & Ashenfelder, M. (2004). Web harvesting survey, v. 1. International Internet Preservation Consortium, 20 July. Retrieved 29 October 2004 from http://www.netpreserve.org/publications/iipc-r-001.pdf
[6] Heritrix crawler. Retrieved 29 October 2004 from http://crawler.archive.org/
[7] Hafri, Y., & Djeraba, C. (2004). "High performance crawling system." In: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, New York, USA, 15-16 October 2004, New York: ACM Press, pp. 299-306.
[8] Clausen, L. (2004). "Concerning etags and datestamps." State and University Library, Århus and Royal Library, Copenhagen, Denmark, July. Retrieved 29 October 2004 from http://www.netarchive.dk/website/publications/Etags-2004.pdf
[9] Global Digital Format Registry. Retrieved 29 October 2004 from http://hul.harvard.edu/gdfr/
[10] The National Archives, PRONOM. Retrieved 29 October 2004 from http://www.nationalarchives.gov.uk/pronom/
[11] Web Infomall. Retrieved 29 October 2004 from http://www.infomall.cn/index-eng.htm
[12] PANDORA Archive. Retrieved 29 October 2004 from http://pandora.nla.gov.au/
[13] PANDORA Digital Archiving System (PANDAS). Retrieved 29 October 2004 from http://pandora.nla.gov.au/pandas.html
[14] Day, M. (2003). "3rd ECDL Workshop on Web Archiving." Ariadne, 37. Retrieved 29 October 2004 from http://www.ariadne.ac.uk/issue37/ecdl-web-archiving-rpt/
[15] ECDL 2005, Vienna, Austria, 18-23 September 2005. Retrieved 29 October 2004 from http://www.ecdl2005.org/

Author Details

Michael Day
UKOLN
University of Bath

Email: m.day@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/