In September 2006, the National Library of New Zealand Te Puna Mātauranga o Aotearoa, The British Library and Sytec announced the successful development of a Web harvesting management system.
The system, known as Web Curator Tool, is designed to assist curators of digital archives in collecting Web-published material for storage and preservation.
The Web Curator Tool is the latest development in the practice of Web site harvesting: using software to 'crawl' through a specified section of the World Wide Web and gather 'snapshots' of Web sites, including the images and documents posted on them. It is a further advance in the race to ensure that the world's digital heritage is preserved for future generations, and not lost through obsolescence and the ephemeral nature of Web materials.
The International Internet Preservation Consortium (IIPC) seeks to set up and support collaborative projects to develop standards and open source tools of practical benefit in the new field of Web archiving.
In particular, by 2004, several IIPC members were considering the need for a desktop solution to the new challenges of collecting Web materials - a tool that would assist curators practically, without requiring a high level of technical understanding of Web site design and construction, the structure and formats of Web sites, or of issues of storage, preservation and access in digital repositories. During 2005 the IIPC Content Management Working Group reviewed work on requirements independently prepared by the Library of Congress (LC), UK Web Archiving Consortium (UKWAC), National Library of Australia (NLA) and National Library of New Zealand (NLNZ).
The Library of Congress prepared a summary set of requirements and asked the four institutions (LC, UKWAC, NLA and NLNZ) to rate their importance; it then compiled the results into a tentative set of functional specifications for a Web Curator Tool. The LC also prepared an initial set of textual use cases outlining many of the ways the Web Curator Tool was expected to be used, which served as the basis for software development.
The National Library of New Zealand and The British Library were at that time, like many other IIPC member institutions, in the early stages of building a selective pilot archive of Web sites in anticipation of getting a remit to collect on a domain scale under Legal Deposit regulation.
The BL agreed at the Reykjavik IIPC meeting in June 2005 to collaborate with NLNZ and financially contribute to the proposed development, and confirmed this commitment at the Washington meeting in October 2005.
Both libraries had parallel programmes running to develop digital repositories which would in time be capable of storing and preserving ARC files of harvested Web sites, and potentially of merging in retrospective data acquired by agreement with the Internet Archive. Both were keen to exploit the Web harvesting extensions to the ARC format, which would allow better preservation and search and retrieval of stored Web materials.
The British Library meanwhile had been leading the inception of a joint UK pilot scheme to develop a collaborative archive of Web sites. This came about following a JISC/Wellcome Trust study, and six founder institutions signed a Consortium Agreement in 2004.
The UK Web Archiving Consortium (UKWAC) was given permission to use, free of charge, the National Library of Australia's PANDAS software, installed and managed by Magus Research Ltd under contract to The British Library. PANDAS uses the HTTrack Web copier to gather Web sites for storage, adding an interface layer over it, together with workflow management processes for selection, recording permissions, targeting and scheduling gathers of recurrent updates, and the ability to correct harvesting errors before submission to the incorporated archive store.
Web sites were to be selected in accordance with the collection development policy of each institution, and permission to archive would have to be obtained in writing or by email before any material was harvested. The development of the UKWAC archive was originally envisaged as a two-year pilot project, from which it was expected the members would learn much about the selection, capture and storage of Web sites; and by which time Legal Deposit Web-published materials would have come under statutory regulation for the UK.
In fact, Legal Deposit regulations have taken longer to develop than was projected, and UKWAC continues to build its archive using PANDAS. It is now keen to adopt new standards for Web archiving emerging through the IIPC (such as the WARC storage format for Web archives, put forward as a candidate international standard), and to use a new generation of tools exploiting these standards.
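To give a flavour of the WARC storage format mentioned above, the sketch below writes and reads back a single WARC-style record. It is a deliberately simplified illustration, not the full specification: a real WARC file normally carries mandatory headers such as WARC-Record-ID, WARC-Date and Content-Type, and records are often individually gzipped.

```python
# Simplified sketch of one WARC-style record: a version line, header lines,
# a blank line, the content block, then two CRLFs. Illustrative only.
import io

def write_record(stream, warc_type, target_uri, payload):
    """Write one minimal WARC/1.0 record to a binary stream."""
    headers = (
        f"WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        f"\r\n"
    )
    stream.write(headers.encode("utf-8"))
    stream.write(payload)
    stream.write(b"\r\n\r\n")

def read_record(stream):
    """Read one record back: version line, headers until the blank line, block."""
    version = stream.readline().decode().strip()
    headers = {}
    for raw in iter(stream.readline, b"\r\n"):  # stop at the blank line
        name, _, value = raw.decode().strip().partition(": ")
        headers[name] = value
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload

buf = io.BytesIO()
write_record(buf, "response", "http://example.org/", b"<html>snapshot</html>")
buf.seek(0)
version, headers, payload = read_record(buf)
print(version, headers["WARC-Target-URI"])  # WARC/1.0 http://example.org/
```

The self-describing header block is what distinguishes WARC from the older ARC format, and is what makes the stored snapshots easier to preserve, search and retrieve.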
UKWAC now expects to continue the collaboration for another year. Before the end of its current Consortium Agreement in September 2007, it will move its operation from PANDAS to the Web Curator Tool and Heritrix for selection, harvesting and acquisition, and to WERA (Web ARchive Access) tools for access.
Through a collaboration agreement signed between the two funding libraries in March 2006, it was decided at the outset that The National Library of New Zealand (NLNZ) would lead the project and run the procurement of a software developer. Steve Knight of the National Digital Heritage Archive within the NLNZ was nominated Project Sponsor.
Sytec, a subsidiary of TelstraClear, was selected to provide the development team, under the immediate direction of Gordon Paynter of NLNZ, the Project Manager.
The Project's scope was to be controlled tightly to a condensed set of essential requirements. These were reviewed over a week-long start-up workshop in mid-April 2006 - the only time the core project team ever actually assembled. Communications were difficult because of the 12-hour time difference between New Zealand and the UK, and were largely confined to email - even then it was unusual to receive a response the same day. We managed two video conferences, held in the evening in the UK and rather early in the morning in New Zealand - but most of the regular progress reporting and issue discussion was done by phone calls from the BL team to Gordon's home in the evening (for him). We are very grateful to him for allowing this; it would otherwise have been a pretty impersonal experience, and control over progress would have been much less efficient.
Both libraries saw this project as an adjunct to their own digital asset management programmes, and were looking to incorporate the selection and acquisition of Web materials into the mainstream of digital collection development.
The project was driven by an agreed set of key business requirements, with specific business benefits expected to flow from meeting them.
So the project began by working on a System Requirements Specification (SRS) for the Web Curator Tool, with specific exclusions from scope agreed to keep the project tightly focused.
The Web Curator Tool was to be developed as an enterprise-class solution, inter-operable with other organisational systems. It would have a user-centred design, enabling users to select, describe and harvest online publications without requiring an in-depth knowledge of Web harvesting technology. It would provide auditable logs of major processes for tracking and problem resolution, and would incorporate workflows to identify content for archiving and then manage it - including permissions, selection, description, scoping, harvesting and quality review - together with customisable schedules for capturing regular updates of selected Web sites.
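The workflow stages named above can be pictured as an ordered progression. The sketch below is purely illustrative: the stage names follow the article, but the simple linear-transition rule is an assumption made for the example, not the Web Curator Tool's actual internal model.

```python
# Illustrative sketch of the workflow stages described in the article.
# The linear ordering is an assumption for illustration only.
WORKFLOW = [
    "permissions",     # record permission to harvest the material
    "selection",       # select the content for archiving
    "description",     # add descriptive metadata
    "scoping",         # define what the harvest may fetch
    "harvesting",      # run the crawl
    "quality review",  # inspect the result before submission
    "archiving",       # submit approved content to the repository
]

def next_stage(current):
    """Return the stage that follows `current`, or None when finished."""
    i = WORKFLOW.index(current)
    return WORKFLOW[i + 1] if i + 1 < len(WORKFLOW) else None

print(next_stage("harvesting"))  # quality review
```

Each transition is a natural point for the auditable logging the requirements call for, since every stage change can be recorded against a named user and timestamp.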
Both The British Library and the National Library of New Zealand aimed to integrate the Web Curator Tool into their own digital asset management programmes.
The project undertook to share the product with other organisations around the world as an open source release before the end of 2006.
Both libraries were keen that this project should deliver core functionality quickly, that could be released as open source software for general adoption, and subsequent shared development in the light of practical usage.
Timescales for the key project stages (all completed on schedule) were:
IIPC Functional Requirements: to June 2005
IIPC Use Cases: to October 2005
Project Definition, Solution Scope and Software Requirements Specification
Detailed Design: March-April 2006
User Acceptance Testing: July-August 2006
Open Source Release: September 2006
The Web Curator Tool interface was designed to be easy to use and easy on the eye, as illustrated by the screen shots below.
WCT does use some specific terminology - for instance:
The Target is the unit of selection.
You can attach a Schedule to a Target to specify when (and how often) it will be harvested.
Scheduled Target Instances are put in a queue; when their scheduled start time arrives, the Web Curator Tool starts the harvest.
Examining the Queue gives you a good idea of the current state of the system.
The Owner (or another User) then has to review the quality of the completed harvest and either endorse it, submitting it to the archive, or reject it.
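The Target, Schedule and Queue concepts above can be sketched as a toy data model. All the class and field names here are illustrative assumptions chosen for the example; the Web Curator Tool's own Java classes differ.

```python
# Toy model of the Target / Schedule / Target Instance / Queue concepts.
# Names and fields are illustrative assumptions, not WCT's real classes.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
import heapq

@dataclass(order=True)
class TargetInstance:
    start_time: datetime                             # when this harvest is due
    target_name: str = field(compare=False)
    state: str = field(default="scheduled", compare=False)

@dataclass
class Target:
    name: str              # the unit of selection, e.g. a seed URL
    interval: timedelta    # the Schedule: how often to harvest

    def instances(self, first_start, count):
        """Generate the next `count` scheduled Target Instances."""
        for n in range(count):
            yield TargetInstance(first_start + n * self.interval, self.name)

queue = []  # the harvest Queue, ordered by scheduled start time
target = Target("http://example.org/", timedelta(days=7))
for inst in target.instances(datetime(2006, 9, 1, 2, 0), 3):
    heapq.heappush(queue, inst)

nxt = heapq.heappop(queue)  # the instance whose start time arrives first
print(nxt.target_name, nxt.state)
```

A priority queue keyed on start time is one natural way to realise "when their scheduled start time arrives": the system only ever needs to inspect the head of the queue to know what to harvest next.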
The British Library specified additional multi-agency functionality to support its collaborative Web archiving within UKWAC, which is now built into the product.
The Web Curator Tool Project has very quickly produced usable software with all the components needed to start harvesting Web sites for addition to an existing digital repository.
One of its success criteria was to have the application ready for implementation by the two collaborating libraries in 2006. This was achieved. The National Library of New Zealand has since deployed the Web Curator Tool and is using it operationally. The British Library has installed WCT in a test environment but has been short of technical resources to take it further; we now plan to run trials involving UKWAC partners in Spring 2007.
It is released under the Apache License, Version 2.0, and the WCT Web site contains further information for readers wishing to know more. Queries can be addressed to either UK or New Zealand staff.
Editor's note: Readers may be interested in a subsequent article: Jackson Pope and Philip Beresford, "IIPC Web Archiving Toolset Performance Testing at The British Library", Ariadne Issue 52, July 2007.