Web Curator Tool

philip beresford

Web Curator Tool

Philip Beresford tells the story (from The British Library's perspective) of the development of new software to aid all stages of harvesting Web sites for preservation.

In September 2006 The National Library of New Zealand Te Puna M?tauranga o Aotearoa, The British Library and Sytec, announced the successful development of a Web harvesting management system.

The system, known as Web Curator Tool, is designed to assist curators of digital archives in collecting Web-published material for storage and preservation.

The Web Curator Tool is the latest development in the practice of Web site harvesting (using software to 'crawl' through a specified section of the World Wide Web, and gather 'snapshots' of Web sites, including the images and documents posted on them). The Web Curator Tool is a further advance in the race to ensure the world's digital heritage is preserved for future generations and not lost through obsolescence and the ephemeral nature of Web materials.

The International Internet Preservation Consortium (IIPC) seeks to set up and support collaborative projects to develop standards and open source tools of practical benefit in the new field of Web archiving.

In particular, by 2004, several IIPC members were considering the need for a desktop solution to the new challenges of collecting Web materials - a tool that would assist curators practically, without requiring a high level of technical understanding of Web site design and construction, the structure and formats of Web sites, or of issues of storage, preservation and access in digital repositories. During 2005 the IIPC Content Management Working Group reviewed work on requirements independently prepared by the Library of Congress (LC), UK Web Archiving Consortium (UKWAC), National Library of Australia (NLA) and National Library of New Zealand (NLNZ).

The Library of Congress prepared a summary set of requirements and asked the four institutions (LC, UKWAC, NLA and NLNZ) to rate their importance; it compiled the results into a tentative set of functional specifications for a Web Curator Tool. The LC then prepared an initial set of textual use cases outlining many of the ways the Web Curator Tool was expected to be used, and which served as the basis for software development.

The National Library of New Zealand and The British Library were at that time, like many other IIPC member institutions, in the early stages of building a selective pilot archive of Web sites in anticipation of getting a remit to collect on a domain scale under Legal Deposit regulation.

The BL agreed at the Reykjavik IIPC meeting in June 2005 to collaborate with NLNZ and financially contribute to the proposed development, and confirmed this commitment at the Washington meeting in October 2005.

Both libraries had parallel programmes running to develop digital repositories which would in time be capable of storing and preserving ARC files of harvested Web sites, and potentially of merging in retrospective data acquired by agreement with the Internet Archive. Both were keen to exploit the Web harvesting extensions to the ARC format, which would allow better preservation and search and retrieval of stored Web materials.

The UK Web Archiving Consortium

The British Library meanwhile had been leading the inception of a joint UK pilot scheme to develop a collaborative archive of Web sites. This came about following a JISC/Wellcome Trust study, and six founder institutions signed a Consortium Agreement in 2004. They were:

The British Library
The National Library of Scotland
The National Library of Wales / Llyfrgell Genedlaethol Cymru
The National Archives
The Wellcome Library
Higher Education Funding Council / JISC

The UK Web Archiving Consortium (UKWAC) was given permission to use free of charge the National Library of Australia's PANDAS software, installed and managed by Magus Research Ltd under contract to The British Library. Pandas uses the HTTrack Web copier tool to gather Web Sites for storage, adding an interface layer over this, and also workflow management processes for selection, recording permissions, targeting and scheduling gathers of recurrent updates, and the ability to correct harvesting errors before submission to the incorporated archive store.

Web sites were to be selected in accordance with the collection development policy of each institution, and permission to archive would have to be obtained in writing or by email before any material was harvested. The development of the UKWAC archive was originally envisaged as a two-year pilot project, from which it was expected the members would learn much about the selection, capture and storage of Web sites; and by which time Legal Deposit Web-published materials would have come under statutory regulation for the UK.

In fact Legal Deposit regulations have taken longer to be developed than was projected, and UKWAC continues to build its archive using Pandas. It is now keen to adopt new standards for Web archiving emerging through the IIPC (such as the WARC storage format for Web archives, put forward as a candidate international standard), and to use a new generation of tools exploiting these standards.

UKWAC now expects to continue collaboration for another year. It will move the operation over from Pandas to using Web Curator Tool and Heritrix for selection, harvesting and acquisition, and WERA (Web ARchive Access) tools for access, before the end of its current Consortium agreement in September 2007.

The UKWAC archive [1] is updated weekly and is freely available to researchers; the site also provides more background descriptive material on this Consortium.

The Project

Through a collaboration agreement signed between the two funding libraries in March 2006, it was decided at the outset that The National Library of New Zealand (NLNZ) would lead the project and run the procurement of a software developer. Steve Knight of the National Digital Heritage Archive within the NLNZ was nominated Project Sponsor.

Sytec, a subsidiary of TelstraClear, was selected to provide the development team, under the immediate direction of Gordon Paynter of NLNZ, the Project Manager.

The Project's scope was to be controlled tightly to a condensed set of essential requirements. These were reviewed over a week-long start-up workshop in mid-April 2006. This was the only time the core project team actually ever assembled. Communications were difficult because of the 12-hour time difference between NZ and UK, and were mainly constrained to email - even then it was unusual to be able to respond the same day. We managed two video conferences held in the evening in the UK, and rather early in the morning in NZ - but most of the regular progress reporting and issue discussion was done by regular phone-calls from the BL team to Gordon's home in the evening (for him). We are very grateful to him for allowing this, as it would otherwise have been a pretty impersonal experience, and control over progress would have been much less efficient.

Project Objectives

Both libraries saw this project as an adjunct to their own digital asset management programmes, and were looking to incorporate the selection and acquisition of Web materials into the mainstream of digital collection development.

The key business requirements for the project were:

Provide ability to perform scheduled event and selective harvests.
Provide integrated management and workflow for harvesting Web sites.
Use open standards and software in all project outputs, particularly those developed by the IIPC.
Integrate with existing technical architecture and other enterprise applications.
Use the Heritrix crawler developed by the Internet Archive.

These business benefits were expected from the project:

Improved ability to harvest Web sites for archival purposes with more efficient and effective workflow.
Automation of harvest activities currently performed manually.
Capture of harvested material with a more sophisticated and widely used crawler.
Capture of harvested material in .arc format, which has stronger storage and archiving characteristics.
Reduced training and induction needs for staff newly deployed on Web archiving (e.g. ability to spread the collection of Web materials to other e-media acquisitions staff).

So the project began by working on a System Requirements Specification (SRS) for the Web Curator Tool, including:

components to manage workflows for the following core processes:
1. Harvest Authorisation: getting permission to harvest Web material and make it available
2. Selection, scoping and scheduling: what will be harvested, how, when and how often?
3. Description: Dublin Core metadata
4. Harvesting: Downloading the material at the appointed time with the Heritrix Web harvester deployed on multiple machines
5. Quality Review: making sure the harvest worked as expected, and correcting simple harvest errors
a user interface, and in-built Help;
a scheduler that implements crawls with Heritrix;
a process for submitting harvested material to a digital archive; and
a modular architecture that allows new components to be substituted, extended or added to support the more specific requirements of other institutions.

Specific exclusions to the project scope were:

It is NOT a digital archive or document repository. It does not incorporate long-term storage or preservation
It is NOT an access tool. It does not provide public access to harvested material. It submits material to an external archive, which can be accessed using Wayback or WERA access tools
Implementing non-generic features which may be required by other IIPC members. The tool would be sufficiently modular to allow features to be extended or added at a later date.
Implementing tools for cataloguing harvested materials beyond the basic requirements of archive submission (e.g. information required to create Submission Information Packages).
Ongoing maintenance, development and support of the open source tool after its release.

The Web Curator Tool was to be developed as an enterprise class solution, inter-operable with other organisational systems. It would have a user-centred design. It would enable users to select, describe and harvest online publications without requiring an in-depth knowledge of Web harvesting technology. It would provide auditable logs of major processes for tracking and problem resolution, it would incorporate workflows to identify the content for archiving and then manage it, including permissions, selection, description, scoping, harvesting and quality review, and customisable schedules for capturing regular updates of selected Web sites.

Both The British Library and the National Library of New Zealand aimed to integrate the Web Curator Tool into their own digital asset management programmes.

The project undertook to share the product with other organisations around the world as an open source release before the end of 2006.

Project Timescale

Both libraries were keen that this project should deliver core functionality quickly, that could be released as open source software for general adoption, and subsequent shared development in the light of practical usage.

Timescales for the key project stages (all completed on schedule) were:

IIPC Functional Requirements IIPC Use Cases	To June 2005 To October 2005
Project Definition, Solution Scope Software Requirements Specification	November-December 2005
Procurement	January-February 2006
Detailed Design	March-April 2006
Development	May-July 2006
User Acceptance Testing	July-August 2006
Open Source Release	September 2006

Technical Design

Implemented in Java
Runs in Apache Tomcat
Incorporates parts or all of
- Acegi Security System
- Apache Axis (SOAP data transfer)
- Apache Commons Logging
- Heritrix (version 1.8)
- Hibernate (database connectivity)
- Quartz (scheduling)
- Spring Application Framework
- Wayback
Platform:
- Tested on Solaris (version 9) and Red Hat Linux
- Should work on any platform that supports Apache Tomcat
Database:
- A relational database is required
- Tested on Oracle and PostgreSQL
- Installation scripts provided for Oracle and PostgreSQL
- Should work with any database that Hibernate supports including MySQL, Microsoft SQL Server, and about 20 others

Interface Design

The Web Curator Tool interface was designed to be user-friendly, easy to use and easy on the eye, as illustrated by the screen shots below:

WCT does use some specific terminology - for instance:

A Target is the defined part(s) of a Web site to be harvested.
A Target Group is a set of URLs that, taken together, comprise a Web publication.
Target instances are the individual harvests that are either scheduled to happen, or are in progress, or which have finished. Target Instances are created automatically for a Target when that Target is approved. A Target Instance is created for each harvest that has been scheduled.

Figure 1: Main menu screen, showing the main areas of functionality

Workflow Outline

The Target is the unit of selection:

If there is something you want to harvest, archive and describe, then it is a Target.

You can attach a Schedule to a Target to specify when (and how often) it will be harvested.

But you cannot harvest until you have permission to harvest, and you cannot harvest until the selection is approved.

Figure 2: Target definition (one of several tabs)

Scheduled Target Instances are put in a queue. When their scheduled start time arrives:

WCT allocates the harvest to one of the harvesters
The harvester invokes Heritrix and harvests the requested material
When the harvest is complete, the User is notified

Examining the Queue gives you a good idea of the current state of the system:

WCT provides a quick view of the instances in the Queue, including Running, Paused, Queued, and Scheduled Instances

Figure 3: The User view of Target Instances shows all the instances that the user owns

The Owner (or another User) then has to:

Quality Review the harvest result to see if it was successful
- Using the Browse Tool to Browse the harvest result to ensure all the content is there
- Using the Prune Tool to Delete unwanted material from the harvest
Endorse or Reject the harvest
Submit the harvest to an Archive (if it has been endorsed)

The British Library specified additional multi-agency functionality to support its collaborative Web archiving within UKWAC, which is now built in to the product.

Figure 4: Choice of Quality Review tools:

Figure 5: WCT records logs of all harvesting operations, for tracking and problem solving in the Quality Review process

Figure 6: Example of a detailed log produced by WCT

Conclusion

The Web Curator Tool Project has very quickly produced usable software with all the components needed to start harvesting Web sites for addition to an existing digital repository.

One of its success criteria was to have provided the application ready for implementation by the two collaborating libraries in 2006. This was achieved. The National Library of New Zealand has since deployed Web Curator Tool and is using it operationally. The British Library has installed WCT in a test environment, but has been short of technical resource to take it further - we now plan to run trials to include UKWAC partners in Spring 2007.

Web Curator Tool open source software [2] is available from its location at SourceForge.net and contains:

Source code
Documentation: user and administrator guides, FAQ
Mailing lists

It is released under Apache License, Version 2.0. and the WCT Web site [2] contains further information for readers wishing to know more. Queries can be addressed to either UK or New Zealand staff [3].

Editor's note: Readers may be interested in a subsequent article: Jackson Pope, Philip Beresford. "IIPC Web Archiving Toolset Performance Testing at The British Library". July 2007, Ariadne Issue 52
http://www.ariadne.ac.uk/issue52/pope-beresford/

References

UK Web Archiving Consortium Web site http://www.webarchive.org.uk
The Web Curator Tool http://webcurator.sourceforge.net
Queries may be addressed to:wct@bl.uk or wct@natlib.govt.nz

Author Details

Philip Beresford
Web Archiving Project Manager
The British Library
Boston Spa
Wetherby
West Yorkshire
LS23 7BQ

Email: philip.beresford@bl.uk

Return to top