Web-archiving: Managing and Archiving Online Documents and Records
Web sites are an increasingly important part of this country’s information and cultural heritage. As such, the question of their preservation through archiving is one of which organisations need to be increasingly aware. This event, organised by the newly-created Digital Preservation Coalition (DPC), brought together key organisations in the field of web archiving in order to assess the needs of organisations that archive their own and others’ web sites, to find areas of agreement, to highlight good practice, and to influence the wider debate about digital preservation.
Neil Beagrie, Secretary of the DPC, began the day’s proceedings by welcoming delegates to the event, the first on web archiving to be organised by the DPC. He stressed the importance of the issue, as did the first speaker, Catherine Redfern from the Public Record Office (PRO). Web sites are records, and as such need to be archived. Selection is also necessary, said Ms Redfern, but what criteria should be employed in such a process of selection? And how important is the capturing of the ‘experience’ of using the web site, given that the look and feel of a site are an intrinsic part of the record? It was important, concluded Ms Redfern, to accept that perfect solutions do not exist, and that flexibility may mean adopting different solutions for different web sites.
Brian Kelly of UKOLN followed, and emphasised the sheer scale of the challenge by looking at attempts to define and measure UK web space. Different organisations came up with different measurements, but one figure given was 3m public web servers with .uk in their URLs. Preserving web sites which we are unable to count will prove particularly difficult, he said, but perhaps the most important question was: at what rate is the UK web space growing? A number of the web sites of e-Lib projects were disappearing soon after their funding had finished. This led to a pilot study which came up with a number of conclusions about the way forward in this area. Brian Kelly also referred to the Internet Archive (www.archive.org/), which offers permanent access to historical collections that exist in digital format.
Comparisons with other international situations are important in this context, and Julien Masanes from the Bibliotheque Nationale de France (BnF) gave the French perspective on these questions. In France, the Government is currently in the process of modifying the law regarding legal deposit of online content. The BnF is currently researching the best way to manage procedures of selection, transfer and preservation, which could be applied on a large scale within the framework of the proposed law. Two large-scale projects are proposed as part of this ongoing research. The first has begun and involves sites related to the presidential and parliamentary elections that will take place in Spring 2002 in France. More than 300 sites have already been selected and the BnF collects about 30 Gb per week. The second project will be a global harvesting of the ‘.fr’ domain in June.
If the sheer scale of the amount to be archived presents a major challenge, it is one that the BBC, with a million pages on its web site, each regularly updated, faces as a matter of course. Cathy Smith of the BBC spoke about the huge logistical and legal problems that this can involve. The BBC’s Charter responsibilities mean that it must archive its content, while its message boards, live chat forums, etc. mean that Data Protection becomes a serious issue in this context too. Multi-media content, often created through non-standard production processes, adds further problems, while proposals to extend the period within which the public can make formal complaints from one year to three years have important consequences for the amount that will need to be archived. Ms Smith talked of the need for archived material to be directly accessible to users as a way of avoiding the ‘gatekeeper’ culture of traditional archives, and once again emphasised the fact that an archive needs to recreate the look and feel of the original record, since this is an important aspect of what the BBC does.
A number of reports from DPC members followed in the afternoon. Stephen Bury of the British Library spoke of some of the criteria used by the BL in its current archiving activities, given the lack of legal deposit requirements. These criteria include topicality, reflecting a representative cross-section of subject areas, etc. Stephen Bailey, Electronic Records Manager for the Joint Information Systems Committee (JISC), spoke of the JISC’s attempts to implement its own recommendations in this area with its current project of redesigning its own web site. The archive module of the new site will allow for identification and retention of key pages and documents and will also allow a greater degree of functionality for end users. Centralised control of the web records’ lifecycle will allow for greater uniformity but will place demands on content providers. Long-term preservation will, however, be a key requirement of the new site, he explained.
Steve Bordwell of the National Archives of Scotland asked whether we should even be attempting to preserve web sites, and whether we should rather be focussing on content. Snapshots or web cams might provide us with the look and feel of archived web sites, he suggested. David Ryan of the PRO looked at the project to preserve the No. 10 web site, and asked what an acceptable level of failure might be in terms of archiving and preservation procedures, while Kevin Ashley of the University of London Computer Centre (ULCC) suggested that we need to consider precisely what the purpose of Government web sites is and what their significant properties are, in order to formulate criteria for selection.
Robert Kiley spoke about the joint Wellcome/JISC archiving feasibility study which is looking at archiving the medical Web. Once again, the sheer volume of the medical Web presents significant problems for selection: quality would be one criterion, but how should we judge quality? In addition, many databases are published only electronically, while discussion lists and e-mail correspondence are potentially of immense importance to future generations of researchers. The study will produce recommendations, Mr Kiley reported, on how some of these questions can be answered, including those of copyright, costs and the maintenance of any medical web archive.
Discussion throughout the day ranged across a number of areas, but questions of standards, selection criteria and cost dominated the proceedings.
Neil Beagrie and Philip Pothen