Web Magazine for Information Professionals

Missing Links: The Enduring Web

Alexandra Eveleigh reports on a workshop on Web archiving, organised by the DPC, JISC and UKWAC at the British Library on 21 July 2009.

This workshop, jointly sponsored by the DPC [1], JISC [2] and UKWAC [3], aimed to bring together content creators and tool developers with key stakeholders from the library and archives domains, in the quest for a legacy for the World Wide Web that is technically feasible and socially and historically acceptable.

Setting the Scene

Adrian Brown, Assistant Clerk of the Records at the Parliamentary Archives [4], set out the framework for ‘securing an enduring Web’ around the key elements of selection, capture, storage, access and preservation. He identified new selection challenges arising from today’s dynamic, personalised Web sites, and issues of ‘temporal cohesion’, where capture cannot keep pace with the rapid rate of content change. Capture tools therefore needed to evolve in line with the changing nature of the Web itself, and we needed to find technical solutions for the longer-term accessibility and maintenance of very large quantities of interlinked, complex data.

Hanno Lecher, from Leiden University [5], highlighted a more immediate concern: keeping reliable access to Web resources cited in academic publications. Whilst advocating the use of citation repositories to maintain copies of Web-published content, he noted that this approach is very labour-intensive, and suggested applications such as SnagIt [6] or Zotero [7], or the International Internet Preservation Consortium’s [8] WebCite service [9], as other options.

Figure 1: Adrian Brown gives the opening address

Eric Meyer of the Oxford Internet Institute [10] spoke about the World Wide Web of Humanities Project [11], which aimed to enable researchers to extract thematic collections from the Internet Archive [12], and to provide enhanced access to the associated metadata. Meyer also touched upon an identified need to move away from collecting snapshots of the Web towards more continuous data, in order to facilitate temporal studies on Web archives, such as the growth of news networks or the development of the climate change debate.

Creation, Capture and Collection

Helen Hockx-Yu, Web Archiving Programme Manager at the British Library [13], gave an overview of the software tools available to support and manage the Web archiving process, also noting gaps in current provision. Overall, she painted a picture of a Web archiving community always having to play catch-up with the inherent creativity of the Web itself. In terms of preservation of Web content, there is still little consensus over strategy, practices or the use of specific tools, although international collaboration in the field has led to some convergence on certain crawlers and to the development, by the IIPC (International Internet Preservation Consortium) [8], of the new WARC file format as an international standard, ISO 28500:2009.
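To give a flavour of the WARC container format mentioned above, the following is a minimal sketch, not drawn from the presentation, of how a single WARC record might be serialised: a version line and a set of named header fields, a blank line, the captured content, and a terminating pair of line breaks. The function name build_warc_record and the choice of a 'resource' record are illustrative assumptions; production crawlers write considerably richer records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Serialise one WARC 'resource' record (hypothetical helper, for illustration)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # kind of record
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique identifier
        f"WARC-Date: {timestamp}",                     # time of capture (UTC)
        f"WARC-Target-URI: {target_uri}",              # where the content came from
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # size of the block in bytes
    ]
    # Header lines are CRLF-separated and followed by an empty line.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The content block is followed by two CRLF pairs that close the record.
    return head + payload + b"\r\n\r\n"


record = build_warc_record("http://example.org/", b"<html></html>")
```

Because each record is self-describing, many captured resources can simply be concatenated into one file, which is what makes the format suitable as a container.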

Cathy Smith, Collections Strategy Manager at The National Archives [14], gave an overview of a recent research study looking at what audiences Web archives can anticipate and what the Web might look like as a historical source. Should Web archivists aim at building a holistic but shallow view of the whole UK Web domain, or harvest specific sites in depth, along thematic lines? Preliminary findings suggest that users would prefer to use a national Web archive, although this does not necessarily imply a single repository. Existing institutions could continue to provide access to local Web collections, but there should be coordination to eliminate potential overlaps arising from differing thematic, legal and geographical collecting remits.

Amanda Spencer and Tom Storrar, also from The National Archives, spoke about TNA’s Web Continuity Project [15], which combines comprehensive capture of UK central government Web sites with the deployment of redirection software to ensure persistent access from live sites to archived Web resources. The team has also been working to influence policy makers and content creators to promote best practice in Web site construction, leading to more successful harvesting of site content.
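The redirection idea can be illustrated with a short sketch. It is an assumption-laden illustration rather than a description of TNA's actual software: the archive prefix, the function name resolve and the use of a permanent (301) redirect are all hypothetical choices made for the example.

```python
# Hypothetical base URL of a Web archive; not a real service.
ARCHIVE_PREFIX = "http://archive.example.org/"


def resolve(requested_url: str, live_urls: set[str]) -> tuple[int, str]:
    """Return an HTTP status and location for a requested URL.

    If the page still exists on the live site it is served as normal;
    otherwise the visitor is redirected to the archived copy.
    """
    if requested_url in live_urls:
        return 200, requested_url
    # Assumption: the archive addresses a copy by appending the original URL.
    return 301, ARCHIVE_PREFIX + requested_url


live = {"http://www.example.gov.uk/current-policy"}
print(resolve("http://www.example.gov.uk/current-policy", live))
print(resolve("http://www.example.gov.uk/withdrawn-report", live))
```

The point of the design is that links published in the past keep working: the live site itself decides when to hand a request over to the archive.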

Panel shot of (left to right): Cathy Smith, Tom Storrar, Amanda Spencer and Helen Hockx-Yu [British Library]

Issues and Approaches to Long-term Preservation of Web Archives

Richard Davis, Project Manager, introduced the ArchivePress blog-archiving project [16], which is being undertaken by the University of London Computer Centre [17] and the British Library Digital Preservation Department. He described the complex (and expensive) Web harvesting tools currently available as ‘using a hammer to crack a nut’ when it comes to blogs; the ArchivePress team will instead seek to exploit a universal feature of blogs – newsfeeds – as the key to gathering blog content for preservation. The approach might later be adapted to harvest from Twitter.
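As a rough illustration of why newsfeeds are attractive for this purpose, the sketch below reads an RSS 2.0 feed and pulls out the fields one would want to preserve for each post. It is not the ArchivePress implementation: the sample feed, the function name harvest_posts and the choice of fields are assumptions made for the example, and a real harvester would fetch the feed over HTTP and cope with Atom as well as RSS.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS 2.0 feed standing in for a real blog's newsfeed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first-post</link>
      <pubDate>Tue, 21 Jul 2009 09:00:00 GMT</pubDate>
      <description>Hello, archive.</description>
    </item>
  </channel>
</rss>"""


def harvest_posts(feed_xml: str) -> list[dict]:
    """Extract one dictionary of preservable fields per feed item."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "content": item.findtext("description", default=""),
        })
    return posts


for post in harvest_posts(SAMPLE_FEED):
    print(post["published"], post["title"], post["link"])
```

Because the feed already delivers each post as structured data, there is no need to crawl and parse the blog's page layout, which is the 'hammer' the quotation refers to.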

Maureen Pennock, Web Archive Preservation Project Manager at the British Library, explained some of the issues involved in the longer-term preservation of Web content, beyond capture. Initiatives instigated by the British Library to protect the contents of the UK Web Archive [3] from obsolescence include a technology watch blog [18], a regular risk review of file formats held within the archive, coupled with migration to the container WARC format, and the creation of a Web preservation test-bed. Pennock also outlined some challenges for the future, such as the growth of closed online communities and of personalised Web worlds. Finally, she made the important point that ‘preservation is best if it begins at source’, emphasising the need to produce Web content which is optimised for harvesting.

Thomas Risse introduced the Living Web Archive (LiWA) Project [19], which seeks to develop the ‘next generation of Web content capture, preservation, analysis, and enrichment services’. The new approach will go beyond harvesting static snapshots of the Web, enabling the capture of streaming media and link extraction from dynamic pages, and will include methods for filtering out Web spam.

Jeffrey van der Hoeven spoke about work on the emulation of old Web browsers in Global Remote Access to Emulation-Services (GRATE) as part of the Planets Project [20], and within the framework of KEEP (Keep Emulation Environments Portable) [21]. He pointed out that the current generation of crawlers assumes a PC-based view of the Web. With the ever-increasing capabilities of mobile presentation devices, he suggested that in future the focus will need to change towards capturing content, with emulation used to recreate different views of the same content.

What We Want with Web Archives: Will We Win?

The workshop concluded with a roundtable discussion, following Kevin Ashley’s glimpse into the future of the Web’s past. He argued that future researchers will not just want to browse individual Web pages, but will want to exploit the inherent properties of Web content in aggregate – introducing the concept of ‘mashups in the past’. This assumes access to archived Web data in bulk, permitting machine-to-machine interaction with different sources of historical Web content.

Questions centred on how best to engage Web users in selecting and appraising Web content for preservation, obtaining permission for harvesting, and the potential impact of enhanced legal deposit legislation in the UK. Suggestions included the idea that popular Google searches might be used as one method of selecting content for capture, and a proposal to archive the UK domain name registry.

Slides of all the day’s presentations are available on the DPC Web site [22].

References

  1. Digital Preservation Coalition Web site http://www.dpconline.org
  2. Joint Information Systems Committee Web site http://www.jisc.ac.uk
  3. UK Web Archive http://www.webarchive.org.uk
  4. Parliamentary Archives http://www.parliament.uk/publications/archives.cfm
  5. The Digital Archive for Chinese Studies, Leiden Division http://leiden.dachs-archive.org/
  6. SnagIt Screen Capture Software http://www.techsmith.com/screen-capture.asp
  7. Zotero http://www.zotero.org/
  8. International Internet Preservation Consortium Web site http://netpreserve.org
  9. WebCite® http://www.webcitation.org/
  10. Oxford Internet Institute Web site http://www.oii.ox.ac.uk/
  11. World Wide Web of Humanities http://wwwoh-access.archive.org/wwwoh/
  12. Internet Archive http://www.archive.org
  13. British Library Web Archiving Programme http://www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html
  14. The National Archives Web site http://www.nationalarchives.gov.uk
  15. The National Archives Web Continuity Project http://www.nationalarchives.gov.uk/webcontinuity/
  16. ArchivePress Project Web site http://archivepress.ulcc.ac.uk/
  17. University of London Computer Centre http://www.ulcc.ac.uk/
  18. UK Web Archive Technology Watch http://britishlibrary.typepad.co.uk/ukwebarchive_techwatch/
  19. Living Web Archives http://www.liwa-project.eu/
  20. Planets Project Web site http://www.planets-project.eu/
  21. Keep Emulation Environments Portable http://www.keep-project.eu
  22. Digital Preservation Coalition: JISC, the DPC and the UK Web Archiving Consortium Workshop: missing links: the enduring web
    http://www.dpconline.org/graphics/events/090721MissingLinks.html

Author Details

Alexandra Eveleigh
Collections Manager
West Yorkshire Archive Service

Email: aeveleigh@wyjs.org.uk
Web site: http://www.archives.wyjs.org.uk
