We have all heard at least some of the extraordinary statistics that attempt to capture the sheer size and ephemeral nature of the Web. According to the Digital Preservation Coalition (DPC), more than 70 new domains are registered and more than 500,000 documents are added to the Web every minute. This scale, coupled with its ever-evolving use, presents significant challenges to those concerned with preserving both the content and context of the Web.
Co-organised by the DPC, the British Library and JISC, this workshop was the third in a series of discussions around the nature and potential of Web archiving. Following the keynote address, two thematic sessions looked at ‘Using Web Archives’ (as it is only recently that use cases have started to emerge) and ‘Emerging Trends’ (acknowledging that Web archiving activities are on the increase, along with a corresponding rise in public awareness).
The keynote address presented a solution to navigating the Web in the past, and how this might be applied to the preservation and examination of the scholarly record. The Memento Project aims to provide a solution to a basic problem: the Web was built with the notion of the 'perpetual now', and at any moment in time it is only possible to obtain the current representation of a resource through a URI. When using a past resource, you cannot navigate in the past; the links take you to current versions of resources, rather than providing the contemporary context. The Memento Project seeks to tackle this problem by bridging the gap between an 'original' resource (for which we want to find a prior version) and a prior resource (or 'memento') by introducing a ‘TimeGate’, an intermediary resource that links them at the level of the HTTP protocol. Working tools are already available, including a browser plug-in that allows you simply to select a date in the past.
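The TimeGate idea can be illustrated with a small sketch. In the Memento protocol a client sends the desired date in an `Accept-Datetime` request header and the TimeGate answers with a redirect to a suitable memento. The toy version below runs entirely in memory and makes no network calls; the capture index, the `archive.example` addresses and the "closest capture in time" policy are assumptions chosen for illustration, not a description of the project's own tools.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical capture index: original URI -> list of (capture time, archived URI).
CAPTURES = {
    "http://example.org/": [
        (datetime(2008, 3, 1, tzinfo=timezone.utc),
         "http://archive.example/20080301/http://example.org/"),
        (datetime(2010, 6, 15, tzinfo=timezone.utc),
         "http://archive.example/20100615/http://example.org/"),
    ]
}


def accept_datetime_header(when):
    """Format a timezone-aware datetime as an HTTP date for Accept-Datetime."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timegate(original_uri, request_headers):
    """Toy TimeGate: return (status, response headers) for a dated request."""
    captures = CAPTURES.get(original_uri)
    if not captures:
        return 404, {}
    wanted = parsedate_to_datetime(request_headers["Accept-Datetime"])
    # Assumed policy: pick the capture closest in time to the requested date.
    _, memento_uri = min(captures, key=lambda c: abs(c[0] - wanted))
    return 302, {
        "Location": memento_uri,
        "Vary": "accept-datetime",
        "Link": f'<{original_uri}>; rel="original"',
    }


header = accept_datetime_header(datetime(2009, 2, 1, tzinfo=timezone.utc))
status, response = timegate("http://example.org/", {"Accept-Datetime": header})
```

Here a request dated February 2009 is redirected to the March 2008 capture, because that capture is nearer in time than the one from June 2010. The same lookup applied to every link on an archived page is what would let a reader stay "in the past" while navigating.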
William Kilbride opens proceedings.
The characteristic of publications from the ‘paper era’ is that they are frozen in time. This has changed with the advent of the Web: now we are not only citing a paper, but also Web resources, which both exist within a broader Web context. All of this is subject to the blight of the ‘eternal now’, in that you can only see a publication in its current context. Is it possible to recreate context from the time of publication? Currently we can’t recreate that context because we are only archiving the papers themselves. Digital Object Identifiers (DOIs) provide redirection for link persistence to documents, but redirections change and new resources emerge; the browser only sees the current redirection and not the previous resource we wanted. You therefore need a TimeGate for the DOI itself, providing a link to the right resource. In conclusion, Herbert stated that we should be looking beyond the papers themselves and towards the context that surrounds them, and this is crucial for examining the scholarly record.
Herbert van der Sompel takes questions on the Memento Project.
Recent research by the Oxford Internet Institute reveals that there is reason to be dissatisfied with a ‘persistent gap’ between researchers and the creation of Web archives. There is a fear that Web archives could become the ‘dusty archives’ of the future. This is ironic, since so many of the dusty archives of the world are now being moved onto the Web. What steps can be taken to engage researchers? Core user groups (historians and social scientists) appear to lack both engagement and understanding concerning Web archives. This is a challenge: we acquire disciplinary bias early in our careers, which discourages innovation, and Web archives are not part of that acquired scholarly behaviour. Yet there are massive events going on in the world; and if a scholar is interested in collecting evidence from the Web of what is happening, the pity is that most people do not know how. We need someone to help people archive this material on an individual basis and to develop ways of engaging them in collecting.
Amanda and Tom explained that The National Archives (TNA) started archiving UK Government Web sites in 2003. Initially this provided only a ‘snapshot’ of Government Web publications, but it now includes all government departmental groups, public enquiries, royal commissions, and even some NHS Web sites. Parliamentary librarians noted early on that URLs cited in UK Government publications could no longer be found, prompting TNA to develop the Web Continuity Initiative for link persistence within important government information, employing software to redirect users to the UK Government Web Archive. Having identified the limitations of its current conventional search solution, TNA is developing a semantic search, to be launched next year with a user interface for non-technical users and an API for developers.
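The redirection behind this kind of link persistence can be sketched in a few lines. This is only a guess at the general shape, assuming a site that knows which of its URLs are still live; the archive address and the function name are hypothetical and do not describe TNA's actual software.

```python
# Hypothetical archive address, used only for illustration.
ARCHIVE_PREFIX = "https://webarchive.example/"


def resolve(url, live_urls):
    """Serve a live URL directly; otherwise redirect to the archived copy."""
    if url in live_urls:
        return 200, url
    # The page has gone from the live site, so point the user at the archive.
    return 302, ARCHIVE_PREFIX + url


live = {"http://www.department.example/current-policy"}
print(resolve("http://www.department.example/current-policy", live))
print(resolve("http://www.department.example/old-report", live))
```

The second call returns a redirect to the archive rather than a 'page not found' error, so a URL cited in an older publication still leads somewhere useful.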
Peter Webster presented a curatorial view of Web archiving, describing examples of guest curation in the UK Web Archive, covering subjects such as: independent artist-led organisations, digital storytelling and the politics of religion in the UK since 7/7. Web sites of interest are those which attempt to represent previous ages, and touch upon history and the media, or those which illustrate major events, new developments in national life, and older movements that are ending. The implications of legal deposit will be significant: if new legislation is passed next year this will remove difficulties over permissions. Currently, only 25-30% of requests for archiving are successful. Legal deposit would mean more comprehensiveness, but the focus then shifts from selection to curation. He felt there was an opportunity to use the crowd; the social media tools are there for users to browse and tag materials that they find interesting.
Data mining is on its way and real user value means data-level exploitation of Web archive content. Instead of page-level access, a new interface now has more access options, including a 3D visualisation wall, word cloud access to special collections and an N-gram search function for instances of terms over time. By extracting the metadata behind images, there is also an enhanced image search. Collections are curated before they are collected, so they are broken down into various subject headings. It is a relatively small archive, containing only 1% of UK domains. To demonstrate data mining, 42,000,000 relationships were identified between Web pages and post codes. This then revealed the density of post code records across the UK. If, for example, they were filtered by a category like ‘business’ and some associations disappeared between 2007 and 2009, could it be possible to map the recession?
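The post code demonstration can be approximated with a short sketch. It assumes archived pages are available as (year, text) pairs, uses a deliberately simplified post code pattern, and groups matches by post code area (the leading letters); none of this is taken from the archive's own tooling, and the sample pages are invented.

```python
import re
from collections import Counter

# Simplified UK post code pattern (an assumption; real validation is stricter).
POSTCODE = re.compile(r"\b([A-Z]{1,2})\d[A-Z\d]?\s?\d[A-Z]{2}\b")


def postcode_areas(text):
    """Return the post code area (leading letters) of every post code found."""
    return [match.group(1) for match in POSTCODE.finditer(text.upper())]


def area_counts_by_year(pages):
    """Count post code mentions per area for each capture year."""
    counts = {}
    for year, text in pages:
        counts.setdefault(year, Counter()).update(postcode_areas(text))
    return counts


def lost_links(counts, start, end):
    """Areas with fewer post code mentions in `end` than in `start`."""
    before = counts.get(start, Counter())
    after = counts.get(end, Counter())
    return {area: before[area] - after[area]
            for area in before if after[area] < before[area]}


# Invented sample pages, standing in for archived 'business' sites.
pages = [
    (2007, "Acme Ltd, 1 High St, Leeds LS1 4AP. Branch: Manchester M1 1AE"),
    (2007, "Shop at LS2 7EW"),
    (2009, "Acme Ltd, Leeds LS1 4AP"),
]
counts = area_counts_by_year(pages)
print(lost_links(counts, 2007, 2009))
```

On this sample the LS and M areas each lose one mention between 2007 and 2009. Run over a real collection filtered to a single category, the same comparison would show where associations thinned out over the period.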
Kris Carpenter presented a forward-looking perspective on the Internet Archive’s activities. She identified two trends: first, individuals produce 70-80% of Web content; second, everyone wants to pull together the resources they use. How do we address diverse modes of access? The Internet Archive is trying to develop scalable architecture to mine data on a large scale to get clues about the content within a resource. There’s an interest in taking a wide range of data types and making them available, but how do you make them a part of the broader ecosystem? In the case of social networking, how will we represent it for examination twenty years from now? Collecting this content is easy; the challenge is re-rendering it from an archive to show the original resource by which it was referenced. You need a hybrid architecture that reveals all the elements required to re-render a resource (often composed of around 35 different files). So far, the Internet Archive has identified 200-300 million unique entities that they could represent as aggregations for study.
There are a number of challenges in archiving the social Web: there are many communities; there is uncertainty of future demand; and the technical challenges are significant. You can’t rely on a collect-all approach; collection needs to be filtered and measured against criteria of demand: community memories that reflect communities’ interests. What is a reasonable way to create valuable collective memories? Arcomem is trying to solve this by relying on crowds for intelligent content appraisal, selection, and preservation. The overall aim is to create an incrementally enriched Web archive that allows access to various types of Web content in a semantically meaningful way. The main challenge is to extract, detect and correlate events and related information from a large number of varied Web resources. Archivists can select content by describing their interests using reference content, or by a high-level description of relevant entities. Arcomem will be interoperable with external resources through linked open data, event and entity models.
BlogForever is an EU-funded blog archive. Some might consider blogging ‘old hat’, but it appears that this particular communications paradigm is here to stay. You can preserve the content of a blog page, but somehow it’s not the same as it is in its original context. The objective of BlogForever is to develop robust digital preservation management and dissemination facilities, while capturing the rich essence of blogs. The outcomes are the definition of a generic data model for blog content metadata and semantics, as well as the definition of digital preservation strategies for blogs. Unstructured blog information is collected and then given a shape that can be interrogated in all sorts of ways. A survey was conducted of bloggers’ attitudes to preservation: 90% never used an external service to preserve their blog, but relied on the blog provider for preservation; 30% used re-mixed data, so this could raise permissions issues. The result will be a Weblog digital archiving solution.
Hanzo Archives, and several other companies, are now offering commercial archiving services, confirming commercial interest in the field. Commercial organisations are beginning to worry about research problems and many of their Web pages are extremely complicated. Companies have large amounts of digital data and are interested in discoverable content, such as trying to pull out social identities. The new trend emerging in the last few months is that people are coming to commercial archives wanting to collect not just for the sake of the content, but because they are interested in data and scale. Notably, they perceive the term ‘archive’ as an old word, a dead word. In many cases the individual pages are pretty dull, but it’s about the big data. There is a huge tide of these data, and there are not enough archives to collect them. Hanzo will be opening up tools via an API to bring Web archiving to everybody in the near future.
Kevin Ashley, Director of the Digital Curation Centre, opened the panel session, remarking that in many ways the expectations held during the last discussion two years ago have been exceeded: what we want to do we can do, and in some cases we have gone beyond. Thinking of the Web as data can help to address the problem that we can’t keep everything. Perhaps in some cases we can afford just to keep the wrappers around content, showing us what links to what, rather than exactly what was said. Is there a role for these stripped-down cases? Access isn’t about a new user interface; it’s about prompting different means and methods for access.
Kevin Ashley leads off the panel session and discussion.
Martha Anderson from the Library of Congress (LOC) commented that no one can serve up the whole of Twitter (not even Twitter themselves), and so of course the whole of the Web will not survive. Web pages as a medium are already dying and the idea of a Web page is similar to our idea of a bound book; we find it very hard to imagine something with which we are so familiar changing before our very eyes. This revelation came for the Library of Congress during a project to document the Japanese tsunami. LOC staff found that it was the social media that told the story, not Web pages constructed to commemorate the disaster. What people are interested in is becoming more and more granular. In Web archiving our practice is shaped by the media with which we are dealing.
Kris Carpenter noted that Google has been able to preserve its original search engine code. Computer scientists in the US are also interested in working on the now defunct Altavista search code. Herbert van der Sompel added that there is certainly confusion in the user experience within Web archives, especially with different versions of the same documents. The technology that Google has already displayed to help put search results under one heading would be better, rather than sending users straight into an archive to become lost. At that level, one could use search engines more to access archives; it’s not asking for a huge leap to take place since the technology already exists. Though search engines play a role in search, they don’t play a role in Web archiving. You can’t use their cache programmatically. They could play a larger role, but currently do not.
Peter Webster wondered whether, while the Web is decentralised on an individual level, it has centralised on a community level. Martha Anderson answered that people used to have personal Web sites, now they have blogs that serve as an aggregation facility for many of them. When Library of Congress staff started archiving, they were collecting pages, because they thought people would be looking at pages. They also found that when researchers came, they were not interested in pages and pictures, but brought scripts on pen drives that they ran across the aggregate. We see this accelerated now with apps on portables because the Web pages do not exist anymore; the data is pulled from them and delivered instead. With so many more people creating content now, does this mean that they will be more concerned about archiving? The BlogForever survey said that 90% of bloggers thought their blog provider would preserve their blog. What do we say to them? We have said: ‘You realise your blog won’t be preserved at the moment,’ to scare them, and now we want to focus on the positive aspects.
Martha Anderson, Director of Program Management at the Digital Preservation Program, Library of Congress, reported that her programme discovered several years ago that it was fine to convince congressmen that this was important but they needed grass-roots advocacy, so they introduced personal digital archiving. They have talked to thousands of people, often focusing on personal photographs, saying, ‘You need to think about what you really care about and take some measures.’ Perhaps some day we’ll see posters on buses that say, ‘Preserve! Are you saving your digital stuff?’ But there is a cultural barrier: many people don’t think this is important enough to campaign for. Personal digital archiving is probably the way to go – one person at a time getting the message out.
William Kilbride of the DPC wondered about a gap regarding Google, or YouTube, or Flickr: all it takes is for Flickr to switch off and everything disappears. Should we be doing more in that space, too? Kevin Ashley responded that, while one approach is that you go to providers, the other is that you go to individuals and say, ‘If that’s where you’re storing it, in Flickr, it’s not safe.’ It’s been found that during a presentation of these issues to 8-9-year-olds, they understood the risk to digital content, and the awareness was there. At one point there were some companies which had been targeting individual bloggers – offering a permanent archive of their blog for a one-off sum of money. This was actually patented. Are those companies still here now? And did they make any money? Kris Carpenter added that there are some changes in private sector thinking about these issues, because the consumer has started wanting to archive: Wordpress has been interested in developing tools that allow a user to ask for their blog to be archived to a specific repository. The time is right to take the lead in this community – not forcing archiving on anyone but presenting the option.
Helen Hockx-Yu, Web Archiving Programme Manager at the British Library, asked where the focus of our efforts in this field should now lie. Martha Anderson provided an example of the organisation Global Voices, which collects citizen blogs from countries involved in the Arab Spring and events all over the world. Its representatives visited the Library of Congress with concerns about preserving their content. It was a curated collection, but they were not librarians or archivists, and they were concerned. There was the potential: hundreds of blogs in one collection that don’t need to be looked at; people should take advantage of the community out there, and take a broader approach. Rather than using a single interface for the whole Web, we could encourage communities to devise bespoke methods for access. The bridge between institutions and communities is not as strong as it could be.
Herbert van der Sompel echoed the day’s prediction that data mining will be ‘really big,’ adding that this was not to advocate against keeping traditional navigation. Kevin Ashley recalled that at the first DPC discussion on this in 2002, there was already an algorithm better than human selection. Now we have far better technology, yet there’s a feeling that somehow we can’t quite trust this in libraries and archives. It’s not that there is no role for humans, but we have these technologies, and in ten years not much has changed: they are still not accepted by librarians.
Neil Grindley, Digital Preservation Programme Manager at JISC, asked whether, despite the grand challenges and the divergence in how we need to work illustrated by the preceding talks, we should also look at developing standards. There were lots of ways of doing this; should we think about convergence? We’ve got the WARC (Web ARChive) format; do we need more? Helen Hockx-Yu briefly updated us on the fact that both ISO and BSI have been working on things, putting together a report that could be used to assess Web archives, though probably more from a service provision point of view. It is work in progress.
William Kilbride questioned Eric Meyer directly about his thoughts for the future. His response was that there is a lot of room for optimism, certainly based on the examples presented at this workshop. The next step is to get such examples into communities that can see uses for them. Time and again, the best way to communicate this is for peers to present to each other within the same domain. It’s difficult to just be shown these tools, but if you present it to historians as a way of answering a historical question, for example, then there may be more uptake. Those are the next steps. It needs to percolate out to those who will dream up lots of clever things to do with these archives given the right tools and incentives.
As Mark Williamson of Hanzo Archives put it during his talk, the Web is the first medium in history that has no way of being inherently ‘kept’; yet it is where the majority of human creation is now happening. The significance and potential of Web archiving was reflected in the large number of delegates from all over Europe and the United States in attendance, many of whom noted their appreciation of the range and depth of the presentations.
While large technological advances have been made and ambitious projects exist to tackle the intricacies of Web archiving (in particular the social Web), it’s clear that there is concern about the uptake of these advanced technologies in host institutions and the apparent gap between Web archives and their primary user groups. In particular, most of the presentations alluded to the emergence of the analysis of large data aggregates extracted from Web archives, which was not a widely considered use during the initial Web archive developmental phase some ten years ago.
Having identified these challenges, it was widely accepted that we are in the midst of a new phase in Web archiving which offers many opportunities for growth, evolution and user engagement. This was a well-organised, informative and motivating day that provided delegates with some powerful means and incentives to push Web archiving forward.
Matthew Brack is a student on the MA in Digital Asset Management at the Department of Digital Humanities, King's College London, and works as Digitisation Support Officer at the Wellcome Library. His current research focuses on bridging the gap between digital content and the user, and innovation within digital collections. He blogs on these topics at: http://mattbrack.blogspot.com/