Preservation of Web Resources: Making a Start
A university's Web site is typically an honest reflection of the university, which is often an uncomfortable state of affairs for its managers. I was reminded of this as I negotiated my way from Senate House's cycle bays to the Dr Seng Tee Room at the University of London. I had arrived in time, but Reception – one person behind wood and glass – thought I was looking for Dr Seng Tee. A 404. Left to my own wits, I soon found myself in the Library and entering what appeared to be a private study. Expecting a 403, I was rescued by the murmur of a crowd in an adjacent antechamber and arrived just in time to pour a cup of tea and secure my place – reserved through consensus – at the centre of the front row. Senate House is grand but it has seen better days. It is not intuitive to use, so intuition is required, and familiarity leads to a quirky fondness. Had the organisers deliberately chosen the venue as a metaphor for the modern university Web site? I found myself in the right state of mind to ponder Web preservation as the workshop's Chair – Marieke Guy of UKOLN – captured our attention.
The JISC-PoWR (Preservation of Web Resources) project began on 28 April 2008 and will run until the end of September 2008. It is being undertaken jointly by UKOLN at the University of Bath and the ULCC Digital Archives Department. Its aim is to
"organise workshops and produce a handbook that specifically addresses digital preservation issues that are relevant to the UK HE/FE web management community".
Preservation of Web Resources: Making a Start was the first of three workshops, being run on 27 June (London), 23 July (Aberdeen) and 12 September (Manchester). As its title suggests, this was the start of the project's engagement with the practitioners it seeks to support.
Using presentations and discussion groups, Web preservation as a concept was addressed, its importance considered, and its technological, institutional and legal challenges explored. The programme – which differed slightly from that which was advertised – was as follows:
- Morning (10:00 to 13:00):
  - Preservation of Web Resources Part I by Kevin Ashley (ULCC)
  - Challenges for Web Resource Preservation by Marieke Guy (UKOLN)
  - Bath University Case Study by Alison Wildish and Lizzie Richmond (University of Bath)
- Afternoon (14:00 to 16:00):
  - Legal issues by Jordan Hatcher (opencontentlawyer)
  - Preservation of Web Resources Part II by Ed Pinsent (ULCC)
  - ReStore: A sustainable Web resources repository by Arshad Khan (National Centre for Research Methods)
In addition to the presentations, two breakout sessions – one in the morning and one in the afternoon – distributed the 35 or so delegates into two groups to discuss 'What are the Barriers to Web Resource Preservation?' and 'Preservation Scenarios'. It was clear from Marieke's introduction that we were the initial participants in a process that would lead to a practitioner's handbook. Was this another metaphor or will September reveal the sort of tangible desktop companion I am now less ashamed to admit that I miss?
Preservation of Web Resources Part I
Kevin Ashley's genuine interest and extensive knowledge enabled him to deliver an information-packed and thought-provoking talk. The first in a series of workshops is as much about thinking as dialogue, and Kevin's talk certainly delivered on the former. He acknowledged that practitioners are faced with a variety of barriers – real or perceived – such as a lack of policies and procedures, or a lack of resources. Web preservation is complex and there's no obvious model to follow. Practitioners need guidance and support. By 'practitioners' he meant a range of professionals, not just Web managers; records managers and archivists are two of the most obvious. Recognising the differences between professionals, the handbook will include a quick guide to Web management for records managers and a quick guide to records management for Web managers.
The core theme of Kevin's presentation was the importance of policy, and of working within existing policies such as a retention policy. Such policies are typically generic – for instance, talking in terms of 'information' – and therefore need to be translated, or adapted, to address the preservation of Web resources. The adapted policy must cover all Web resources and must stand the test of time without the need for endless revision.
Prioritisation is fundamental to successful preservation – keeping everything is rarely possible. Kevin referred to what we must, could or should not preserve. This reminded me of the MoSCoW model of prioritisation: Must do; Should do; Could do; Won't do. Without policies, practitioners have little to guide their decisions about what must, should, could and won't be preserved, let alone how.
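To make the idea concrete, here is a minimal sketch of MoSCoW-style triage applied to a Web inventory. It was not presented at the workshop: the resource names, the `Priority` enum, the `triage` function and every assignment are invented for illustration, and in practice the assignments would come from local policy.

```python
from enum import Enum


class Priority(Enum):
    """MoSCoW categories, ordered from most to least pressing."""
    MUST = 1
    SHOULD = 2
    COULD = 3
    WONT = 4


def triage(resources):
    """Group resource names by their MoSCoW priority.

    `resources` maps a resource name to a Priority. Both the names and
    the assignments are hypothetical; a policy would supply real ones.
    """
    groups = {priority: [] for priority in Priority}
    for name, priority in resources.items():
        groups[priority].append(name)
    return groups


# Hypothetical inventory, for illustration only.
inventory = {
    "undergraduate prospectus": Priority.MUST,
    "departmental news pages": Priority.SHOULD,
    "event microsite": Priority.COULD,
    "institutional repository": Priority.WONT,  # assumed managed elsewhere
}

groups = triage(inventory)
for priority in Priority:
    print(priority.name, groups[priority])
```

The point of the sketch is only that the grouping is mechanical once the assignments exist; deciding the assignments is the part that needs policy.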
We were shown some complex conceptual graphs of Web sites sourced from Martin Dodge's Cyber Geography Research Web site at the University of Manchester. I'm doubtful of the value of sitemaps for large Web sites and I wasn't quite sure about Kevin's suggestion that our selective archiving of a Web site should include an archive of its sitemap. My doubt stems from the lack of standards when it comes to sitemaps, especially large sitemaps whose merits are usually artistic rather than practical. Oddly enough, the Cyber Geography Research pages are themselves 'preserved' – updates are no longer being made. But Kevin's underlying point is valid: we do need – in some way – to capture an accurate summary of a Web site as a whole.
Finally, Kevin outlined the structure of the handbook which, in addition to guides for records managers and Web managers, includes: assessing the institution's technical capacity and ability to implement; understanding who wants what preserved and why; the tools available; how to manage preserved resources; and implications of change.
Questions focused on which media the handbook should be provided in, for example wiki or print; how we recreate the 'Web experience' of the time; ensuring the handbook remains pragmatic; and whether or not a central preservation service would be a better approach.
Challenges for Web Resource Preservation
Marieke Guy summarised a long list of barriers that reminded me why Web preservation is 'over there in a dark corner': plenty of nettles to be grasped.
She pointed out the tacit assumption that digital media are intrinsically preserved by the very nature of the medium. After all, the bits and bytes themselves don't decay even if the computing hardware is replaced. Moreover, Web managers operate under pressure and are busy with the here and now. The long term is overshadowed by the short term as the urgency to publish dominates the day.
Web sites are pervasive and touch all parts of an organisation. Responsibilities are unclear. Who should make the decision as to what gets preserved and who should do it? The publisher? The author? The copyright holder? Who decides? The Web is complex, transient and dynamic, and it's getting bigger with the forthcoming introduction of new TLDs by ICANN. Technology is constantly changing and Web pages are composed of complex parts – images, movies, text, etc. There's the need for understanding between the relevant practitioners with different nomenclatures, let alone regulations and procedures.
Marieke referred to the cardinality of Web pages: there is one version of a page on a server but thousands of versions on users' PCs, each prone to subtle variations due to different browsers, screen settings and so on. Should the source be archived or the end-user's experience of that source? What is a Web site? What is a Web page? Marieke quoted Clay Shirky: "Are we preserving the bits or the essence?" To what extent does preservation of Web resources require preservation of the underlying hardware and software? Software emulation was suggested as a possible solution to this. Two significant categories of barriers are legal issues (e.g. IPR) and Web 2.0 (which will be looked at in more detail in the second workshop).
In summary, Marieke stressed the need for ownership, and called for recognition of Web preservation as 'our problem' – practitioners need to grasp the nettles and this requires an interdisciplinary approach.
Bath University Case Study
Alison Wildish and Lizzie Richmond (University of Bath) offered a frank insight into the relationship between the archivist (Lizzie) and the Web manager (Alison). No doubt the situation is common to many organisations where such roles exist: if the relationship exists at all, it's probably no more than acquaintance. Their presentation was delivered in the form of a smooth double act which was admirably compelling given the topic. By acknowledging their differences, they put on show the benefit of establishing a professional working relationship between disparate professionals. This delivered a positive exemplar for delegates to try to recreate locally.
Lizzie stressed her lack of IT knowledge and skills, and her resistance to it. Her primary focus is the archive. Alison stressed the user experience and meeting the needs of users – the Web as a medium of communication. From what I could gather, their relationship existed because they'd been asked to deliver the presentation, but it was obviously working well. The value of archives was illustrated by Lizzie through a series of scanned images of the front cover of the University of Bath's undergraduate prospectus over the last 10 years. Seen in rapid succession, the wider context or 'spirit of the times' was revealed through variations in graphic design. This was like watching a succession of images transform into a single moving image – almost a time-lapse film.
Through the dual speaker format, some of the barriers mentioned by Marieke were revealed, but in context, which worked well to reinforce an understanding of Web preservation for those willing to tackle it.
In closing, Alison emphasised that the distinction between the publication and the record is blurred and that a realistic and pragmatic approach is required. Significant additional resources are unlikely to be available so re-using what practitioners have available is key to successful strategies locally.
Legal Issues
Jordan Hatcher (opencontentlawyer) stressed that he was a lawyer but licensed to practise in the US rather than the UK, and a 'techy' amongst lawyers but not amongst techies. His contribution was a calm and effortless guided tour of the legal landscape.
As well as applying to published (or live) content, UK law – of course – applies to archived content. Some of the areas relevant to Web preservation were highlighted: the Data Protection Act, particularly the fifth principle concerning the retention of personal data; potentially libellous content, especially with respect to Web 2.0 content; adult content and/or obscene content; Public Order Act; Terrorism Act; Freedom of Information Act; intellectual property rights, including patents, trademarks and copyright; and third-party Web sites, especially for hosting services.
During his run through these legal areas, Jordan emphasised the importance of licences in the context of copyright, and introduced the Creative Commons as a model to follow. He also discussed third-party Web sites which, he warned, are typically commercial enterprises whose primary focus is maximising profit and minimising liability. He encouraged practitioners to refer to terms of service, a point illustrated with reference to Facebook, which permits copying content for personal non-commercial use, but not redistribution to others. The implication for those first posting original work in Facebook is obvious.
While the legal aspects may appear daunting, Jordan's message was simple: 'Don't Panic'. He advocated a risk management – rather than risk averse – approach, i.e. avoid, reduce, accept or transfer risks.
Questions focused on removing content from archives for legal reasons; legal constraints on a centralised archiving service; directories; and permitting and/or agreeing to Web sites being archived.
Preservation of Web Resources Part II
Ed Pinsent provided us with a succinct presentation containing a great deal of valuable content. He advocated a labour-saving and pragmatic approach.
Ed ran through the types of Web resources that would be candidates for preservation, from prospectuses through to the content of e-learning systems. He also emphasised the need to exclude certain types of content, echoing the MoSCoW model. Top of Ed's list of exclusions were the sorts of resources that are already well managed, e.g. institutional repositories and digital libraries (including image catalogues).
He then ran through some of the drivers for Web preservation:
- Uniqueness of resources
- Audit requirements
- Financial value of resources
Again, the message that Web preservation requires teams of practitioners was communicated, and Ed warned against practitioners 'going it alone'. He suggested the following strategies:
- Protect your Web site in the short to medium term. If there are digital asset initiatives in-house, try to factor the Web site into them, either in terms of its production/provision, or in terms of archiving.
- Manage your Web site with particular attention to retention and disposal.
- Preserve your Web site either in whole or part.
In summary, Ed's advice was: identify the resources to be preserved, i.e. what is in scope; don't go it alone; and choose the approach that is appropriate to your constraints, particularly with regard to resources.
ReStore: A Sustainable Web Resources Repository
Arshad Khan of the National Centre for Research Methods raced through a description of ReStore, which seeks to preserve Web resources from the hardware up. ReStore stems from a suggestion by the Economic and Social Research Council (ESRC) to provide a platform that would allow Web resources resulting from ESRC-funded projects to be preserved beyond the end of the project's lifetime. Importantly, ReStore is currently a prototype, with a view to becoming a sustainable service to which Web resources can be migrated.
Currently, ReStore is using case studies to develop the proof of concept. From them the standards constraining the way Web sites are produced and provided can be developed. This is important because ReStore is recreating the technical environment to provide a new home for Web resources, dynamic or otherwise.
Arshad revealed what appeared to be a resource-intensive approach which required analysis followed by migration. This approach takes preservation close to the extreme of recreating a Web resource's technical environment in order to provide the resource in a sustainable manner. The details of operating systems, databases, scripting languages, and a whole host of technical matters are investigated in detail as part of the process for accepting a resource for preservation.
The basic premise is that this costly, resource-intensive approach is justified in the case of specific resources that are of sufficient value.
Conclusion
This – the first of three workshops – emphasised the importance of the archive to effective Web management. Although it may not be apparent, the absence of Web resources from the archive is having a detrimental impact on the here and now, and it can only get worse.
We can be confident that the archive is the responsibility of the archivist; the Web site the responsibility of the Web manager. However, Web resources which should be in the archive, and under archivists' control, are not. This creates a significant additional burden for Web managers with an ever expanding Web presence to manage, and a dilemma for the archivists in that a significant proportion of the archive is no longer under their control. For the organisation, and especially the records manager, this introduces a variety of risks, such as the persistence of personal data without good reason.
Once published, a Web resource will be retained – i.e. kept in use – until it is either deleted or archived. The default position of retention is not tenable. Existing policies – such as an organisation's retention policy – need to be translated or adapted to guide the management of these processes. Moreover, this requires a virtual team of practitioners – especially the archivist, records manager, IT manager, and Web manager – to develop and implement them. Web preservation should be part of operational work and not a project in and of itself. Rather, the project is to embed Web preservation into operational work.
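As a rough illustration of what translating a retention policy into a Web management rule might look like, the sketch below reviews a single resource. It is an assumption-laden example rather than anything shown at the workshop: the `review` function, the retention periods and the 'archival value' flag are all hypothetical stand-ins for whatever a real policy specifies.

```python
from datetime import date


def review(published, retention_years, has_archival_value, today):
    """Return 'retain', 'archive' or 'delete' for one Web resource.

    All parameters are hypothetical: a real retention policy would
    define the period and who judges archival value.
    """
    expiry_year = published.year + retention_years
    try:
        expiry = published.replace(year=expiry_year)
    except ValueError:
        # Published on 29 February and the expiry year is not a leap year.
        expiry = published.replace(year=expiry_year, day=28)

    if today < expiry:
        return "retain"  # still within its retention period
    # Past the retention period: the default of simply keeping it is
    # what the policy is meant to replace with an explicit decision.
    return "archive" if has_archival_value else "delete"


# Illustrative calls with invented dates and periods.
workshop_day = date(2008, 6, 27)
print(review(date(2001, 9, 1), 6, True, workshop_day))
print(review(date(2006, 1, 10), 6, False, workshop_day))
```

The rule itself is trivial; the work the workshop pointed to lies in agreeing the inputs across the archivist, records manager, IT manager and Web manager.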
The challenges are significant, especially in terms of how to preserve Web resources. No doubt the institutional repository will play a role. Arguably, the absence of a solution to the preservation of Web resources leads to either retention or deletion, both of which carry risks. The workshop's core message to practitioners was therefore to start building an internal network amongst relevant practitioners as advice and guidance emerge.
My thinking about this matter was certainly stimulated and I look forward to the next two workshops, and the handbook that will result. Web preservation is an issue which was always important but now grows increasingly urgent.
References
- 404 is the error code generated by Web servers when a page cannot be found.
- 403 is the error code generated by Web servers when access to a page is forbidden.
- JISC-PoWR http://jiscpowr.jiscinvolve.org/
- JISC PoWR Workshop 1: 27 June 2008 http://www.archive.org/details/JiscPowrWorkshop127June2008/
- Cyber Geography Research http://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/
- Clay Shirky's Writings About the Internet http://www.shirky.com/
- Creative Commons http://creativecommons.org/