Email Curation: Practical Approaches for Long-term Preservation
This workshop, organised by the Digital Curation Centre (DCC), brought together librarians, archivists and IT specialists from the academic, commercial and government sectors. Email is a near-universal communication tool, used both for assigning responsibilities and for making decisions. The people who use email have different perspectives and expectations from those who manage the infrastructure. While there is a common desire for preservation, no one solution fits all circumstances.
Day One: Emails as Records
Seamus Ross, DCC, chaired the first session. He began by introducing the chapter in the DCC Digital Curation Manual on curating email.
Maureen Pennock of the DCC, author of that chapter, talked about what an email actually is: it can contain the information itself, or it can act as the ‘envelope’ with the main message carried as an attachment. From a management perspective, an email carries much useful technical metadata in its message header. Yet curation is often done in an ad hoc way. It was pointed out that the presence of back-up tapes does not constitute an archive. The process of curation is one of active management through the information life cycle. The benefits of proper curation of emails include legal compliance, but also efficiencies in information management and the development of a trusted corpus of corporate knowledge. A curation policy framework for managing the life cycle of emails is essential. It allows information content to drive policy and encourages all stakeholders to collaborate in the management process at each stage of the life cycle. It is essential that preservation begins at the point of creation. Yet the long-term management of emails still represents a records management problem; the principles and issues are not that different from those of records in the analogue world.
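To illustrate the kind of technical metadata a message header carries, the following is a minimal sketch using Python's standard `email` package. It is not taken from the workshop or the DCC manual; the sample message, the addresses and the function name `extract_header_metadata` are hypothetical and chosen only for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; all names and addresses are invented.
RAW_MESSAGE = """\
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>
Cc: carol@example.org
Subject: Budget approval
Date: Mon, 24 Apr 2006 10:15:00 +0000
Message-ID: <0001@example.org>

Approved. See the attached spreadsheet.
"""


def extract_header_metadata(raw):
    """Return a small dictionary of header fields useful for record keeping."""
    msg = message_from_string(raw)
    # Combine To and Cc into one recipient list; a header may be absent.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "message_id": msg["Message-ID"],
        "sender": getaddresses([msg["From"]])[0][1],
        "recipients": [addr for _name, addr in recipients],
        "subject": msg["Subject"],
        # Normalise the RFC 5322 date into ISO 8601 for storage.
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
    }


if __name__ == "__main__":
    for field, value in extract_header_metadata(RAW_MESSAGE).items():
        print(f"{field}: {value}")
```

A sketch like this assumes the From and Date headers are present and well formed; real mailboxes would need handling for missing or malformed headers.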
Carys Thomas and Garry Booth from Loughborough University described a project to put email management strategies into place. The project found that both decisions and responsibilities were agreed via email. While important emails could be printed out and filed as records, staff wanted both access to their emails and context for them. Making staff responsible for the archiving task raises issues about the responsibilities and roles of individuals in the management of information, and about how institutions support them. Managing email can be time-consuming, and efficiency varies. Users want control over their email but also want the systems and processes to require little effort. The cost of separating ‘important’ emails from ‘trivial’ ones can be high. The buy-in of senior managers to the principles of email archiving is essential. Technological barriers are also difficult to overcome; an academic institution often has a complex environment with a diversity of applications and platforms used for email. For instance, how do users who work from home or with laptops synchronise and manage their emails? There is a need for user education in the importance and principles of archiving email. This can be handled with a risk management approach that balances the cost of archiving everything against the risk of litigation, its consequences and the cost of having to retrieve emails.
Susan Graham, University of Edinburgh, presented a case study looking at the records management approach to preserving email. People have strong feelings about their email: they want control over its content and do not want to be told how to use it. The University takes a devolved, pragmatic approach to managing email. The plan is to manage all records in a unified way, including email. It is essential to provide adequate resourcing for records management, and this has not always happened in the past. Email is an important means of communication and decision making for the University. The advent of the Data Protection Act and the Freedom of Information Act means that emails have to be discoverable. They also need to show evidential weight, i.e. authenticity. Staff also receive training in archival management and records management policies. The support of senior managers has been crucial, as has the allocation of adequate resources. The key lesson here is that there is no easy path to follow and significant effort may be required.
Day Two: Practical Tools and Approaches
Maureen Pennock of the DCC chaired the first session of the second day.
Jacqueline Slats, National Archive of the Netherlands, noted that if the Dutch government could use email as a tool for decision-making then the time had come to take email archiving seriously. In 2000 the National Archive of the Netherlands began a project to provide for the secure and authentic storage and management of email. The project tested three approaches to preservation and management: migration, XML and the Universal Virtual Computer (UVC). The project chose XML, in the belief that well-structured email is well suited to preservation using XML and that XML is a sustainable solution. XML is portable, extensible and supported by the World Wide Web Consortium (W3C). Structure, layout and context can all be retained in XML. The project estimated costs for the long-term management of email, including infrastructure costs such as the IT system but also personnel costs. Some of these are ‘up-front’ costs, some are one-off. Others, such as the cost of preservation intervention, will occur over time. However, any management intervention after the point of creation will be both more costly and more complex.
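As a rough illustration of why well-structured email maps readily onto XML, the sketch below serialises the headers and body of a simple message using Python's standard library. The element names (`email`, `headers`, `header`, `body`), the sample message and the function name `email_to_xml` are assumptions made for this example; they are not the schema or tooling used by the Dutch project.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Hypothetical sample message; all names and addresses are invented.
SAMPLE = """\
From: clerk@example.org
To: archive@example.org
Subject: Minutes
Date: Tue, 25 Apr 2006 09:00:00 +0000

The minutes are agreed.
"""


def email_to_xml(raw):
    """Serialise a single-part plain-text message as an XML string."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Attachments need their own handling and are out of scope here.
        raise ValueError("this sketch only handles single-part messages")

    root = ET.Element("email")
    headers = ET.SubElement(root, "headers")
    # Keep every header in its original order, with the name as an attribute.
    for name, value in msg.items():
        header = ET.SubElement(headers, "header", name=name)
        header.text = value

    body = ET.SubElement(root, "body")
    body.text = msg.get_payload()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(email_to_xml(SAMPLE))
```

Keeping each header as a named element preserves the structure of the original message, so the XML can be parsed again later without the mail client that created it.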
Filip Boudrez, Stadsarchief Antwerp, talked about the eDAVID Project and gave a practical overview. Again, the importance of email as a means of confirming or authorising action was noted. This means that authenticity and context are essential components with regard to preservation. The ‘casual’ use of language in an email means that without the context of the message ‘thread’ any one email can be meaningless. The project adopted a sound records management approach, associating all records into a single location under a common ‘series’. This approach also highlights the importance of metadata in linking individual objects into the ‘series’. The goal of the project has been to automate processes where possible and capture essential metadata while making the process of record keeping as easy as possible for users.
Jason Baron, NARA, reiterated the importance of email as a tool for both government and commercial organisations but noted that it was not always well managed. There are a number of commercial products available to archive email, but these may use proprietary formats. He made the observation that in the US legislative drivers are uppermost, while in the UK the key drivers seem to be IT-centred. He noted the importance of email to the Enron court case and the extremely high cost of recovering poorly archived email. Email may be an ephemeral medium, and the value of the information it carries may be equally short-lived. He talked very briefly of the Sedona guidelines for managing email. The preservation period for email may not necessarily be ‘for ever’. In some cases archiving may only require a ‘janitorial’ approach. Jason returned to his ‘Twenty Questions’ for a further discussion on managing email in the open session time.
Jeremy John, British Library (BL), talked about curating born-digital scientific manuscripts and the BL Digital Manuscript Project. He described in some detail the digital scriptorium that the project uses to retrieve and recover digital manuscript material from computer media such as portable disks or hard drives. It was emphasised that capture of data is different to preservation, though it is a prerequisite. While the technology is readily available, the skills required to ensure authenticity and the capture of a ‘faithful’ object are very specialised. He suggested that one risk to born-digital material was the use of projects to create collections, projects that do not have sustainable sources of funding. A preferred approach would be for organisations to commit to sustainable funding, demonstrated by resourcing each stage of the material's life cycle.
Reuse of Preserved Emails
Dave Thompson, Wellcome Library, chaired the second session.
Susan Davies, University of Maryland, illustrated ways in which emails could be reused and presented research by Adam Perer, Ben Shneiderman and Douglas Oard. The research illustrates the importance of email to individuals and the communities they represent. Email can illustrate the relationships between individuals, both social and organisational. The Shneiderman email archive was examined to illustrate these relationships. By analysis of the message headers, some four thousand relationships were discovered in forty-five thousand emails. The pattern of these relationships can be shown in relation to time, the increasing use of technology and communication within organisations. Email is well suited to this form of research because well-structured header information allows for detailed analysis. The patterns of communication can be usefully illustrated by visualisations. Patterns can reveal significant relationships and give a greater understanding of those relationships, as well as providing context to email archives.
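The idea of discovering relationships from message headers can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the method used in the research presented: a ‘relationship’ is taken here to be any unordered pair of a sender and a recipient, and the sample messages and the function name `count_relationships` are hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample messages; all addresses are invented.
SAMPLE_MESSAGES = [
    "From: a@example.org\nTo: b@example.org\n\nFirst message\n",
    "From: b@example.org\nTo: a@example.org\n\nA reply\n",
    "From: a@example.org\nTo: c@example.org\nCc: a@example.org\n\nAnother\n",
]


def addresses(msg, *header_names):
    """Collect lower-cased addresses from the named headers of a message."""
    values = []
    for name in header_names:
        values.extend(msg.get_all(name, []))
    return [addr.lower() for _name, addr in getaddresses(values) if addr]


def count_relationships(raw_messages):
    """Count messages exchanged between each unordered pair of correspondents."""
    pairs = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        for sender in addresses(msg, "From"):
            for recipient in addresses(msg, "To", "Cc"):
                if sender != recipient:  # ignore self-copies
                    pairs[tuple(sorted((sender, recipient)))] += 1
    return pairs


if __name__ == "__main__":
    for pair, count in count_relationships(SAMPLE_MESSAGES).most_common():
        print(pair, count)
```

Adding the Date header to each count would allow the same pairs to be plotted over time, which is the kind of pattern the visualisations described above rely on.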
Susan Thomas, University of Oxford, talked of the Paradigm Project and some of the barriers to re-using email over time. The project collects and researches issues around personal political papers, especially email. The essential nature of emails makes them of value to archival collections, as evidence of social networks and personal interactions. The project has highlighted legal and intellectual property issues in re-using email. The use of email especially highlights defamation and privacy issues. The donor of an email collection is not the only individual with an interest in how, when and under what circumstances material may be made available. Individual emails may need to reflect these issues, with access managed accordingly. Barriers to collecting email include technical issues around the number of email systems in use, and legal issues of privacy, defamation and intellectual property rights. Not all legislation hinders the collection of email; section 33 of the Data Protection Act sets out exceptions under which personal information can be made available for research. Conclusions to date suggest that early curation of email is a prerequisite to reuse.
Email is used as a means of recording action or assigning responsibility; currently many organisations do not adequately manage email for record-keeping purposes. A common theme is that back-up tapes do not constitute an archive, since it is difficult and costly to retrieve information from them. The risk of not managing email must be weighed against the cost of preservation and the risk of legal action. A distinction exists between the ‘discovery’ drivers for managing email in the US and the technology-driven approach in the UK and mainland Europe.
The main findings may be summarised as follows:
- Many organisations manage their IT infrastructure but not the information it handles/contains/delivers
- The burden of deciding which email is a ‘record’ and which is ‘trivial’ falls on users, who may be unclear about the distinction
- Users want to use email in their own way, and choose their own tools, but do not want additional archiving tasks placed on them
- The use of clear record-keeping policies, backed up by user education and with the support of senior management, is essential to efficient long-term management
- Retaining proper organisational records may be a legal requirement, but relying on retrieval from backup tapes is risky, expensive and difficult
- XML is currently an appropriate preservation tool for well-structured material like email
- Long-term curation of emails requires a risk management approach that balances the cost and effort of preservation and management against the legal aspects of storage and reuse
References
- The Digital Curation Centre Web site http://www.dcc.ac.uk/
- DCC Digital Curation Manual Instalments: Curating E-mails
- BBC News Web site: The rise and fall of Enron http://news.bbc.co.uk/1/hi/business/5018176.stm
- The Sedona Conference http://www.thesedonaconference.org
- Digitale Duurzaamheid - Digital Longevity (Nationaal Archief, Netherlands) http://www.digitaleduurzaamheid.nl
- Preserving Access to Digital Information (PADI) Email http://www.nla.gov.au/padi/topics/47.html