Sun's Preservation and Archiving Special Interest Group (PASIG) met in San Francisco in May. The event, the third PASIG meeting in the last year, drew around 180 participants from Australasia, Asia, Europe and North America to discuss a broad range of issues surrounding digital repositories. Presentations ranged from geographically or community-themed high-level perspectives on repository-related activity through to detailed technical analysis and reports of development activity at an institutional or project level. The diversity of presenters, representing Higher Education institutions, national libraries, repository communities, and organisations such as the Shoah Foundation, gave a fair indication of how far preservation issues have moved away from being a purely 'library' concern. Michael Keller, University Librarian and Director of Academic Information Resources at Stanford University, and one of the founders of PASIG, summarises the PASIG purpose thus:
'The thrust of the PASIG written large is the development, elaboration, and operation of digital archives for preservation and access at individual institutions, reducing dependence upon so-called third parties and increasing the scope and rate of gathering and protecting significant digital objects for the long term by virtue of multiple, simultaneous, and perhaps even coordinated efforts.'
Although the conference touched on issues surrounding open access, this was not its primary focus; other fora exist where these issues are discussed. The focus of the event was very much on networking and sharing best practice. Key topics covered included tiered storage, data management and digital asset management (DAM), open storage, data curation, immersive technology, repositories and federated archives, and Web 2.0 services. As the discussion unfolded, clear themes emerged around software and hardware architectures to support preservation, and their relationship both to the enterprise as a whole (particularly, but not exclusively, in Higher Education) and to emerging national and international preservation infrastructure. Highlights of a limited selection of the PASIG presentations follow; the complete programme and presentations are available online.
High-level Trends: Cliff Lynch (Coalition for Networked Information)
Cliff Lynch's high-level trends presentation elaborated key technological, economic and social/policy questions, issues and challenges involved in broadening support for preservation activities.
Cliff's survey began with issues surrounding storage. Although still focussed primarily on cost, he argued, ideas around storage were undergoing rapid evolution. A range of technical issues surrounding storage had been largely solved; in the past, discussions about storage used to be obsessed by reliability. It was now recognised that grid storage solutions guarded against both point failures and geographical disasters, although we still lack a rigorous sense of 'how much distribution is enough'. It was also the case, he argued, that we had no sense of the impact of geographical boundaries, and how the cloud overlays those boundaries. Sensitive data cannot casually cross those boundaries, and, in this context, for some jurisdictions, encryption is not enough.
Cloud storage solutions are not primarily driven by preservation requirements, but by the need for scalable information technology solutions to support day-to-day operations. This has an impact on the type of service level agreement (SLA) available, and on those that might be required for preservation purposes. Some have looked at the levels of service available from major providers and wondered how they might be integrated into a preservation strategy. They struggle with a pragmatic set of behaviours that appear to be robust, and assurances that appear sound but contain expressions which essentially add 'our lawyers said to tell you, you might lose it all tomorrow'. Engineering this into a preservation solution with a defined set of properties, Cliff added, was 'really hard'. He posed two questions to the audience: do we give up on service level agreements in an age of cloud computing, or will we see another generation of cloud services that come with the type of SLA we might like to see?
Whilst noting that bandwidth was unlikely to be a problem for research libraries embedded in major universities, Cliff observed that it was worth remembering, as cloud solutions scaled out, that there is an implicit connection between the cloud and available network bandwidth.
Moving to the political and social aspects of preservation, Cliff indicated that those involved in preservation activities, such as LOCKSS and Portico, had correctly emphasised the need to work at significant scale with content producers and publishers. This, he commented, was not a 'hard sell' in the abstract; publishers need a credible story to tell about preservation. Preservation, however, is not solely concerned with the scale of collections, but also with their nature. In the past, great research collections had not been built simply by collecting at scale, but by collecting the personal and organisational papers of key figures. This aspect, Cliff argued, should be taken forward in an era of changing personal behaviour, where more and more of our personal lives have moved to the network. This brings a new set of problems, both technical and ethical, when we decide we want to acquire this material as part of collective and social history. It was interesting to note, he added, a recent discussion in special collections circles around the ethics of obtaining a series of obsolete laptops and turning graduate students loose with modern forensic tools to analyse the un-erased space on their hard disks.
In skimming these spaces, future biographers might seek information in Flickr, Amazon purchasing histories or blogs. Cliff also noted the considerable confusion in the mind of the public about these services. To take the example of Flickr, he posited that although it exists as a sharing service, people absolutely believed it to be a preservation service, and might not even retain their own copies of photographs. This type of behaviour should be taken into account when we seek to preserve a digital record.
Cliff closed by commenting that he observed a stronger conversation regarding reuse in the preservation community, which he regarded as a positive sign. This was particularly the case in the world of research data, and moved the discussion from abstract to very concrete ground. It was still remarkably easy to fall into old models of physical access to archives: 'I've collected this box of material and I'm going to spend a week looking through it'. The future, by contrast, was much more likely to see a researcher deploy 'significant computational resource on an entire archive as a way of refining something out of it'.
The Stanford Digital Repository: Tom Cramer (Stanford University)
The Stanford Digital Repository Project, managed by Stanford Libraries, made an early identification of three broad categories of data requiring preservation:
- Library digital content
- Institutionally generated digital content, including research data, learning objects, and other key institutional information
- External deposited content
The project's main objective was to support long-term preservation of digital content through a secure, sustainable trusted system. The initial project focus on library content provided an exemplar area of work under the direct control of the project sponsor. A number of principles were established at an early stage:
- Any system would be largely hidden from end-users, with interfaces facilitating access embedded as close to the user workflow as is practicable.
- The system would be as simple and as modular as possible, facilitating flexibility and adaptation to new circumstances where necessary.
- Multiple copies would be held on multiple, geographically dispersed media.
The Stanford Digital Repository will serve as a common preservation infrastructure. Stanford's experience leads them to believe that it is impractical to have a 'single' repository that is optimised for both preservation and access.
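The principle of holding multiple copies in multiple places implies some routine way of checking that the copies still agree. The following is a minimal sketch of one such check, not a description of Stanford's implementation: the site names, the use of SHA-256, and the majority-vote rule are all assumptions made for illustration, and ties between digests are not handled.

```python
import hashlib
from collections import Counter
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a copy's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: Dict[str, bytes]) -> Dict[str, bool]:
    """Map each (hypothetical) site name to True if its copy matches
    the digest held by the largest number of sites."""
    digests = {site: sha256_hex(data) for site, data in copies.items()}
    # Assumption: the most common digest is treated as the reference copy.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return {site: digest == majority_digest for site, digest in digests.items()}


# Illustrative data only: three copies of one object, one of them damaged.
copies = {
    "site_a": b"e-thesis bytes",
    "site_b": b"e-thesis bytes",
    "site_c": b"e-thesis bytez",
}
print(audit_copies(copies))
```

In this example the copy at `site_c` is flagged as not matching the other two. A production system would read from real storage and record the audit results, which this sketch does not attempt.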
Oxford Digital Asset Management System (DAMS): Neil Jeffries (University of Oxford)
Oxford's DAMS currently captures e-theses, e-prints and working/conference papers with a view to expanding this in the near future to digitised books, electronic ephemera and manuscripts. DAMS is best described not as being a single system, but a toolset built out from a Sun Honeycomb/ST5800 hardware infrastructure, Fedora-based repository services, with Mass Digitisation Ingest Components (MDICS) providing ingest services from the output of the Google Library Project.
Neil noted the significance of the work of JISC as a key driver and shaper of the landscape in the United Kingdom. Oxford's key concerns are systems integration beyond 'the repository', maintaining a diverse toolset to support preservation, and expanding the scope of repository activity over time to include new content types.
National Digital Heritage Archive: Graham Coe (National Library of New Zealand)
Graham presented an ambitious project to collect, make accessible and preserve New Zealand's digital heritage. The decision was taken early in the project to purchase a commercial repository solution, as in-house development or open source was viewed as carrying higher risk. The National Library of New Zealand has partnered with the vendor to develop the product further, and has worked closely with a peer review group made up of international Higher Education institutions to this end.
The programme took a phased approach to implementation; the first phase of the project will go live late in 2008. The Digital Preservation System will eventually integrate with a wide variety of existing systems, including those providing collection management, reporting, and resource discovery and delivery services.
Storage and Data Management Practices for the Long Term: Raymond A Clarke (Enterprise Storage Specialist, Sun Microsystems)
Legal compliance, security, and the needs of both business and research, together with requirements to keep increasing quantities of digital data accessible and discoverable, are creating massive complexity for a range of organisations. Raymond Clarke argued for a holistic approach to supporting the long-term retention and preservation of digital information; conventional strategies simply did not scale in the face of the data and information deluge.
Raymond gave examples of best practice for retention and preservation:
- Achieve early and inclusive stakeholder consensus on classification.
- Establish retention periods on information and delete 'expired' information. Free up space – only store what is required.
- Set policies for audits – who has used the information? Establish when, where and how to ensure integrity.
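The practices listed above (agree a classification, attach a retention period to each class, and remove what has expired) can be sketched in a few lines. This is an illustrative sketch only: the classification names, the retention periods and the `Record` type are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple

# Hypothetical retention periods, in days, keyed by an agreed classification.
RETENTION_DAYS = {
    "transient": 90,
    "financial": 7 * 365,
    "research-data": 10 * 365,
}


@dataclass(frozen=True)
class Record:
    identifier: str
    classification: str
    created: date


def is_expired(record: Record, today: date) -> bool:
    """A record expires once its class's retention period has fully elapsed."""
    period = timedelta(days=RETENTION_DAYS[record.classification])
    return today > record.created + period


def partition(records: List[Record], today: date) -> Tuple[List[Record], List[Record]]:
    """Split records into (keep, expired) as of the given date."""
    keep = [r for r in records if not is_expired(r, today)]
    expired = [r for r in records if is_expired(r, today)]
    return keep, expired


# Illustrative data only.
records = [
    Record("log-001", "transient", date(2008, 1, 1)),
    Record("dataset-042", "research-data", date(2008, 1, 1)),
]
keep, expired = partition(records, today=date(2008, 5, 1))
print([r.identifier for r in keep], [r.identifier for r in expired])
```

Here the transient log has passed its 90-day period and is returned as expired, while the research dataset is kept. Deletion itself, and the audit trail of who accessed what, are deliberately left out of the sketch.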
It would be wrong to think PASIG is an extremely formal event, driven entirely by one-to-many presentations. PASIG meetings tend to be practically oriented, and centre on networking opportunities, with space to drill down into issues with others who face similar practical problems. In addition to the main conference sessions, five working groups met several times during the PASIG meeting:
- Long-Term Storage and Data Migration 
- Preservation 
- Enterprise Repositories and Federated Archives 
- ST5800-Based Architectures 
- Research Data Curation 
PASIG is a cross-community event, although, naturally enough for an event organised primarily by Sun Microsystems, there is a specific focus throughout on Sun hardware, software, and services. The three principal open-source repository framework communities (DSpace, Fedora and ePrints) were represented, and contributed high-level perspectives on their roadmaps to the conference. A lunchtime meeting, called by the Executive Directors of Fedora Commons and the DSpace Foundation, discussed the potential for collaboration between the communities in the light of what appeared to be shared objectives and understanding. There was significant support for further dialogue and practical collaboration. Sandy Payette of Fedora Commons and Michelle Kimpton of DSpace followed the meeting with a joint statement outlining the approach both communities would follow in the coming months. Developers from both communities have already met to discuss the shape of this collaboration in more detail.
If one had to select central themes emerging from the May PASIG meeting, one such theme would almost certainly be best captured by the title of the presentation by Stephen Abrams and John Kunze of the California Digital Library: 'Preservation is not a location'. Stephen and John, amongst others, advanced the perspective of preservation as a series of distributed services rather than a 'designated repository'. The success of this approach is critically dependent, however, on the systems and environment at the level of the academic library, institution, or consortium (whether national or otherwise) being hospitable to the integration of those preservation services. There are strong signals, in the work of the JISC, the NSF and the Australian National Data Service, that such thinking is being factored into national strategic approaches. It seems less certain that this is currently occurring at anything other than a small minority of Higher Education institutions. This speaks to a need to broaden dialogue, and to ensure that preservation and other issues conventionally assigned to 'library' or 'repository space' are adequately integrated into emergent thinking around enterprise architectures for Higher Education. It also clearly indicates the need to include a considerable range of software producers, both proprietary and open source, in that dialogue.
In the course of a year, PASIG has become a valuable forum for those concerned with how these infrastructure pieces might fit together at the academic library, enterprise and national level, and what policies should be associated with them. PASIG is the result of close collaboration between Stanford University and Sun Microsystems. Michael Keller of Stanford University and Art Pasquinelli of Sun Microsystems, the principal organisers, should be congratulated for the openness, energy and vision they have brought to the organisation of the event, and the PASIG itself.
- May 2008 PASIG presentations are available from http://events-at-sun.com/pasig_spring/presentations/
- Long-Term Storage and Data Migration http://www.sun-pasig.org/wiki/index.php?title=PASIG_Long-Term_Storage_and_Data_Migration_Working_Group
- Preservation http://www.sun-pasig.org/wiki/index.php?title=PASIG_Preservation_Working_Group
- Enterprise Repositories and Federated Archives http://www.sun-pasig.org/wiki/index.php?title=PASIG_Enterprise_Repositories_and_Federated_Archives_Working_Group
- ST5800-Based Architectures http://www.sun-pasig.org/wiki/index.php?title=PASIG_ST5800-Based_Architectures_Working_Group
- Research Data Curation http://www.sun-pasig.org/wiki/index.php?title=PASIG_Research_Data_Curation_Working_Group
- DSpace/Fedora initial discussions on joint collaboration http://mailman.mit.edu/pipermail/dspace-general/2008-June/002034.html