Preservation and Archiving Special Interest Group (PASIG) Fall Meeting
I had managed to miss the previous two PASIG (Preservation and Archiving Special Interest Group) meetings, so I was delighted finally to be able to attend the Fall meeting. Conveniently, the event was arranged to follow immediately after the SPARC Digital Repositories meeting, which was also held in Baltimore and which I also attended.
PASIG is a group sponsored by and centred on Sun Microsystems (Sun) which is a prominent vendor of data storage hardware and which is building a new business around systems to support digital preservation and archiving.
The event was held in the brand-new Baltimore Hilton which made for very plush surroundings and a very comfortable stay for those of us who had booked a room in the same hotel. There were 126 delegates listed, but I estimate that rather fewer actually made an appearance - although as people came and went it was difficult to be sure about numbers.
Rather unfortunately, a day or two before this meeting Sun had announced an intention to lay off 18% of its workforce. This was mentioned several times in conversations over coffee with other delegates who were concerned about Sun's viability as a business. In addition, Sun had until recently sold a hardware/software package for a high-performance archiving solution called 'ST 5800 Honeycomb'. At the meeting it was announced that this product was being 'end-of-lined'. A significant proportion of the delegates had purchased Honeycomb and Sun staff felt obliged to apologise to them at the meeting. Within this somewhat inauspicious context, Mike Keller (University Librarian, Director of Academic Information Resources, Stanford University) gave an introduction to the meeting, suggesting that we should not allow 'socio-political issues' to dominate the discussion and that 'in these troubled times we need to concentrate on the technical'.
Continuing in this vein, Ernie Ingles (Vice-Provost and Chief Librarian, University of Alberta) regretted the 'bad news about Sun', declaring that 'he wanted to support this great company', and 'that all would be well'. Ernie emphasised that we should 'take a longer view' - a reasonable position, perhaps, for a group focussed on digital preservation. He also suggested that '400 years from now, history will look at PASIG and the wonderful work it did to advance the state of digital preservation'.
With so much interesting material to discuss, the PASIG delegates went along with the suggestion to 'concentrate on the technical' and the wider context of the troubled economic climate was put to one side for the most part.
The format of the next two-and-a-half days was an intensive series of (mostly short) presentations from a wide variety of speakers. It is not possible to describe all of the presentations, and I have had to be brief about those I have described. However, the slides for most of the presentations are available for further study, and where this is the case I have given URLs in the references.
19 November 2008: Morning Session
This session consisted of a few presentations from speakers concerned with looking at significant trends in preservation and archiving, including the following:
- James Simon, An Open Preservation and Archiving Architecture (claimed to support 'most if not all of the ST 5800 features and more') 
Major Trends Overview
Martha Anderson, Director of Program Management, NDIIPP, U.S. Library of Congress
Martha emphasised the need to preserve 'practice' as well as data, arguing that even if tools and technology change, the thinking behind them should be preserved. She characterised the Library of Congress's approach to preservation as one which considers different perspectives, looking at 'today, tomorrow and forever'. Intriguingly, she invoked Philip K. Dick's The Preserving Machine as required reading for people involved in digital preservation! She presented a slide which I thought was particularly useful - number 7 in her presentation - which details how NDIIPP categorises content into domains.
Storage Technology Overview
Chris Wood, Storage CTO, Sun Microsystems
Chris gave a comprehensive overview of the current state of data storage technology, with some extrapolations indicating likely future trends. If you are specifically interested in this sort of thing then Chris's presentation should be a useful resource - the last half-dozen slides outline some emerging technologies in storage. Chris went on to describe some early plans for a new 'T10 object-based' storage system - a system which is completely abstracted from hardware, the details of which were, unfortunately, subject to a non-disclosure arrangement at the time. Suffice it to say that while the appeal of the recently deprecated Honeycomb system was due to the fact that it bundled dedicated hardware and software, this meant that the hardware and software were, to a degree, mutually dependent. The appeal of the new approach is the fact that the software will be able to run over a variety of hardware.
Chris also presented some strong views about functions which he believes are not appropriate to an archiving system, asserting that 'de-duplication is not a function of archiving. Neither is lossy compression, or transcoding'.
Much of his presentation was concerned with the pros and cons of various media in use today (disk, tape, optical, solid-state). He predicts an increase in the use of solid-state or 'flash' storage, but also that tape and disk will remain viable for some time to come. Tape endures as a viable storage medium for many applications as it is still relatively cheap both to manufacture and to operate (in terms of power consumption). Blu-Ray optical storage is also a good bet for the medium term according to Chris. He pointed out the effect of wide consumer-device adoption driving down the price of technologies, citing the example of solid-state memory in particular.
One very interesting point Chris made concerned the relative costs of procuring and running hardware. He predicts that somewhere between 2010 and 2015, the capital outlay for equipment (servers, disk arrays etc.) will be overtaken by the running costs of keeping a typical item of such equipment both powered up and cooled down.
19 November 2008: Afternoon Session
This session was predominantly concerned with preservation, and we were treated to a series of mostly short presentations reporting on projects in this area, including the following:
- Sandy Payette, Fedora Commons (some interesting ideas on the 'emergence of infrastructure'. I noted with interest that with version 3.1, Fedora has introduced support for SPARQL) 
- Mark Evans, Planets (a description of a data model to underpin a distributed system for preservation) 
- David Tarrant, Preserv2 (an interesting look at preservation as a process within a repository system)
- Carl Grant, Ex Libris, The Digital Preservation System and the Open-Platform 
- Brad McLean, DSpace 2.0 (an overview of a new architecture which is influenced by 'ORE, JCR, FRBR, RDF and Fedora') 
- Sayeed Choudhury, Blue Ribbon Task Force on Sustainable Digital Preservation and Access (an update on the progress of this group) 
- Chris Awre, REMAP and RepoMMan (an overview of two projects concerned with systematically embedding preservation into scholarly workflows in an institutional context) 
Architectural Issues in Preservation
Kenneth Thibodeau, Director, Electronic Records Archives Program, NHE, U.S. National Archives and Records Administration
I thought Kenneth was particularly interesting, talking mostly about arrangements for archiving US Government records, including email messages sent to and from the White House. He described some considerable preservation challenges, with a slide illustrating remarkable growth in the quantity of records kept in the White House. Kenneth explained how his administration was anticipating a deluge of new material with the transition of power from President Bush to President-elect Obama. He explained the primary differences between 'documents' and 'records', offering the distinction that a document can be a stand-alone or atomic artefact, while a record is generally part of an (ordered) collection.
Kenneth also had some interesting things to say about the importance of perception in preservation explaining that, in a context of changing business requirements and technological obsolescence, expectations and perceptions can also change. He also illustrated the difficulty of designing for the future, pointing out that, 'it is only when a bridge collapses that you actually recognise what it was you needed to know when you designed it'.
Overview of Repository Needs and Directions
Tyler Walters, Georgia Institute of Technology
Tyler gave a nice overview of progress under three themes: 'exchange' (harvesting & interoperability), 'infrastructure' and 'synergies'. Under the 'exchange' theme, he was especially enthusiastic about ORE (Object Reuse and Exchange), asserting that 'metadata is important, but source content is what it's all about!' SWORD (Simple Web-service Offering Repository Deposit) also got a mention here, notably in its role of facilitating interoperability between OJS (Open Journal System) and Fedora and DSpace. Tyler also described an opportunity for LOCKSS to work with harvesting technologies and SWORD to provide 'distributed preservation' via private LOCKSS networks. This would require the development of extensions to repository software to support automatic harvesting and distribution of content, with SWORD providing the mechanism for what he termed 'crash recovery' of data.
Moving on to infrastructure, Tyler briefly outlined a number of initiatives in the area of format management, before addressing issues around storage. Describing a 'tiered storage layer', Tyler emphasised the importance of 'abstracting the storage layer from the content-organising features of the repository'. Not for the first time at the meeting, the hot topic of 'cloud-storage' got a mention, and Tyler pointed to the eScience project CARMEN as an example of a repository using cloud-storage.
Under the theme of 'synergies', Tyler mostly concentrated on what he called 'convergence', citing the collaboration between DSpace and Fedora in particular. He picked out examples of collaborative development in areas such as the integration of repositories with authoring tools and shared storage, showing how they are leading to a shared vision of common, low-level infrastructure supporting 'modularised functionality'.
The day ended with an excellent dinner at the Hilton. The waiting staff were clearly very newly trained, and the meal was punctuated on one occasion by what sounded like a tray of dishes being dropped down an escalator, but this did not at all detract from a very pleasant conclusion to day one.
20 November 2008: Morning Session
The morning of day two was devoted to digital curation, in response to feedback from delegates at previous PASIG meetings. This session consisted of a rapid succession of briefings from a wide range of parties, including:
- Patrick McGrath, The UC Berkeley Mediavault Program (a Mellon-funded museum collection management system called 'Collectionspace' which is designed according to SOA (Service Oriented Architectural) principles) 
- Bob Rogers, SNIA Data Preservation and Metadata Projects (an overview of tools for classification and policy management, and an introduction to SIRF (Self Contained Information Retention Format))
- Helen Tibbo, DigCCurr I & II: Lessons Learned from Building a Digital Curation Curriculum (an introduction to the DigCCurr project in which digital curation was characterised as being about 'maintaining context over time - essential for reuse' as well as being a 'young and evolving field')
Research Data Curation Trends
Lucy Nowell, Program Director, Office of CyberInfrastructure, U.S. National Science Foundation
Lucy gave a brief, very high-level strategic and political view of data curation in the US. She pointed out that the NSF has had a long-standing policy which has just not been enforced. There is now a new resolve to enforce it. The US Congress has made it known that it expects open access to publicly funded data. The NSF has a CyberInfrastructure vision for 21st century discovery, which is focussed on 'community-based knowledge representations'.
Contouring Curation for Disciplinary Difference and the Needs of Small Science
Carole Palmer, University of Illinois, Urbana Champaign
Carole talked about profiling complexities and differences both within and among disciplines in 'small science'. She showed a slide contrasting crystallography (CIFs) with data structures/formats from geobiology. Interestingly, she has also been analysing the discourse of the scientists themselves in interviews, and briefly showed some 'interview word-clouds'.
High-Level Storage and Data Management Trends
Raymond Clarke, Enterprise Storage Specialist, Sun Microsystems, SNIA Technical Board Member
Raymond began by explaining that backing up and archiving data was a pressing issue because the 'history of data growth is exponential'. Amusingly, he illustrated this by contrasting the size of various pieces of important text: Pythagoras' Theorem (24 words), the Gettysburg Address (286 words), the EU regulations on the sale of cabbages (26,911 words).
Focussing on issues and challenges related to storage hardware, Raymond characterised such hardware as tending to become obsolete within 2-5 years and maintaining backwards compatibility for, remarkably, only 'n-1 years'. He claimed that we should treat hardware migration as inevitable and, consequently, that we ought to plan for it.
Raymond offered a table (slide 7 in his presentation) contrasting the characteristics of systems supporting digital archiving and data protection (or 'backup'). Finally, I found the slide on the 'Demands of a New Archive Reality' (slide 8) particularly interesting, wherein Raymond suggests that a new dimension has been added to the problem of archiving massive datasets - that of how to 'search petabytes of data from the edge'.
20 November 2008: Afternoon Session
Day two concluded with a series of presentations concerned with the use of Sun technologies in architectures to support repositories, preservation and archiving. The presentations included:
- Reagan Moore, Building a Reference Implementation of a Preservation Environment (a description of a 'starter-kit for a preservation environment' which, among other things, introduces the notion of a 'chain of custody')
Infinite Archive System
Keith Rajecki, Education Solutions Architect, Sun Microsystems, Inc. & Judy Leach, Storage Solutions Architect, Sun Microsystems, Inc.
The latest offering in the area of integrated storage solutions from Sun Microsystems was outlined in this clear presentation. Describing what was called an 'intelligent tiered archive', this new product offers a hardware stack of different storage media in one 'factory-configured' system. The attraction of this approach is that a single integrated system can offer different storage solutions to meet different requirements.
As there had been plenty of discussion at PASIG about the variable costs and efficiencies of the very many options available for data storage, such an integrated yet flexible solution seemed attractive. Although I could not stay for the extended meeting at which Sun answered questions about this new product, I noticed that many of the delegates did, suggesting that Sun might have hit on an approach which is attractive, at least, to the PASIG delegates.
Gary Wright, Digital Preservation Product Manager, The Church of Jesus Christ of Latter-Day Saints and Randy Stokes, Principal Engineer, FamilySearch, The Church of Jesus Christ of Latter-Day Saints
Gary and Randy spoke about 'preserving the heritage of mankind'. Specifically, they described FamilySearch - the world's largest genealogy database, which was begun 87 years ago and now contains 3.5 billion images on microfilm. They have recorded 10 billion names which they estimate is '10% of everyone who has ever lived'.
FamilySearch is an ongoing digitisation project creating a million new records every week. They have done much to streamline this process, including 'crowd-sourcing' the transcription service: for example, a scanned image of a hand-written birth-certificate is shown to two remote volunteers - if they agree on the transcription then this is entered into the record, otherwise it is shown to a third volunteer and so on until a consensus is reached. The FamilySearch project team aspires to create an infrastructure which allows it to scan and process a billion images every year.
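The consensus rule described above can be sketched in a few lines of Python. This is only an illustration and not FamilySearch's actual implementation: the function names, the whitespace and case normalisation, and the interpretation of 'consensus' as any two volunteers producing the same transcription are all my own assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def normalise(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not block agreement (assumed rule)."""
    return " ".join(text.split()).casefold()


def consensus_transcription(
    transcriptions: Iterable[str], required_agreement: int = 2
) -> Optional[str]:
    """Consume volunteer transcriptions one at a time until enough of them agree.

    Returns the first-seen spelling of the agreed transcription, or None if
    the volunteers are exhausted before a consensus is reached.
    """
    counts: Counter = Counter()
    first_seen: Dict[str, str] = {}
    for text in transcriptions:
        key = normalise(text)
        counts[key] += 1
        first_seen.setdefault(key, text)
        if counts[key] >= required_agreement:
            # Stop here: no further volunteers need to be asked.
            return first_seen[key]
    return None


if __name__ == "__main__":
    # Hypothetical volunteers: the first two disagree, so a third is consulted.
    print(consensus_transcription(["Jon Smith", "John Smith", "John Smith"]))
```

Because the transcriptions are consumed lazily, an image is only shown to a further volunteer when the earlier ones have failed to agree, which matches the 'and so on' in the description.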
The second day culminated in a drinks reception and meal in Baltimore's Museum of American Sport, conveniently situated next to the hotel. As ever, the catering was excellent - and we were able to wander the museum, drink in hand, attempting to make sense of what passes for sport in the United States.
21 November 2008: Morning Session
The third and final day of the PASIG meeting was actually a half-day, with a few presentations culminating in a summary keynote from Clifford Lynch. Presentations included:
- David Gewirtz, Considerations in Implementing a Permanent Access Solution 
- Thomas Ledoux, Implementing the SPAR architecture (a description of the Système de Préservation et d'Archivage Réparti, based on an RDF platform called Virtuoso) 
Final Summary Keynote: How the PASIG fits into the Global Project Landscape
Clifford Lynch, Executive Director CNI
Clifford delivered a characteristically thoughtful summing up, noting that PASIG was focussed on technology rather than policy. He pointed to an interesting opportunity in the area of scientific/data archiving. Most such data are generated by apparatus, are then fed into an intervening 'mystery' system, after which they are archived. Clifford wondered if we should get closer to the apparatus, providing short-term storage as a compromise, noting that we should be open to such compromises.
Clifford suggested that there had been more discussion about federation this year, concluding that economies of scale were attractive in the present economic climate. He noted that cost-saving strategies were preoccupying more people, citing the evidence of facilities moving large data centres to places where electricity is cheap and plentiful. He also pointed out that ingest is still expensive and that what he termed 'money-pits' abound.
Most interesting to me was the way in which Clifford picked up on and was critical of much of the talk about 'cloud' storage and computing, declaring that there had been a lot of 'scary hand-waving about clouds'. He suggested that one problem with clouds is that they are opaque, leading to the sort of service level agreement (SLA) which might as well just say 'Trust us!'. He noted that Amazon's feted S3 service has the potential for severe performance problems in data preservation scenarios, and that there has been no systematic study of error/data loss from such services. He called for more empirical testing. He also outlined some examples of the sort of parameters a cloud-based preservation system would need to offer, including: the number of copies to be kept and geographical constraints (positive and negative), such as how many continents to store the data on, or even which countries to avoid using.
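To make the parameters Clifford listed a little more concrete, here is a minimal Python sketch of how such a policy might be written down and checked against the places where copies are actually held. It does not describe any real cloud service or its API: the class name, the field names and the country codes are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# A copy's location as (country code, continent); both are illustrative labels.
Location = Tuple[str, str]


@dataclass(frozen=True)
class PreservationPolicy:
    """Hypothetical policy covering copy count and geographical constraints."""
    min_copies: int
    min_continents: int = 1
    required_countries: FrozenSet[str] = frozenset()  # positive constraint
    excluded_countries: FrozenSet[str] = frozenset()  # negative constraint


def policy_violations(policy: PreservationPolicy, copies: List[Location]) -> List[str]:
    """Return a human-readable list of the ways the copies fail the policy."""
    problems: List[str] = []
    countries = {country for country, _ in copies}
    continents = {continent for _, continent in copies}
    if len(copies) < policy.min_copies:
        problems.append(f"only {len(copies)} copies, need {policy.min_copies}")
    if len(continents) < policy.min_continents:
        problems.append(f"only {len(continents)} continents, need {policy.min_continents}")
    missing = policy.required_countries - countries
    if missing:
        problems.append("no copy held in: " + ", ".join(sorted(missing)))
    banned = policy.excluded_countries & countries
    if banned:
        problems.append("copy held in excluded country: " + ", ".join(sorted(banned)))
    return problems


if __name__ == "__main__":
    policy = PreservationPolicy(
        min_copies=3,
        min_continents=2,
        required_countries=frozenset({"GB"}),
        excluded_countries=frozenset({"XX"}),  # placeholder country code
    )
    held = [("GB", "Europe"), ("XX", "Europe")]
    for problem in policy_violations(policy, held):
        print(problem)
```

A check like this only confirms where copies are claimed to be; it says nothing about the error or loss rates for which Clifford wanted empirical evidence.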
Amusingly, he asked how one would write an SLA for such a storage service, offering as an example: 'We promise to lose no more than n bits in a year.....', with a fictitious customer response of: 'We just lost most of the history of physics, but we'll be getting free service for the next year as compensation'. With such storage services, quantified risk management doesn't work!
Clifford sounded another note of warning, suggesting that we need to consider the provenance of data more than we have done so far. He implied that this became progressively more difficult with the increasing complexity of the architectures being developed to support preservation. He singled out the particular difficulty of maintaining the application of licensing to data as they moved from one system to another.
Other Meetings and Discussions
It is worth noting that there was plenty of opportunity for networking, and a number of ad hoc or loosely planned meetings occurred throughout the two-and-a-half-day event. I made a number of useful contacts and enjoyed many engaging conversations.
The hospitality arranged by Sun was very good and contributed greatly, I think, to the success of the event. At times the event was almost overwhelming, with a very dense programme of speakers, but with copious notes and well-organised access to the presentation materials, I have been able to follow up lines of enquiry after the event.
It did occur to me at the time that, while it is reasonable and appropriate for PASIG to concentrate on the technical at the expense of, as Mike Keller put it, the 'socio-political', we could not be blind to the fact that the current period of significant, global economic upheaval will have an effect on digital preservation. Preservation can be an expensive endeavour, and budgets are going to be tighter than ever before. Nevertheless, from what I saw and heard, technical innovation in this area continues apace. I, or someone from UKOLN, will certainly go to future PASIG events should they continue in this vein.
- Web site for PASIG
- Web site of the SPARC Digital Repositories Meeting 2008
- James Simon's presentation
- Martha Anderson's presentation
- Chris Wood's presentation
- Sandy Payette's presentation
- Mark Evans' presentation
- David Tarrant's presentation
- Carl Grant's presentation
- Brad McLean's presentation
- Sayeed Choudhury's presentation
- Chris Awre's presentation
- Kenneth Thibodeau's presentation
- Tyler Walters' presentation
- Patrick McGrath's presentation
- Bob Rogers' presentation
- Helen Tibbo's presentation
- Carole Palmer's presentation
- Raymond Clarke's presentation
- Reagan Moore's presentation
- Keith Rajecki & Judy Leach's presentation
- Gary Wright & Randy Stokes's presentations
- David Gewirtz's presentation
- Thomas Ledoux's presentation