2nd International DCC Conference 2006: Digital Data Curation in Practice
The International Digital Curation Conference is held annually by the Digital Curation Centre to bring together researchers in the field and promote discussion of policy and strategy. The second conference in this series, with the theme 'digital data curation in practice', was held on 21-22 November 2006 in Glasgow.
Opening Keynote Address
Hans Hoffman of CERN gave the opening keynote address. The e-Science 'revolution' is being both pushed by advances in technology and pulled by demands from researchers. In order to respond to these pressures, an infrastructure is being set up that encompasses large databases or digital libraries, software, high-speed network connectivity, computation, instrumentation, and (importantly) people and training.
In the particular example of particle physics, research takes place in a global network of 7000 scientists; the annual turnover of staff is in the region of 1000 (mainly junior) scientists. This turnover raises challenges for effective information exchange, and means that procedures need to be in place to ensure experience and information are passed on to each new intake of scientists.
The data collected by instruments such as the Large Hadron Collider are useless if the configuration of the instruments is not known. Therefore, CERN has an Engineering Data Management System for life cycle and configuration management, maintaining access to the drawings, 3D models, and associated documentation for the instruments, along with test results, experimental parameters, and change histories. Not only does the physical set-up need to be recorded, but also the data workflow: the raw data needs to be heavily processed before it can be used, but unless these processes are well documented the final data may not be reusable by others.
Hans also outlined the importance of open access to scientific research outputs, and set out the ingredients for a successful virtual organisation or collaboratory. These included: a common goal, formalised in some way; clear definition of the rights, duties and deliverables associated with each partner; completely open sharing of information, data technologies, and insights between partners; quality assurance of deliverables, schedules and budgets; and enough partners to be able to tackle the common goal effectively.
Disciplines and Data
Sayeed Choudhury of Johns Hopkins University presented a paper on Digital Data Preservation in Astronomy. He described a prototype digital preservation facility being developed by the University that would take care of the entire process of submitting scholarly astronomy papers. The principle is that the Library would host the data sets, imagery, spectra, etc. underlying a paper and make them available with appropriate metadata and protocols to Virtual Observatories; it would also pass on the submitted paper to the publishers (the American Astronomical Society, the Astrophysical Journal, the Astronomical Journal, and the University of Chicago Press are co-operating in this venture), who on publishing the paper would make reference back to the underlying data held by the Library. This relieves publishers of the data archiving burden and makes the data more readily accessible for reuse.
Melissa Cragin of the University of Illinois at Urbana-Champaign presented a case study of a developing Neuroscience data repository. The study focused on interviews with nine depositors, eleven potential reusers of the data, three of the developers of the repository and two consultants. The repository was seen as a useful way of gathering project data together in one place, making it easier to cross-analyse or mine the data. It was also seen as a useful medium for scholarly communication, both informally between colleagues and formally as part of the publication process, as a source of illustrative materials for teaching, and as a source of material on which students could practise their skills. This was in addition to consideration of the preservation role of the repository.
Paul Lambert of the University of Stirling presented a paper on the challenges of curating occupational information resources. One of the major problems surrounding reuse of occupational surveys is the large number of coding schemes used for recording the data, most of which are inconsistent with each other and idiosyncratically applied. The (partial) solution applied by Stirling's Grid Enabled Occupational Data Environment (GEODE) is to mark up each data set with the encoding scheme used, employing a registry of such schemes. Where links between schemes can be made, this is noted in the registry to allow such translations as are possible. The context is also recorded, e.g. the geographical area, time period, and social group studied, and reference unit used for analysis, e.g. just individuals' current occupation, their career history, or the occupations of those in a single household. All these factors are important to researchers looking to compare their data with earlier sets.
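The registry idea described above can be illustrated with a minimal sketch. It is not GEODE's implementation: the class, the scheme names and the codes are all invented placeholders, and it assumes that a link between two schemes can be represented as a simple code-to-code lookup table.

```python
from dataclasses import dataclass, field


@dataclass
class SchemeRegistry:
    """Hypothetical registry of coding schemes and the known links between them."""

    schemes: set = field(default_factory=set)
    # Maps (source_scheme, target_scheme) to a code-to-code lookup table.
    crosswalks: dict = field(default_factory=dict)

    def register(self, scheme):
        self.schemes.add(scheme)

    def link(self, source, target, mapping):
        # Both schemes must be registered before a link can be recorded.
        if source not in self.schemes or target not in self.schemes:
            raise KeyError("unknown scheme")
        self.crosswalks[(source, target)] = dict(mapping)

    def translate(self, code, source, target):
        """Return the translated code, or None where no link is recorded."""
        if source == target:
            return code
        table = self.crosswalks.get((source, target))
        if table is None:
            return None
        return table.get(code)


# Placeholder scheme names and codes, invented for illustration only.
registry = SchemeRegistry()
for name in ("SCHEME_A", "SCHEME_B", "SCHEME_C"):
    registry.register(name)
registry.link("SCHEME_A", "SCHEME_B", {"101": "X1", "102": "X2"})

translated = registry.translate("101", "SCHEME_A", "SCHEME_B")
untranslatable = registry.translate("101", "SCHEME_A", "SCHEME_C")
```

Returning None where no link exists mirrors the 'partial' nature of the solution: a translation is offered only where the registry actually records one.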
Poster and Demo Session
The session started in the presentation room, with contributors giving a one-minute spoken advertisement for their poster or demonstration. Afterwards, contributors took up residence in the poster room to talk people through their poster or demonstration individually.
Jeremy Frey of the University of Southampton presented a paper on curating laboratory experimental data. In the context of the data life cycle he stressed the importance of recording sufficient metadata (and other aspects of good record-keeping) at the time of data creation (e.g. spectra of samples). Various tools are being developed in Chemistry for different stages of the life cycle. The SmartTea Project has been looking at electronic laboratory notebooks, and producing software that can capture and store high-quality metadata to accompany the record of the experimental process. The R4L (Repository for the Laboratory) Project is looking at ways to capture and store large quantities of laboratory data. The e-Bank Project is investigating ways of exposing data (specifically crystal structures) for e-Science purposes. The CombeChem Project is looking at blogs as a possible mechanism for converting experimental records, data and scientific discussions into a publishable format.
Frances Shipsey of LSE presented a paper describing a user requirements survey conducted for the VERSIONS Project, which attempts to deal with version control for academic documents. The findings of the study were that academic authors have to deal with large numbers of revisions, and that while most authors aim to keep copies of all or most of their revisions, many of these copies become inaccessible over time; this is sometimes due to technical difficulties and sometimes due to organisational changes. As well as VERSIONS, other projects covering similar ground are the NISO/ALPSP Working Group on Versions of Journal Articles (specifically dealing with stages from the point of submission to a publisher), RIVER (investigating what versioning means for various digital objects) and the JISC e-Prints Application Profile Working Group (applying Functional Requirements for Bibliographic Records to versions held in repositories)[*].
Yunhyong Kim presented a paper on automated genre classification. Here 'genre' means 'type of document', e.g. academic textbook, novel, scholarly article, newspaper/magazine, thesis. The paper compared the effectiveness of image-based and text-based classification techniques, and found that newspapers and magazines are more visually than textually distinct from other genres, whereas theses were found to be visually similar to business reports, minutes, textbooks and novels.
Liz Lyon of UKOLN gave a keynote address reflecting on Open Scholarship. The talk consisted mostly of a tour of initiatives working in the area of open source, open access and open data, and the new possibilities that arise from blogs, wikis and 'mashed-up' Web services.
Various aspects of the publishing process were tackled, such as open peer review (implemented in different ways by Nature and PLoS One), and new forms of publishing such as the molecule database held by the UCSD-Nature Signaling Gateway. The e-Prints Application Profile was introduced as a tool for providing the rich metadata sets that are required for the discovery and reuse of resources.
The talk concluded with consideration of the inter-disciplinary nature of e-Science, which was illustrated with an amusingly over-ambitious job description for a data curator/librarian for earthquake research data, requiring specialist knowledge of at least three different fields.
Panel Session on Open Science
The three panellists were Peter Murray-Rust of Cambridge University, Andreas Gonzalez of Edinburgh University, and Shuichi Iwata of CODATA.
In his opening talk, Peter pointed out the futility of trying to guess at the original data from the imagery published in journals, and explained that open access is not enough for e-Science: data needs to be openly reusable. Many publishers forbid this in their copyright declarations; even though the data is a set of facts, copyright still applies to the database. Andreas explained how the Science Commons movement is trying to find a legal solution to such problems as database rights and patents. Shuichi described how the industrial capitalist paradigm has driven science and technology for the past two centuries, and pointed out that while this has achieved much, it has also caused several problems: one case in point is IPR issues stunting scientific growth.
On the question of the ownership of data automatically generated by instruments (usually tied to the question of who owns the instrument), Andreas recommended writing down who owns what right at the start to avoid confusion later. Peter recommended that when it is unclear whether copyright rests with the researcher or the university, researchers should release the work under Creative Commons and if this is a mistake, the university can correct it later! The discussion further included issues relating to sensitive data such as medical and other personal data which cannot be made open by default.
Adrian Brown of The National Archives (TNA) presented a paper on the PRONOM format registry and the DROID format recognition tool. TNA is developing an active preservation framework around PRONOM. This framework consists of three functions forming a cycle: characterisation, preservation planning and preservation action. The characterisation process will identify and validate the format of files, and extract the significant preservation properties (e.g. the compression algorithm used, the duration of an audiovisual clip, the text from a document). The preservation planning process will monitor the preservation risk associated with files held by TNA, and when the risks rise above a certain level for a set of files, the process will produce a migration plan for those files. The preservation action process carries out the migration, after which point the characterisation process recommences and the results are compared with the original files.
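The first two steps of the cycle described above can be sketched in a few lines. This is an illustration only and uses neither PRONOM data nor the DROID tool: the signature table holds a handful of well-known leading byte sequences, while the risk scores, the threshold and the migration targets are invented for the example. The preservation action step (the migration itself) is left out.

```python
# A few well-known leading byte sequences; not PRONOM's signature data.
SIGNATURES = {
    b"%PDF": "pdf",
    b"\x89PNG": "png",
    b"GIF8": "gif",
}

# Invented risk scores (0 = low, 1 = high) and invented migration targets.
FORMAT_RISK = {"pdf": 0.2, "png": 0.1, "gif": 0.7}
MIGRATION_TARGET = {"gif": "png"}
RISK_THRESHOLD = 0.5


def characterise(data):
    """Identify a format from the leading bytes of a file."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return "unknown"


def plan_migrations(holdings):
    """Return {name: target_format} for files whose format risk is too high."""
    plan = {}
    for name, data in holdings.items():
        fmt = characterise(data)
        # Formats without a recorded score are treated as high risk.
        risk = FORMAT_RISK.get(fmt, 1.0)
        if risk > RISK_THRESHOLD and fmt in MIGRATION_TARGET:
            plan[name] = MIGRATION_TARGET[fmt]
    return plan


holdings = {"report": b"%PDF-1.4 ...", "logo": b"GIF89a ..."}
plan = plan_migrations(holdings)
```

After a migration had been carried out, running characterise again on the new files would correspond to the point at which the cycle recommences and the results are compared with the originals.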
MacKenzie Smith of MIT Libraries and Reagan W. Moore of the San Diego Supercomputer Center presented a paper on the usefulness of machine-readable digital archive policies. The idea is that archivists can take assessment criteria for a trustworthy repository such as the RLG/NARA Checklist and translate them into policies and thereby functional requirements for their repository. These requirements are taken by repository technicians and translated into machine-readable rules, which are used to control the preservation activities of the repository. Metadata and state information can be generated to record the outcome of the various activities, and these can be analysed later to confirm that the rules are satisfying the assessment criteria used at the beginning of the cycle. MIT Libraries, the University of California San Diego Libraries and the San Diego Supercomputer Center are putting all of this into practice using DSpace and iRODS.
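As a rough illustration of what a machine-readable rule might look like, the sketch below expresses two invented policies as data and records the outcome of checking each item against them. It does not use the iRODS rule language or any DSpace API; the rule format, field names and operators are assumptions made purely for the example.

```python
# Hypothetical rules, each derived from a policy statement such as
# "every item is held in at least two copies".
RULES = [
    {"id": "min-replicas", "field": "replicas", "op": "ge", "value": 2},
    {"id": "checksum-recorded", "field": "checksum", "op": "present"},
]


def evaluate(rule, item):
    """Check a single item's metadata against a single rule."""
    value = item.get(rule["field"])
    if rule["op"] == "present":
        return value is not None
    if rule["op"] == "ge":
        return value is not None and value >= rule["value"]
    raise ValueError("unsupported operator: " + rule["op"])


def audit(items, rules):
    """Return state information: one record per (item, rule) outcome."""
    return [
        {"item": name, "rule": rule["id"], "passed": evaluate(rule, item)}
        for name, item in items.items()
        for rule in rules
    ]


# Invented item metadata for the example.
items = {
    "thesis.pdf": {"replicas": 3, "checksum": "abc123"},
    "scan.tif": {"replicas": 1, "checksum": None},
}
report = audit(items, RULES)
failures = [(r["item"], r["rule"]) for r in report if not r["passed"]]
```

The list returned by audit stands in for the state information mentioned above: it can be stored and analysed later to check whether the rules are in fact meeting the original assessment criteria.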
Paul Watry of the University of Liverpool presented a paper on a prototype persistent archive being developed jointly by the University of Liverpool, the University of Maryland and the San Diego Supercomputer Center. The archive is modular in architecture, with three layers (application, digital library, data grid) each with various functional elements. The showcase tool was Multivalent, currently used as the user interface in the application layer and the document parser in the digital library layer. Multivalent is a Java program that uses media adapters to interpret the content and functionality of documents, and allows interaction with and the annotation of the document without changing the original file(s). While it was presented as a panacea to open practically any document or data file, the range of supported formats is actually quite limited at the moment (for example, its HTML support only goes up to HTML 3.2).
Graham Pryor of the University of Edinburgh presented a paper on the survey conducted for Project StORe, which is developing middleware to allow reciprocal linking between material in data repositories and material in publication repositories. The survey found that within the seven disciplines surveyed, dual deposit of publications and data was already an accepted working practice, and that there was a great deal of enthusiasm for international data sharing, data curation and data mining efforts. The importance of metadata was also appreciated. There were, however, a number of cultural and organisational barriers to data deposit. These ranged from an unwillingness to relinquish control of data, to fears that the data might be misinterpreted, to a perception of increased workload (due to time-consuming metadata creation). The current state of practice with regard to repositories was found to vary enormously between disciplines.
Three parallel sessions were held. We both attended the session on preservation repository experiences. The other sessions were on educating data scientists, and policies and persuasion.
Sam Pepler of the British Atmospheric Data Centre presented a paper tackling the issue of citing data sets. The method used was to derive a citation style by analogy with the National Library of Medicine-recommended format for the bibliographic citation of databases and retrieval systems. A lot of emphasis was placed on including a reference to a peer review process.
Jessie Hey of the University of Southampton reported on work being done by the PRESERV Project to link up repositories with The National Archives' PRONOM registry. The work has so far centred on using the DROID tool to produce file format profiles for registries, inform preservation policies and flag up unusual items.
Gareth Knight of the AHDS (Arts and Humanities Data Service) presented a paper on the SHERPA DP Project, which is looking at model architectures for repositories wishing to outsource their preservation functions. The principle is that the preservation service maintains a mirror of the repository as a dark archive that the repository can call upon should any of its material become corrupted or inaccessible.
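The principle of calling on a mirror when material becomes corrupted or inaccessible can be sketched as follows. This is not the SHERPA DP architecture: it assumes, purely for illustration, that both stores are in-memory dictionaries and that a SHA-256 checksum was recorded for each item at deposit.

```python
import hashlib


def sha256(data):
    """Return the hexadecimal SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def restore_from_mirror(repository, mirror, checksums):
    """Replace missing or corrupted items with verified copies from the mirror.

    Returns the sorted names of the items that were restored.
    """
    restored = []
    for name, expected in checksums.items():
        current = repository.get(name)
        if current is None or sha256(current) != expected:
            copy = mirror.get(name)
            # Only trust the mirror copy if it still matches the checksum.
            if copy is not None and sha256(copy) == expected:
                repository[name] = copy
                restored.append(name)
    return sorted(restored)


# Invented example content: one corrupted item and one missing item.
mirror = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
checksums = {name: sha256(data) for name, data in mirror.items()}
repository = {"a.txt": b"alpha", "b.txt": b"CORRUPTED"}

restored = restore_from_mirror(repository, mirror, checksums)
```

Checking the mirror copy against the recorded checksum before restoring guards against propagating damage that has affected both stores.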
Closing Plenary Session
The Conference closed with the launch of the International Journal of Digital Curation. This was followed by a keynote speech by Clifford Lynch of the Coalition for Networked Information. Digital curation as a phrase only became common currency in about 2000, but has now been recognised as a fundamental pillar of e-Science. Current trends are to focus on the 'knowable' life cycle, without worrying too much about the 'speculative' life cycle: the various ways in which data might be reused in the future. There is also a strong emphasis at the moment on putting data and metadata in good order right from the start, to make it easier to handle in the future. While work on metadata is progressing, there is still much more to be done.
The issues of digital curation are forcing reconsideration of other areas, such as the amount of informatics support given to researchers, and the ethics surrounding the use and reuse of health and social science data. They are also calling into question the nature of scholarly communication, and how it relates to e-Science. The rise of the Virtual Organisation, by its very nature an ephemeral beast, poses particular challenges for digital handling.
Cliff concluded by mentioning a topic that has not received much attention as yet: how to manage historical digital collections centred on a significant person or an organisation. How will data curators be able to cope with the diverse range of formats? How will such collections be brought together from the scatter of disc drives and Web services such as Flickr?
The Conference was enjoyable and brought together some highly valuable research in the field. The theme of the Conference, Digital Data Curation in Practice, was evident in the mix of papers presented. The conference Web site and forum remain open for reference.
*Editor's note: Readers may be interested to read of work by the JISC e-Prints Application Profile Working Group in this issue.
- The Digital Curation Centre web site http://www.dcc.ac.uk/
- 2nd International DCC Conference 2006: Digital Data Curation in Practice http://www.dcc.ac.uk/events/dcc-2006/
- The list of posters is available from http://www.dcc.ac.uk/events/dcc-2006/posters/
- The SmartTea Project http://www.smarttea.org/
- The R4L (Repository for the Laboratory) Project http://r4l.eprints.org/
- The eBank UK Project http://www.ukoln.ac.uk/projects/ebank-uk/
- The CombeChem Project http://www.combechem.org/
- The VERSIONS Project http://www.lse.ac.uk/library/versions/
- Functional Requirements for Bibliographic Records (FRBR) http://www.ifla.org/VII/s13/frbr/frbr.htm
- UCSD-Nature Signaling Gateway http://www.signaling-gateway.org/
- TNA's PRONOM Registry http://www.nationalarchives.gov.uk/pronom/
- RLG/NARA Checklist http://www.rlg.org/en/pdfs/rlgnara-repositorieschecklist.pdf
- International Journal of Digital Curation http://www.ijdc.net/
- 2nd International DCC Conference Forum http://forum.dcc.ac.uk/viewforum.php?f=19