Web Magazine for Information Professionals

Digital Curation: Where Do We Go from Here?

Peter Kerr, Fiona Reddington and Max Wilkinson report on the 1st International Digital Curation Conference held in Bath in September 2005.

The conference aimed to raise awareness of key issues in digital curation and to encourage active participation and feedback from the relevant stakeholder communities. The conference attracted an impressive range of keynote speakers and focused on the following areas:

The participants were a mix of researchers, curators, policy makers and representatives from funding agencies that are engaged, or have an interest, in the creation, use and management of digital data.

The conference was officially opened by Chris Rusbridge, Director of the DCC, who welcomed everyone to the event and encouraged attendees to ask plenty of questions and actively participate in the discussion sessions to make the conference as interactive as possible.

Graham Cameron, Associate Director at EBI, opens the Conference with his Keynote Speech

Day 1

Data Curation and the Scientific Record: Opportunities, Challenges and Pitfalls

Graham Cameron, Associate Director, European Bioinformatics Institute

Graham Cameron explained the EBI's experience with biomolecular data and illustrated the remarkable enhancement to the conduct of science offered by shared curated databases. He illustrated the usefulness of biomolecular data to scientific communities by providing compelling data about the submission rates for primary research data and the global access to the EBI Web site (both of which are exponential in growth) [1]. However, it is the integration of databases that now provides the greatest challenge. Graham stimulated discussion about long-established conventions in constructing and maintaining databases and repositories of biological data and concluded that the integrity of the conventional (printed) record is woefully inadequate in the electronic era.

Overview of the Digital Curation Centre

Chris Rusbridge, DCC Director

Chris Rusbridge outlined the main objectives, membership, establishment and work to date of the DCC. He explained that the DCC will not provide any data storage facilities but will focus on providing researchers with the tools, advice and services they need to curate their data properly. Several projects from a variety of disciplines were described, highlighting the explosion in data generation that has accompanied the emergence of new technologies. Attention was drawn to the long-term sustainability of databases and the need for large-scale institutions and funding agencies to adopt data policies. The scope of the DCC was also discussed. The importance of participation in the Associates Network was highlighted, and a summary of future workshops and the launch of The International Journal of Digital Curation were also presented [2].

Following lunch, Chris Rusbridge chaired a symposium entitled "What is Digital Curation?" Professor Peter Buneman and Dr David Giaretta gave short presentations designed to provoke discussion among attendees about the different approaches to data curation and the skills, strategies and partnerships needed to ensure that effective data curation takes place. The main topics discussed were:

Should the definition of data curation encompass strategies for preserving and accessing data? Opinions differed, but a clear definition, in terms of the remit of the DCC, will need to be agreed.

What data should be preserved? It will not be possible to preserve and/or curate all the data that is generated. Careful thought will need to be given to the strategies used to decide what data to preserve; these strategies are likely to differ between disciplines.

The issue of ensuring sufficient interaction between the data generators (domain experts) and librarians/archivists (preservation experts). What is the best way to do this? Is it realistic to expect people to have expertise in both? Training will need to be addressed alongside encouraging individuals with different expertise to work together.

How can we future-proof what is archived? We need to ensure that the technologies used currently are not obsolete in the future and that sufficient migration strategies are put in place and funded.

How do we achieve policy changes at the level of academic institutions and funding agencies? Data-sharing policies that encourage thinking about data preservation and curation at the early stages of a project's lifespan would be beneficial. The National Cancer Research Institute (NCRI) Informatics Initiative has generated a data-sharing policy [3] that is now being adopted by the top 20 cancer funders in the UK. How could this approach apply to other areas?

Sustainable resources: It is all very well encouraging people to submit their data for archiving and preservation, but we need to ensure that the resources which hold the data are sustainable in the long term.

After a well-deserved coffee break, the afternoon session dealt with "Global Policy for Curation" and was chaired by Richard Boulderstone, Director of e-Strategy at the British Library.

nestor - the German Approach to Digital Curation

Stefan Strathmann, Goettingen State and University Library

Stefan Strathmann provided the meeting with an update on the 'nestor' Project, funded by the German Federal Ministry of Education and Research [4]. Along with organised workshops and conferences, nestor has delivered, or is due to deliver, a number of expert reports regarding the German experience and approach to digital curation. These reports will form the basis of guidelines to stakeholders regarding digital preservation, translations of articles, future teaching materials and a project glossary.

NDIIPP and Approaches to Digital Curation in the US

Caroline Arms, Office of Strategic Initiatives, Library of Congress

Caroline Arms described the National Digital Information Infrastructure and Preservation Program (NDIIPP) [5]. Created in 2002, the NDIIPP aims to work with public and private stakeholders to support the preservation of significant 'born-digital' data that is at risk. The NDIIPP operates as an external focus to encourage collaboration across different sectors and to establish connections between stakeholder communities. Awards are provided to form co-operative agreements and promote four principles:

The last session of the day was the poster session which preceded the hugely enjoyable conference dinner at the Roman Baths and Pump Rooms.

Drinks at the Roman Baths

Day 2

Day two of the conference opened with the issue of "Data Curation: A Question of Scale", which was chaired by Professor Malcolm Atkinson, Director, National eScience Centre.

The Data Challenge in Astronomy

Professor Andy Lawrence, Regius Professor of Astronomy and Head of the School of Physics, University of Edinburgh

Andy Lawrence described the imbalance in funding for IT in astronomy: facility operations and facility output processing tend to be well funded, whereas the science archives and end-user tools attract less money. Astronomical archives are growing in size very quickly; however, the issue is one of management and data access rather than storage. The amount of data itself is not a technical problem, but the data is heterogeneous and end-users of the data have different demands. The astronomy field has recognised that the issue is one of interoperability of archives and that there is a need for standards, transparent infrastructure and specialised data centres. To address this, the Virtual Observatory is being established. The concept is one of all databases being easily accessible to the end-user. Rather than a warehouse, it will consist of a small set of service centres and large populations of end-users. The International Virtual Observatory Alliance [6] has set up technical working groups to agree on the standards necessary, and the key projects include AstroGrid, US-NVO and Euro-VO. This interoperability should drive future scientific developments in the astronomy field.

Following a coffee break, there were parallel sessions addressing "Sustainability: Technical and Economic Challenges" and "Meeting User Requirements":

Sustainability: Technical and Economic Challenges

Chair: Kevin Ashley, Head of Archive, University of London Computing Centre

Kevin Ashley explained that this session would provide an introduction to the DSpace Project, which is providing a cross-disciplinary data archiving solution, as well as an overview of the technical and economic challenges that need to be considered with regard to the sustainability of data resources and the data they contain.

Managing MIT's Digital Research Data with DSpace

Mackenzie Smith, MIT Libraries

Mackenzie Smith spoke about the growing realisation that data management is an important issue for large-scale institutions to address, both to enhance research programmes and to address the business need for data validation and mining in the future. The disparity between the approaches of different disciplines to the access to, and sharing of, data was discussed, and the differing availability of large-scale repositories to support disciplines was also highlighted. The complex nature of research data was described, and the tasks needed for effective curation and archiving, together with the need for an in-depth understanding of the subject, were highlighted. The DSpace Project was explained [7] and the benefit of being able to store different data types in an open and interoperable way was described. The preservation policy being adopted is that it is better to preserve what we can now and refine this as needs and technologies change over time. This is deemed to minimise the risk of losing data completely while trying to devise the 'perfect' solution.

Sustainability: Technical and Economic Challenges, the DPC Perspective

Maggie Jones, Digital Preservation Coalition (DPC)

Maggie Jones spoke about the challenge of effectively archiving and preserving digital collections as the pace of data generation grows ever faster. Maggie explained that a more collaborative approach will be needed to provide multi-disciplinary, cost-effective archiving solutions and that we should aim to manage data from its creation onwards. However, we need to take a pragmatic approach, and Maggie echoed the previous speaker's approach of preserving what we can now to minimise the risk of losing data. Several initiatives are now addressing key issues [8] [9]. The future work-plan of the DPC was outlined, including the provision of targeted training courses, and attendees were encouraged to complete the DPC survey [10]. This survey will not only identify what is being created in digital format and how it is being preserved, but will also assess the level of risk and vulnerability to loss, and determine priorities for action. The results will enable the DPC to accelerate, influence and inform the development of a UK digital preservation strategy.

Meeting User Requirements

Chair: Professor Kevin Edge, Pro-Vice Chancellor, University of Bath

Professor Kevin Edge chaired a session looking at how user requirements are being met in the digital curation age. This took the form of a panel discussion with Sheila Anderson, Professor Kevin Schurer and Mark Thorley on the panel. The discussion centred on a number of topics, including the main threats to providing continued access over time to the research and teaching digital objects in a collection. The group discussed what could be done to minimise these threats. The panel also examined the role of research councils in supporting the preservation of, and access to, data, and how researchers or data creators could minimise the risk of data being lost or becoming inaccessible in the longer term.

Following lunch, there were once again parallel sessions addressing "Socio-legal Issues" and "The Research Agenda":

Socio-legal Issues

Neil Beagrie, JISC

Legal issues surrounding data curation are complex. This is most generally a result of heterogeneous data, the ownership and custodian roles of researchers and curators, and uncertainty over who is responsible for the persistence and preservation of data. Much of the conference up until now had dealt with identifying a curator's role, the identification of data to preserve and the programmes that have developed to address these issues. This session dealt with the curation of data in the context of copyright and access rights and how these issues were shaping the way data creators, curators and funders approach data preservation and reuse.

An Exploration of the Legal Issues associated with Data Curation

Andrew Charlesworth, University of Bristol

Andrew Charlesworth, Senior Research Fellow in IT and Law at the University of Bristol, spoke on the role of copyright law in data curation. Andrew noted that recent events had led to increased expectations and influence of rights holders and that this had generated a backlash, which considers copyright law to be over-strength, over-regulated and over-rated. The result of this backlash has been the emergence of open-source movements and Creative Commons, where some copyright agreements exist but to a much lesser degree. Andrew then presented a Science Commons where standardised licences are voluntarily agreed but still manage to create areas of free access and inquiry. The major areas of this commons can be found in publishing, licensing and the use (or more correctly reuse) of data. Andrew suggested that copyright should be more focused on the researcher, the so-called 'researcher rights', but that this required more clarification and depended on engaging the data creators and curators to work out what these 'researcher rights' should be.

Curation Case Study: e-Diamond

Dr Paul Taylor, Centre for Health Informatics and Multiprofessional Education, University College London

Paul Taylor presented a case study of curation based on the e-Diamond Project, which aimed to deliver a prototype Grid-based image repository that could be used as a training tool in breast screening, epidemiology and data mining and would exemplify the utility of data reuse and added value [11]. Paul introduced the breast-screening programme in the UK and provided an overview of why digitising and archiving mammograms would be beneficial for researchers, healthcare professionals and patients. Paul talked about the problems with ethical and consent issues which presented significant barriers to the project but suggested that the barriers were a result of both complex and bureaucratic ethics approval processes and the project team's under-estimation of their complexity. Paul concluded his presentation by summarising the significant output of e-Diamond, a valuable set of mammograms and a proof of concept for the value of Grid infrastructure for evidence-based medicine and research application.

The Research Agenda

Chair: Professor Peter Buneman, DCC Associate Director (Research)

Professor Peter Buneman chaired this parallel session which highlighted some of the research occurring in the field of digital curation. There was standing-room only for this well-attended session.

AstroDAS and Mondrian: New Approaches for Annotating Scientific Databases

Dr Rajendra Bose, DCC

Dr Rajendra Bose introduced new approaches to annotation and described the relevance of annotation to digital curation. He first described the AstroDAS Project, which built on the distributed annotation system (DAS) from the bioinformatics field (BioDAS). This prototype aims to link the annotated entries across diverse astronomy sky catalogues that correspond to the same sky object. The Mondrian Project, another model of annotation, was also presented; it allows annotation of the associations between attribute values as well as of the attribute values themselves.
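
The two annotation ideas described in this talk can be illustrated with a small sketch. It is not the AstroDAS or Mondrian implementation; every class, field and catalogue name below is hypothetical and chosen only for illustration. The first structure records an assertion that entries in different catalogues refer to the same sky object; the second attaches a note either to a single attribute value or to a group of attribute values taken together.

```python
from dataclasses import dataclass

# All names here are hypothetical; this is not the AstroDAS or Mondrian API.

@dataclass(frozen=True)
class CatalogueEntry:
    """A reference to one entry in one sky catalogue."""
    catalogue: str
    entry_id: str

@dataclass
class SameObjectAnnotation:
    """Asserts that several catalogue entries describe the same sky object."""
    entries: frozenset
    author: str

@dataclass(frozen=True)
class ValueAnnotation:
    """A note on one attribute value, or on several values taken together."""
    record_id: str
    attributes: frozenset
    note: str

    @property
    def is_association(self) -> bool:
        # More than one attribute means the note concerns their association.
        return len(self.attributes) > 1

class AnnotationStore:
    """Keeps same-object annotations and indexes them by catalogue entry."""

    def __init__(self):
        self._by_entry = {}

    def add(self, annotation: SameObjectAnnotation) -> None:
        for entry in annotation.entries:
            self._by_entry.setdefault(entry, []).append(annotation)

    def counterparts(self, entry: CatalogueEntry) -> set:
        """Entries elsewhere that have been asserted to match `entry`."""
        found = set()
        for annotation in self._by_entry.get(entry, []):
            found |= annotation.entries
        found.discard(entry)
        return found

# Usage with made-up identifiers.
a = CatalogueEntry("catalogue_A", "A-001")
b = CatalogueEntry("catalogue_B", "B-417")
store = AnnotationStore()
store.add(SameObjectAnnotation(frozenset({a, b}), author="curator"))

single = ValueAnnotation("A-001", frozenset({"magnitude"}), "value revised")
joint = ValueAnnotation("A-001", frozenset({"magnitude", "redshift"}),
                        "measured on different nights")
```

Keeping annotations in a separate store, rather than writing them into the catalogues, reflects the distributed character of the approach described: the source catalogues stay unchanged and the links between them live alongside.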

Immortal Information, Engineering Grand Challenge Project

Professor Chris McMahon, University of Bath

Professor Chris McMahon described how engineering companies are dealing with a shift from product delivery to a situation where firms both supply products and support them through their lifetime. This creates a need for effective information creation, management, storage, access and possibly re-creation through many generations of information technology and systems. The Engineering Grand Challenge Project involves a number of Innovative Manufacturing Research Centres, as well as other partners, and has three streams:

Many of the issues described echoed the digital curation issues of other communities.

The Long-term Preservation of Scientific Models

Professor Jane Hunter, University of Queensland

Professor Jane Hunter started by describing the PANIC Project [12] which aims to address the preservation and accessibility of composite digital objects using semantic Web services. Scientific data is organised into scientific models which encapsulate contextual information, processing steps and derived knowledge as well as links to the raw data itself. There was also discussion of the tools and services that might be needed as these scientific model packages evolve over time.

Closing Keynote Address

Dr Clifford Lynch, Coalition for Networked Information (CNI)

Dr Clifford Lynch provided an entertaining and informative overview of the conference and highlighted some of the key issues that had been raised. In particular, the rising importance of digital data in research, learning and teaching was emphasised and the need to adapt our institutional and individual practices to reflect this was highlighted. The need for a collaborative approach was reiterated and the usefulness of conferences such as this one, as a forum for discussion, was reinforced.

Conclusions

The conference ended with Chris Rusbridge thanking the speakers and attendees for their participation in what had proved to be a very enjoyable and informative conference. It is clear that a lot of work will need to be done to address the issues raised at the conference and that community participation is going to be key to making the DCC a success. We at the NCRI Informatics Initiative look forward to remaining involved, attending future events and monitoring the progress of the project.

References

  1. European Bioinformatics Institute (EBI) http://www.ebi.ac.uk/
  2. International Journal of Digital Curation http://www.ijdc.net/
  3. NCRI Data Sharing Documents http://www.cancerinformatics.org.uk/documents.htm#datasharing
  4. nestor http://nestor.sub.uni-goettingen.de/?lang=en
  5. National Digital Information Infrastructure and Preservation Program (NDIIPP) http://www.digitalpreservation.gov/
  6. International Virtual Observatory Alliance (IVOA) http://www.ivoa.net/
  7. DSpace http://www.dspace.org/
  8. UK Web Archiving Consortium http://www.webarchive.org.uk/
  9. International Internet Preservation Consortium (IIPC) http://netpreserve.org
  10. UK Digital Preservation Needs Assessment survey http://www.tessella.com/dpcsurvey/
  11. eDiaMoND http://www.ediamond.ox.ac.uk/
  12. The PANIC (Preservation webservices Architecture for Newmedia and Interactive Collections) Project http://metadata.net/panic/

Author Details

Dr Peter Kerr
Scientific Programme Manager
NCRI Informatics Coordination Unit
61 Lincoln's Inn Fields
PO Box 123
London
WC2A 3PX

Email: peter.kerr@ncri.org.uk
Web site: http://www.cancerinformatics.org.uk

Dr Fiona Reddington
Scientific Programme Manager
NCRI Informatics Coordination Unit
61 Lincoln's Inn Fields
PO Box 123
London
WC2A 3PX

Email: fiona.reddington@ncri.org.uk
Web site: http://www.cancerinformatics.org.uk

Dr Max Wilkinson
Scientific Programme Manager
NCRI Informatics Coordination Unit
61 Lincoln's Inn Fields
PO Box 123
London
WC2A 3PX

Email: max.wilkinson@ncri.org.uk
Web site: http://www.cancerinformatics.org.uk
