The International Digital Curation Conference has been held annually by the Digital Curation Centre (DCC) since 2005, quickly establishing a reputation for high-quality presentations and papers. So much so that, as co-chair Allen Renear explained in his opening remarks, after attending the 2006 Conference in Glasgow, delegates from the University of Illinois at Urbana-Champaign (UIUC) offered to bring the event to Chicago. Thus it was that the sixth conference in the series, entitled 'Participation and Practice: Growing the Curation Community through the Data Decade', came to be held jointly by the DCC, UIUC and the Coalition for Networked Information (CNI).
The conference was preceded by a day of workshops: 'Digital Curation 101 Lite', a training course run by the DCC, focusing this time on data management planning; 'CURATEcamp', the second in a series of unconferences on digital curation tools (the first having been held at the University of California, Berkeley, in August 2010); 'Improving researchers' competency in information handling and data management through a collaborative approach', an introduction to the Research Information Network's Working Group on Information-Handling; 'Introduction to the Data Curation Profile', the latter being a requirements-gathering tool developed by librarians at Purdue University and researchers at the Graduate School of Library and Information Science at UIUC; and 'Scaling-up to Integrated Research Data Management', organised by the I2S2 Project as an exploration of data management issues arising from working across organisational boundaries and at different scales of science.
Chris Lintott opened the conference with an inspiring talk about the potential of citizen science. The Sloan Digital Sky Survey, which ran between 2000 and 2008, represented a new way of performing astronomy: instead of teams identifying interesting objects and then booking telescope time to examine them, a telescope was set up to collect data from entire strips of the night sky, for astronomers to examine later. In the end, about a quarter of the sky was mapped out in this way. This is, of course, a lot of data, and any astronomers deriving results from the whole dataset soon run up against issues of scale. One team, interested in the frequency with which certain types and orientations of galaxy appear, found that, even at a rate of 50,000 galaxy images a month, they could not hope to classify them all by hand in a reasonable amount of time. Neither could they fall back on computer classification, this being of insufficient quality.
The solution they came up with was to enlist the help of the public, and so Galaxy Zoo was born. This platform made it quick and easy for anyone to classify galaxies; it proved so popular that within one month around 80,000 volunteers had made over 10 million classifications. Before long, each galaxy image had been independently classified by upwards of 20 volunteers, meaning the confidence in each classification could be measured. There were other, unexpected benefits of having so many eyes on the data. The volunteers were spotting unusual phenomena in the images and reporting them in the Galaxy Zoo forums. This led to the discovery of a frog-shaped object now known as Hanny's Voorwerp, and a class of green spheroids that were nicknamed 'peas'.
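As a rough illustration of how independent classifications allow confidence to be measured, the sketch below takes the votes cast for a single image and reports the most popular label together with the fraction of volunteers who chose it. The talk did not specify how Galaxy Zoo actually computes confidence, so the simple agreement fraction, the function name and the sample votes are all assumptions made for illustration.

```python
from collections import Counter


def classification_confidence(votes):
    """Return (most common label, fraction of votes agreeing with it).

    Assumption: confidence is modelled as plain agreement among
    volunteers; the real project may weight or calibrate votes.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    return label, count / len(votes)


# Hypothetical votes from 20 volunteers for one galaxy image
votes = ["spiral"] * 16 + ["elliptical"] * 4
label, confidence = classification_confidence(votes)
print(label, confidence)
```

An image on which volunteers split evenly would score close to 0.5 for two labels, flagging it as a candidate for further inspection.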
The original Galaxy Zoo project has now concluded, but it proved so successful that its Zooniverse platform has been developed further and now supports eight different projects, three of which are new Galaxy Zoo projects, and one of which is nothing to do with astronomy (it concerns transcribing ships' logs to extract climate data).
Lintott drew three lessons of citizen science from the Galaxy Zoo experience. First, people are better motivated to participate if they understand the purpose, context and consequences of the activity. Second, volunteers prefer to be treated as collaborators rather than test subjects. Lastly, projects should not waste people's time by, for example, soliciting comments that will never be read, or inviting people to participate in activities that make no real contribution.
In his talk, Kevin Ashley asked how many curation services are enough. He identified three centres for digital curation in the US, but these centres do not provide just a single service, and in any case there are many stakeholders: libraries, archives, data centres, publishers and so on. He went on to consider some of the services one might need for digital curation:
Revisiting comments made by Martin Lewis at the 2008 Conference about the roles university libraries could play in digital curation, Kevin noted that while some of them had come to pass - raising awareness, leading on data management policy issues, working with institutional IT services - others had not: for example, promoting data literacy, and training staff in digital curation. Ashley also gave an update on the UK Research Data Service (UKRDS); plans are complete for a pathfinder service consisting of local data archiving capability at four UK institutions, with support and co-ordination provided nationally by a small team. There are still questions to be resolved, however. Will all institutions have a UKRDS node? Will the role of subject data centres change? Which services should be local and which national? Will there be an international role for UKRDS? How will data curation fit in with other aspects of research administration within institutions?
Kevin concluded with some thoughts about the uses to which research data might be put if made available, pointing to courses for journalists on telling stories from data, and the Globe4D system for visualising and interacting with planetary data.
There is a lot of chemical information and data on the Internet, but a substantial proportion of it is wrong, some dangerously so. Even trusted chemical databases contain a surprising number of errors and inconsistencies. In response to this, Antony Williams and colleagues set up ChemSpider as a community catalogue of chemical structures. It collects links to information – physical properties, toxicity data, metabolism data, safety information, associated journal papers, database entries and patents, etc. – from over 400 different data sources, including journals, vendor databases and individual chemists. This information is then curated by a hierarchy of volunteers. One of the major tasks is to ensure that the identity relationships within the database – linking structures to the right identifiers, names and information – are correct and complete, but there is also work to do in correcting typographical errors, deprecating low-quality information, and so on.
Williams drew several lessons from his experiences with ChemSpider. The 'crowds' in crowdsourcing can be quite small: ChemSpider has validated over a million structure-identifier relationships over three years with only 130 volunteers. It is possible to perform subtle analyses with volunteer effort: for example, the Spectral Game uses player mistakes to locate poor-quality spectra in the database. Lastly, all chemical databases would benefit from accepting and displaying comments from users: ChemSpider has uncovered many mistakes in this way, while only experiencing three cases of vandalism.
Systems biology is a discipline where a vast amount of data has to be gathered before patterns begin to emerge and conclusions can be drawn. Barend Mons described an effort to make this rather easier using Linked Data. The approach of the Concept Web Alliance and their partners in the Open PHACTS Consortium is to translate life sciences papers (for example) into sets of assertions expressed as subject-predicate-object triples, with each part of the triple expressed using a non-semantic Universally Unique Identifier (UUID) to eliminate ambiguity. Each of these triples, when combined with meta-assertions concerning authorship, date and provenance, forms a nano-publication. These nano-publications are entered as evidence into the ConceptWiki triple store. From the nano-publications a set of canonical assertions is generated: that is, nano-publications that all make the same assertion are grouped together, so the end-product is a canonical assertion supported by one or more nano-publications. In this way, one ends up with a database of unique assertions, where the trustworthiness of each assertion can be judged using the number and quality of supporting nano-publications. An inference engine can then run over the database to generate new knowledge.
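The grouping step described above can be sketched in a few lines. This is only an illustration of the idea, not the ConceptWiki implementation: the NanoPublication class, its field names and the short placeholder identifiers (standing in for UUIDs) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NanoPublication:
    # The triple itself; each part would be an opaque identifier (a UUID)
    subject: str
    predicate: str
    obj: str
    # Meta-assertions about the triple
    author: str
    date: str
    source: str


def canonical_assertions(nanopubs):
    """Group nano-publications that make the same assertion.

    Returns a dict mapping each unique (subject, predicate, object)
    triple to the list of nano-publications supporting it.
    """
    grouped = defaultdict(list)
    for pub in nanopubs:
        grouped[(pub.subject, pub.predicate, pub.obj)].append(pub)
    return dict(grouped)


# Hypothetical evidence: two papers make the same assertion
pubs = [
    NanoPublication("id-001", "id-101", "id-002", "Author A", "2010-03-01", "paper-1"),
    NanoPublication("id-001", "id-101", "id-002", "Author B", "2010-09-15", "paper-2"),
    NanoPublication("id-003", "id-102", "id-004", "Author A", "2010-03-01", "paper-1"),
]
for triple, support in canonical_assertions(pubs).items():
    print(triple, "supported by", len(support), "nano-publication(s)")
```

Counting supporters is the crudest possible trust signal; the quality of each supporting nano-publication, which the talk also mentioned, is not modelled here.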
The messages drawn by Barend from this work were that scientific data are most useful when in a form that allows computers to reason from them, and that there is a role for the crowd in performing quality-control over the assertions. He saw data publication and data citation metrics as priority issues to address for this approach to work.
In the humanities, ambiguity is an area of interest, rather than something to be eradicated, but this raises all sorts of tensions. John Unsworth took as an example the process of turning a messy Wittgenstein manuscript into clean text for printing. Normalisation of orthography and grammar may be important for retrieval, and thus for discovery, but the variations may be important evidence of an author's thought processes and creative style.
John went on to describe MONK, a digital environment allowing humanities scholars to perform text mining over a corpus currently consisting of American literature and the works of Shakespeare. The process for ingest into the corpus includes transformation of the source digital texts into TEI Analytics format, tagging for parts of speech and cataloguing. Again, there is plenty of scope for crowdsourced activity – proof-reading the texts, correcting the tagging and metadata – but there needs to be caution over eliminating potentially valuable information.
The early afternoon was set aside as a 'Community Space', allowing people to interact with the authors of posters, experience demonstrations, or hold their own informal meetings, curation clinics and so on. As in previous years, this was preceded by a session entitled 'Minute Madness', wherein poster authors and demonstrators gave one-minute pitches to attract delegates to their stand. All those participating rose to the challenge admirably, with the chair, Sayeed Choudhury of Johns Hopkins University, only having to intervene once with his referee's whistle.
The School of Information Studies at Syracuse University has set up courses designed to train students to be eScience professionals. Youngseek Kim described the method by which the curriculum was designed. Interviews and focus groups were set up with laboratory directors and researchers to determine the key characteristics of an eScience professional position, and this information was used to search for vacancies; 208 were found over a period of a month. The job requirements for these vacancies were analysed in collaboration with the focus group participants and interviewees to determine common work tasks, worker qualification requirements and organisational environments. This information was used to put together a prototype master's course.
Researchers then followed the progress of five of the students as they undertook summer internships. Daily activity logs and exit questionnaires were used to collect information about the tasks they performed and the knowledge and skills they had to use. This information was analysed in a similar way to the earlier data, and provided strong confirmation of the previous results. The recommendation arising from this research was that the course should focus on cyberinfrastructure, data curation and team communication.
The theme of the late afternoon symposium was set in two short presentations by Sheila Corrall and Christine Borgman. With the recent announcement that the National Science Foundation would start mandating data management plans (DMPs), delegates were asked to discuss what DMPs actually comprise, who should be leading data management strategy, and who should be the teachers and students of data management.
On the first matter, what constitutes a DMP, the discussion started with an argument that DMPs should be written by data creators, guided by institutional policy on what data management processes are supported. Delegates then considered issues such as measuring the success of data management, finding sustainable funding models – parallels were drawn with open access publishing – and ensuring data are archived by trustworthy hosts. Several different models were proposed for the governance of data management. At Cornell University, a virtual organisation has been set up to provide data management services. Another delegate suggested a bottom-up approach to data management, though others were not convinced this could be effective: at the very least a senior advocate is needed within the institution. Which people are involved does make a difference to the way data management is handled: archivists have a more ruthless attitude to data appraisal than librarians, and no doubt legal teams would bring an entirely different perspective should data management be seen as a compliance issue. Consideration should also be given to the service levels that are possible when using external data management services.
A wide spectrum of approaches was advocated with regard to data management education: everything from optional courses to intensive master's courses or doctorates. This reflected the fact that data management is relevant to so many people: not just professional curators, but also researchers, journalists and the general public. It may be possible to construct a common, core conceptual infrastructure for data management, but one shouldn't try to prescribe a common skill set: there is too much variability in what the different roles demand.
The themes of the day were brought together by Clifford Lynch. He started by considering the implications of DMP mandates: if principal investigators only have ambitions to archive data for five to ten years, this shifts the emphasis away from representation information and preservation metadata, towards bit preservation and other storage issues. Similarly, in the context of vast data resources, the most pressing curatorial challenges lie around quality-control and aggregation rather than preservation. The ease with which data can be replicated means that databases can be corrupted as never before, but the talks earlier in the day had shown the power of crowdsourcing and citizen science in improving (and even expanding) the scientific record.
The momentum of digital curation curricula shows that the discipline is approaching maturity, although there is plenty of scope for more joined-up thinking in the area, particularly in the US. There is also much more thinking to be done about handling ambiguity, contradictory data, levels of certainty of knowledge, and the contrasts between measurement and interpretation.
MacKenzie Smith opened proceedings on the second day with a talk on the interdisciplinarity (or metadisciplinarity) of data curation. She started by identifying seven functions that define digital curation. Finding data is currently an unsolved problem. Provenance information – authorship, authority, development history – is needed to make sense of data. Tools have been developed to help researchers analyse, visualise and reproduce results, that is, work with data, but they are not well catalogued, archived or accessible at the moment. There is much to be gained from aggregating data, but doing so in practice is hard: domain conventions vary, and even if the Resource Description Framework (RDF) is used as a common syntax, someone still has to do the semantic mappings between encodings. Furthermore, the laws surrounding data are not interoperable either. Methods of publishing data are just beginning to emerge. Referencing data is even trickier: how can one cite data at the right level of detail, giving due credit to data creators, without drowning aggregated datasets in a sea of attributions? By now we have a good understanding of curating data; the problem is how to fund this activity.
MacKenzie then provided technological, functional and organisational views on what she termed the curation ecology, and went on to describe the roles she saw each type of organisation playing. For example, she argued that researchers should focus on providing data with provenance metadata and clear statements on intellectual property; institutions should contribute data policy, incentives for researchers and a stable financial basis for data curation; and libraries should provide support to researchers, work on data models and metadata ontologies, and define archiving services in support of long-term preservation. These last points were illustrated using case studies from the University of Chicago, responsible for curating the Sloan Digital Sky Survey, and MIT.
One of the innovations at this year's conference was the introduction of a prize for the best student paper. The inaugural recipient of this accolade was Laura Wynholds, for her paper on managing identity in a world of poorly bounded digital objects. She argued that, in order to make the problem tractable, four conditions must hold: digital objects must be given neat boundaries, semantically and logically; digital objects should have identifiers embedded within them; it should be possible to retrieve digital objects using their identifiers; and identity management should be embedded within a scholarly information system, so that identifiers can also be used to retrieve information about authorship, intellectual property rights and so on.
Another innovation this year was the introduction of a second method of submission aimed at practitioners, where selection was based on abstracts rather than full papers. With more papers accepted than ever before, presentations had to be split between three parallel tracks of two sessions each. I attended the two sessions in the third track, which were more loosely themed than the other four.
Christopher Prom (University of Illinois at Urbana-Champaign) described his Practical E-Records Project, in which he assessed digital curation tools from an archival/records management perspective, developed policy templates, formulated recommendations, and posted the results to a blog.
James A. J. Wilson (University of Oxford) presented the latest results from the Sudamih Project, which is developing training modules to improve the data management skills of researchers in the humanities, and setting up a simple Web-based system to offer them databases-as-a-service.
Patricia Hswe (Penn State University) introduced her institution's Content Stewardship Program. Among other things, the programme saw an overhaul of Penn State's digital library platforms and the development of a Curation Architecture Prototype Service (CAPS), which uses microservices to fill in the gaps left by the other platforms, as identified by reference to the DCC Curation Lifecycle Model.
W. Aaron Collie (Michigan State University) argued the case for using electronic thesis and dissertation workflows as a testbed for data publishing in the wider field of scholarly communication.
Wendy White (University of Southampton) gave an overview of the IDMB Project, which is producing an institution-wide framework for managing research data, comprising a business plan for data management, pilot systems for integrating a data repository into researchers' workflows (using SharePoint and EPrints), and training courses and materials.
Robin Rice (University of Edinburgh) enumerated the various research data management initiatives underway at her institution, including the Data Library, Edinburgh DataShare (a data repository hosted by the Data Library), the Research Data Management Training (MANTRA) Project (responding to gaps identified during five Data Asset Framework case studies) and working groups on research data management (RDM) and research data storage (RDS).
Adrian Burton (Australian National Data Service) explained the ANDS Data Connections strategy and projects. In brief, the idea is to provide registers of identifiers so that data can be linked together more easily. So, for example, the Office of Spatial Data Management provides a scheme to identify locations, the National Library of Australia provides identifiers for people, research councils would provide identifiers for research activities, the Australian Bureau of Statistics would provide identifiers for academic fields derived from the Australia/New Zealand Standard Research Classifications, and so on.
Ellen Collins (Research Information Network) revealed the results of a survey looking at who is using the UK subject-based data centres and for what purposes, how much they use them, and what impact the data centres have had. The survey found the data centres were credited with improving data sharing culture, improving research efficiency and reducing the time required for data acquisition and processing.
Ixchel Faniel and Ann Zimmerman (University of Michigan) argued that research into the increasing scale of data sharing and reuse has focused too narrowly on data quantities. They identified at least three other issues that should be addressed: broader participation, in terms of both interdisciplinary and citizen science; increases in the number and types of data intermediaries (archives, repositories, etc.); and increases in the number of digital products that contain data.
MacKenzie Smith (MIT Libraries) and Kevin Ashley (Digital Curation Centre) discussed the matter of digital library policy interoperability, defined as the ability to compare another organisation's values and goals with one's own in order to conduct business with them. The DL.org Project conducted a survey of digital libraries. While all respondents had policies of some description, attempts to harmonise them with those of other libraries had only been made in the areas of preservation, access, collection development and metadata (and not, for example, authentication or service level agreements).
Kate Zwaard introduced the Federal Digital System (FDSys) in use at the US GPO. It is a content management system, preservation system and search engine designed to cope with the specific demands of government publications. For example, people generally need to search for specific information rather than particular documents, but a naïve full-text search would not take into account the repetition between different editions of a document; FDSys therefore uses a faceted interface to support quick filtering, and document relationships to provide automated navigation.
David Walling described how the Texas Advanced Computing Centre is using iRODS to automate the extraction of metadata from archaeology data collections held by the University of Texas' Institute of Classical Archaeology. In fact, iRODS delegates most of the work to an extractor script written in Jython. This script not only uses the File Information Tool Set (FITS) to extract metadata from individual files, and a specialist tool to retrieve metadata from the existing Archaeological Recording Kit (ARK) system, but also parses file and directory names for additional metadata.
Michael Lesk explored the problem of how to integrate social science data automatically. Combining data from two studies is not as simple as translating the terms of one study into those of another: the problem lies not just in synonymy but also in ambiguity, difference of emphasis (e.g. warfarin as drug, versus warfarin as poison) and difference in how a question has been asked. Ontologies can go some way to help, but it is hard to make accurate mappings between them when the terms are so terse, and when they view concept space from very different perspectives. The semi-automatic approach seems more promising, that is, computers using ontological mappings to find possible correspondence between datasets, and leaving humans to decide whether and how to combine the data.
Huda Khan introduced Cornell University's Data Staging Repository (DataStaR). The compelling feature of this repository is that it uses an RDF triple store to hold metadata for the data files deposited into the system. This makes metadata entry highly efficient, as assertions applicable to many items (e.g. the contact details for an author, the funder for a project) need only be entered once. There were some problems to overcome, though, such as how to deal with an author having different roles in different projects, or how to apply access restrictions to metadata. The solution was to use named private graphs: sets of triples that 'belong' to a particular dataset, overriding the public triples in the database; hiding the dataset also hides its private graph.
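The override behaviour of private graphs can be pictured with a small lookup sketch. It is not DataStaR's code and does not use a real triple store: triples are modelled as plain dictionaries keyed by (subject, predicate), and the function name, the identifiers and the 'hidden' set are assumptions made for illustration.

```python
def resolve(subject, predicate, dataset, public, private_graphs, hidden=frozenset()):
    """Look up the value of (subject, predicate) in the context of a dataset.

    A triple in the dataset's private graph overrides the public triple.
    If the dataset is hidden, its private graph is skipped entirely.
    """
    key = (subject, predicate)
    private = {} if dataset in hidden else private_graphs.get(dataset, {})
    return private.get(key, public.get(key))


# Hypothetical public metadata, entered once and shared by all datasets
public = {
    ("author:smith", "email"): "smith@example.org",
    ("author:smith", "role"): "principal investigator",
}
# Hypothetical private graph: the same author has a different role in dataset-B
private_graphs = {
    "dataset-B": {("author:smith", "role"): "data collector"},
}

print(resolve("author:smith", "role", "dataset-A", public, private_graphs))
print(resolve("author:smith", "role", "dataset-B", public, private_graphs))
print(resolve("author:smith", "email", "dataset-B", public, private_graphs))
```

The shared contact details are stored once in the public graph, while the role that differs for one dataset lives only in that dataset's private graph.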
Catharine Ward (University of Cambridge) outlined how the Incremental Project went about assessing the training and support needed by researchers at Cambridge and Glasgow with regard to data management, and how the project is seeking to address these needs through guidance documents, courses, workshops and one-to-one advice.
Peter Botticelli (University of Arizona) and Christine Szuter (Arizona State University) talked about the challenges of and opportunities for teaching digital curation. Three courses were given as examples: the University of Arizona's graduate certificate programme in digital information management (DigIn); Clayton State University's Master of Archival Studies programme; and Arizona State University's Scholarly Publishing graduate certificate programme.
Lisa Gregory (State Library of North Carolina) gave a candid assessment of how the DigCCurr I master's course at the University of North Carolina, Chapel Hill prepared (or did not do quite enough to prepare) her and fellow student Samantha Guss for their current library positions.
Felix Lohmeier (State and University Library, Göttingen) and Kathleen Smith (University of Illinois) introduced the TextGrid Virtual Research Environment, consisting of the TextGrid Laboratory application, a unified interface to a collection of research tools, and the TextGrid Repository, a grid-based archive providing long-term storage, access and preservation for research data.
Martin Donnelly (DCC/University of Edinburgh) presented findings about researchers' perspectives on data curation issues, gained from observing project meetings and interviewing key members of the team working on the MESSAGE Project. Another set of case studies had been performed by the DCC on behalf of the Research Information Network and the National Endowment for Science, Technology and the Arts. Angus Whyte and Graham Pryor (DCC/University of Edinburgh) explained what these case studies had revealed about the perceived benefits of open data among researchers in astronomy, bioinformatics, chemistry, epidemiology, language technology and neuroimaging.
Running software on an emulated operating system is an attractive preservation technique, but often overlooked is the matter of system library dependencies. Windows programs in particular are shy of declaring which versions of which system libraries they require. This is not usually a problem: common Windows libraries do not change much within a single version of the operating system, and installer programs usually bundle unusual libraries. The problem comes when installing multiple programs in a single emulated OS, as some programs attack each other, or use conflicting versions of the same library. Aaron Hsu argued that what is needed is, in short, a system like a Linux software repository for Windows programs. As a first step towards building such a database, he and his colleagues have built a tool for extracting DLL dependencies from archives of Windows software. It runs, somewhat ironically, on Linux.
Michael Sperberg-McQueen presented a technique for testing the quality of a translation from one XML format to another, similar in philosophy to techniques for 'proving' software. Put simply, a document is written in the source format, and from it a set of sentences is derived, each sentence being an inference licensed by the occurrence of the markup in the document. The document is then translated into the target format, and a second set of sentences derived. One can check for information loss by checking that every sentence in the source set follows from the translation set of sentences; equally, one can check for the introduction of noise by checking that every sentence in the translation set follows from the source set of sentences. Sperberg-McQueen admitted these were costly operations, and probably only justified for critical format translations; there are also complications with applying the technique, such as sentences that properly apply to just one of the two documents, or pairs of formats whose semantics overlap in messy ways. Nevertheless, Sperberg-McQueen was confident that the technique could be extended to all digital formats.
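The two checks can be sketched with ordinary sets, under a strong simplifying assumption: a sentence is taken to 'follow from' a set only if it appears in that set verbatim. The technique as presented relies on logical inference, which this sketch does not attempt; the function name and the sample sentences are hypothetical.

```python
def compare_translation(source_sentences, target_sentences):
    """Return (lost, noise) for a format translation.

    lost:  sentences licensed by the source document but not the translation
    noise: sentences licensed by the translation but not the source document

    Assumption: entailment is approximated by exact set membership.
    """
    source = set(source_sentences)
    target = set(target_sentences)
    return source - target, target - source


# Hypothetical sentences derived from the markup of each document
source = {
    "node-1 is a title",
    "node-2 is a paragraph",
    "node-2 is emphasised",
}
target = {
    "node-1 is a title",
    "node-2 is a paragraph",
    "node-2 is a quotation",
}
lost, noise = compare_translation(source, target)
print("lost:", sorted(lost))
print("noise:", sorted(noise))
```

A translation with no information loss and no noise would yield two empty sets; replacing the membership test with a real inference step is where the cost that Sperberg-McQueen acknowledged would arise.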
Maria Esteva demonstrated a tool for visualising at a glance the properties of entire collections of records. Hierarchies of directories are represented as nested rectangles within the visualisation; the size and colour of each rectangle represent different properties depending on the view. For example, in the general view, the size of a rectangle corresponds to the number of files within the corresponding directory. Each rectangle has three (or fewer) coloured areas: an outer black border, the area of which corresponds to the number of files of unrecognised format; within that, a white border, the area of which corresponds to the number of files in a format that is recognised but without a sustainability score; and a central square, the area of which corresponds to the number of remaining files, and the colour of which represents the average sustainability score for the formats used. A selector view allows a curator to configure the visualisation to display different statistics (e.g. file size rather than frequency).
The conference was brought to a close by Stephen Friend, who in his keynote address argued that the drug discovery process must, and now can, undergo a shift of paradigm and scale similar to that seen in the physical sciences during the Renaissance. It currently takes 5–10 years to develop a drug and get it approved, at a cost approaching US$2 billion; and even then, approved cancer drugs only work in a quarter of cases. Changing this is not simply a matter of constructing clinical trial cohorts more precisely; the whole paradigm of drug development needs to move away from cancelling symptoms to making precise corrections to the operation of the body.
With machines now capable of sequencing genomes within a couple of hours, at a cost of a few hundred dollars, we are in a better position to do this than ever before. Gene interactions are complex – there are buffers to make genomes resilient to small changes – so linear maps are not enough: we need to understand genes as control networks. Sage Bionetworks is an enabling forum for research in this area; one of its key aims is to build up a data commons from contributions of scientists around the world. For example, genomes and data are being collected from clinical trials trying to identify a biomarker to indicate if a patient would fail to respond to approved cancer drugs. Further genome/drug response data are being collected from the control (i.e. established drug) and placebo arms of industry clinical trials.
Stephen identified several success factors from this work. Hoarding data harms everyone's results: the Structural Genomics Consortium collected protein structures, asking that they be made public domain, and as a consequence managed to get high-quality research results far more quickly than would otherwise have been possible. This pattern has been repeated by the Sage Federation: researchers working with their pilot commons of ageing, diabetes and Warburg data managed to publish three Nature papers in four months. Furthermore, it is counter-productive to thrash out standards and ontologies prior to collecting data; it is much easier to construct ontologies once a substantial body of data has been collected. Lastly, clinical and genomic data and models benefit from being handled in the same way as software, that is, with versioning, tracked branching and merging, automated workflows, and so on.
Stephen concluded by giving an overview of the SageCite Project, which is bringing together the Sage data commons, Taverna workflows (via the myExperiment community), digital object identifiers (via DataCite) and journals to create a system of persistent identification for data alongside a credit system aligned to this new, more effective way of conducting science.
The International Digital Curation Conference seems to get bigger and better each year. This year was no exception: at the last count there were over 270 delegates, more than ever before. The inspirational nature of Chris Lintott's opening keynote set the tone for the entire conference; while there was no loss of realism about the size of the task facing data curators, especially in difficult economic times, there was nevertheless a palpable positivity about what has already been achieved and what is now within reach. It was notable that the emphasis this year had shifted away from long-term preservation and towards the more contemporaneous aspects of curation, in particular the removal of technical, legal and procedural barriers to integrating, refining and using data, for both professional and citizen scientists.
The NSF data management plan mandate was not quite the elephant in the room that Clifford Lynch described. It did cast a long shadow over proceedings, but rather than shy away from it, my impression was that people recognised it as an opportunity to push the curation curriculum agenda, in terms of both researchers and professional curators. Which is not to say people were not daunted by the rapidity with which action must be taken.
All in all, I came away from the conference enthused about the place digital curation has taken within mainstream academia, and looking forward to attending again next year.
The conference was opened up for remote participation as never before, so as well as being able to browse through the slides and posters on the conference Web site, one can also view recordings of the keynote addresses, peruse specially recorded interviews with some of the speakers, and sift through the many tweets posted about the event.