Alex Ball reports on a workshop on practical data citation issues for institutions, held at the British Library, London, on 8 March 2013.
On Friday, 8 March 2013, I attended the fifth in the series of DataCite workshops run by the British Library. The British Library Conference Centre was the venue for this workshop on the theme 'Making Citation Work: Practical Issues for Institutions'. I counted myself lucky to get a place: the organisers had had so much interest they had started a reserve list for the event. I could believe it as it was standing room only at one point, though an awkwardly placed pillar may have contributed to that.
Following the words of welcome from Caroline Wilkinson, the day consisted of three presentations and an extended discussion session.
Working with DataCite: A Practical Guide
Elizabeth Newbold and Caroline Wilkinson, British Library
For the benefit of anyone new to DataCite, Elizabeth Newbold gave an introduction to the organisation, its motivations and objectives. DataCite was formed to provide a robust identifier scheme for datasets, and to achieve this it opted to become a Digital Object Identifier (DOI) registration agency. Its members are mostly national and large academic libraries. Each member acts as the DOI Allocating Agent for its own country or, in the case of Germany and the US, a sector within a country. DataCite members such as the British Library (BL) do not work directly with researchers but with data centres and institutional data repositories. This is because DOIs are meant to signify a long-term commitment to preserving data and making them available, something that can only be provided at an organisational level.
The BL negotiates a contract with each of its client organisations. While these contracts are bespoke they have some common features: a three-year term, and commitments from the client to provide data landing pages containing the mandatory DataCite metadata, to ensure DOIs continue to point to the right landing pages, and to provide persistent access to data as appropriate. Clients can demonstrate these commitments through preservation plans, data management policies, service level agreements, mission statements, or similar.
Caroline Wilkinson explained how client organisations mint and manage their DOIs using the DataCite Metadata Store. The Metadata Store has both a Web interface and an application programming interface (API); the latter is useful for batch operations and for integrating DOI management into local systems. On the day of the workshop, a new video had just been released which demonstrated how to use the Metadata Store. A key point to note was that a dataset cannot be given a DOI without the client supplying at least the mandatory DataCite metadata.
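As an illustration of that last point, a local system could check a record for the mandatory properties before attempting to mint a DOI. The following is a minimal sketch, not part of the Metadata Store itself: it assumes the five mandatory properties of the DataCite Metadata Schema (identifier, creator, title, publisher, publication year), and the field names, function names and example DOI are hypothetical.

```python
# Hypothetical pre-flight check run locally before a DOI is requested.
# Field names are illustrative; they are not the Metadata Store's own.
MANDATORY_FIELDS = ("identifier", "creator", "title", "publisher", "publication_year")


def missing_mandatory(record):
    """Return the mandatory fields that are absent, empty or blank."""
    missing = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        # Treat None, empty collections and whitespace-only strings as missing.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def ready_to_mint(record):
    """True only when every mandatory field has a value."""
    return not missing_mandatory(record)


# Example record with an illustrative (not real) DOI.
example = {
    "identifier": "10.5072/example-dataset",
    "creator": ["Researcher, A."],
    "title": "Example survey data",
    "publisher": "Example University",
    "publication_year": 2013,
}
```

A batch job could call `missing_mandatory` on each record and report the gaps to the depositor, rather than discovering them only when registration is refused.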
Preparing the Repository for DOIs (or Building One from Scratch)
Tom Parsons, University of Nottingham
The ADMIRe Project at the University of Nottingham has been gathering requirements for the institution's forthcoming data repository. Some requirements come directly from the IT policies in place at Nottingham, but as Tom Parsons explained, the project has focused on understanding the needs of research staff. It therefore conducted a survey and a series of interviews and focus groups.
The survey revealed some considerable difficulties, not least that 60–77% of the data assets known to respondents had no associated metadata. There was also a low compliance rate with funder requirements to share data. Digging deeper, the focus groups revealed that researchers were most concerned about file storage, metadata, data identifiers, and standard vocabularies.
The architecture of the data repository will be based on the micro-services approach promoted by the California Digital Library. Tom picked out two micro-services of particular interest: the 'Tag' service and the 'ID' service. The minimal metadata to be demanded by the Tag service closely mirror those demanded by DataCite: creator, title, publisher (i.e. the university), publication year, identifier (assigned), subject, research grant code, location, associated research paper. The ID service will run two schemes in parallel, one internal and one external. DOIs were selected for the external scheme, as they explicitly satisfy EPSRC's fifth Expectation and are familiar to researchers through their scholarly journal publications. DOIs will only be registered on request; some of the University's data assets have sensitive metadata, and thus cannot be registered with DataCite.
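The behaviour described for the ID service can be sketched roughly as follows. This is an illustration only, not the ADMIRe design: the class and method names, the internal identifier format and the DOI prefix are all hypothetical, and the code merely constructs a DOI string; it does not register anything with DataCite.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

DOI_PREFIX = "10.5072"  # hypothetical prefix, for illustration only


@dataclass
class DatasetRecord:
    title: str
    sensitive_metadata: bool = False
    internal_id: Optional[str] = None  # internal scheme: always assigned
    doi: Optional[str] = None          # external scheme: only on request


class IdService:
    """Runs an internal and an external identifier scheme side by side."""

    def __init__(self, prefix=DOI_PREFIX):
        self._prefix = prefix
        self._counter = count(1)

    def assign_internal(self, record):
        # Every data asset gets an internal identifier exactly once.
        if record.internal_id is None:
            record.internal_id = f"ds-{next(self._counter):06d}"
        return record.internal_id

    def request_doi(self, record):
        # Assets with sensitive metadata cannot be registered externally.
        if record.sensitive_metadata:
            raise ValueError("sensitive metadata: DOI registration refused")
        internal = self.assign_internal(record)
        if record.doi is None:
            record.doi = f"{self._prefix}/{internal}"
        return record.doi
```

Keeping the internal identifier independent of the DOI means that every asset can be tracked locally, whether or not its metadata can be made public.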
With the requirements for the data repository now understood (each micro-service is scoped by around 10–15 requirements), development work was scheduled to begin on 27 March 2013.
Engaging with Researchers at Exeter
Gareth Cole, University of Exeter
The Open Exeter Project is developing policies and a technical infrastructure for research data management. As Gareth Cole was keen to stress, a cornerstone of the project is how it has engaged with multiple stakeholders: senior management, academic staff, post-graduate researchers, central and embedded IT support staff, library staff, administrators and so on. This engagement was carried out both through dedicated channels, such as a policy task-and-finish group and a DAF (Data Asset Framework) survey, and through existing channels such as departmental seminar series, staff meetings and college workshops.
The project team is confident this programme of engagement has given them an accurate picture of current practice and of stakeholder requirements for both the policy and infrastructure strands of its work. It is also giving researchers a sense of ownership over the project deliverables, making it more likely they will abide by the policies and use the infrastructure in future.
Language proved to be a problem. Disciplines at the Arts and Humanities end of the spectrum tended not to think of their 'digital objects' as data. Some researchers only recognised data sharing in the context of active data, not the archival context.
The University of Exeter has now approved the Open Access Research and Research Data Management Policy for PGR Students developed by the project. The policy specifies that data must be securely preserved and registered with the University's data repository, but does not specify a persistent identifier scheme.
Questions and Answers
The discussion section of the workshop was chaired by Caroline Wilkinson and led by a panel consisting of Michael Charno, Archaeology Data Service (ADS) at the University of York, David Boyd, University of Bristol, and Sarah Callaghan, British Atmospheric Data Centre (BADC) at STFC. The ADS has registered the most DOIs of any UK repository, while the University of Bristol was the first UK Higher Education institution to mint a DOI.
Sarah said BADC had chosen DOIs as an identifier scheme because of the reputation they had among researchers as a stamp of authority, and because they were seen as more trustworthy than BADC's existing catalogue page URLs. David said that for the University of Bristol, using DOIs was a sign of its commitment to maintain the infrastructure set up by the Data.Bris project; it also helped that DOIs, more so than other identifiers, were well known and understood among researchers. Michael explained that the closure of the Arts and Humanities Data Service (AHDS) served as a warning to ADS about the vulnerability of domain names, and DOIs seemed like a good way to introduce a layer of indirection. As archaeology is a field with a culture of reuse, it is easy to make the case to researchers for data citation as a way of promoting datasets.
There was some discussion about the relative merits of DOIs and plain Handles (DOIs are a special type of Handle identifier). It was emphasised that DOIs have greater mind share and recognition among researchers and publishers, and an additional layer of governance in the form of the International DOI Foundation. DOIs have a different 'implicit contract': they imply a static version of reference, while Handles can happily be used for more dynamic or ephemeral entities. DOIs also have additional services associated with them, such as DataCite's Metadata Store for data discovery, made possible through the enforcement of a metadata standard.
On the old chestnut of institutional versus subject-based repositories, there was a general consensus that the two can co-exist, with the institutions supporting subjects that do not have their own repositories. Michael noted that the ADS is happy to prepare and package submitted data and send them back to institutions to archive. Where a dataset has two (or more) possible homes, the consensus was that the metadata should be held in each place but only one repository should hold the actual data (with this location recorded by the others). It is the repository that holds the data that should manage the DOI registrations.
A variation on this was introduced into the conversation by Robin Rice, University of Edinburgh: what about data that were being hosted in a custom, dedicated repository by a university department? Could the institutional repository register DOIs on behalf of the departmental repository? Should it ingest copies of the data and register those as the version of record, despite the departmental repository providing a better user experience? The ensuing discussion threw up some interesting points. The contract would be between the BL and the university, not the institutional repository, so the university could have separate prefixes for each of its repositories. If the departmental repository had to close because the researcher running it had moved on, the DOIs could follow the researcher to a new institution, or the institutional repository could ingest the data at that point and take over control of the DOIs. After further probing, it began to look as though DOIs might not be the best solution for the use case Robin had in mind.
Several participants were concerned about the business case for institutions subscribing to DataCite. David reassured them that the annual cost was reasonable, not too dissimilar to that of a journal subscription, and there had been no problems making the case for it at Bristol. Sarah added that the subscription fee was dwarfed by the staff costs associated with ingesting data into a repository.
The panel members were asked if they had policies on which sorts of object could receive DOIs. At ADS, the matter was decided by the types of object for which its catalogue system could generate landing pages. At Bristol, the current policy is that DOIs are given to the data underlying published papers. If a dataset is given a DOI but later needs to be changed in the light of a peer reviewer's comments, a new version has to be registered with a new DOI. At BADC, DOIs are used to mark frozen datasets that have high-quality metadata.
Sarah explained that, in terms of granularity, the key thing is that datasets marked with a DOI form a scientifically meaningful, complete entity. This might mean everything from a project is kept together as a single object, or the outputs might be separated out into distinct but related objects. While it would be possible to mint DOIs at several different levels of granularity – one for a dataset and one each for several subsets, say – Sarah warned this might dilute the apparent impact of the dataset by diverting some of its citation count to its subsets or supersets.
A few additional matters were raised and dealt with fairly quickly. One participant's concerns about use of the DataCite API at scale were allayed. It was clarified that dataset landing pages need to contain a statement about accessing the data, but need not provide a link for direct download. Michael explained how the DOI resolver supports content negotiation for DataCite DOIs: in other words, it can be coaxed into telling you, say, the MIME type of a dataset instead of redirecting you to the landing page. It was also pointed out that if a repository has to maintain several snapshots of a dynamic dataset, each with a different DOI, this does not mean data have to be duplicated across all the snapshots: the snapshots could be assembled on demand from a single master corpus or sequence.
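The final point, about assembling snapshots on demand, can be illustrated with a small sketch. It assumes the master corpus is an append-only sequence, so that each snapshot need only remember how many records existed when it was frozen; the class name, method names and DOIs are hypothetical.

```python
class SnapshotStore:
    """Holds one master sequence and serves frozen views of it by DOI."""

    def __init__(self):
        self._master = []      # append-only sequence of records
        self._snapshots = {}   # DOI -> number of master records included

    def append(self, record):
        # New data only ever extend the master sequence.
        self._master.append(record)

    def freeze(self, doi):
        # A snapshot stores a cut-off point, not a copy of the data.
        if doi in self._snapshots:
            raise ValueError(f"{doi} already identifies a snapshot")
        self._snapshots[doi] = len(self._master)

    def resolve(self, doi):
        # Assemble the snapshot on demand from the master sequence.
        return self._master[: self._snapshots[doi]]
```

Each DOI continues to resolve to exactly the records that existed when it was frozen, while the data themselves are stored only once. A dataset whose existing records can be edited or deleted would need a richer scheme, such as versioned records.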
One of the organisers confessed to me they were unsure the format of the workshop would work: it is rare to have so much time devoted to free discussion. In this case I think it paid off. As the reader will gather from the above (incomplete) summary, it gave participants the chance to air a wide variety of concerns and solicit advice on some knotty real-world problems. Even so, I suspect that the most valuable thing to come out of this workshop, at least for institutional data repositories, is the information that was given about those all-important DOI registration accounts and associated contracts. I can see why such information has been slow to enter the public arena, but I cannot help feeling that this has kept potential clients away. If nothing else, I hope this workshop has provided reassurance to institutions who wanted to know about being able to mint DOIs, but were too afraid to ask.
- British Library DataCite workshops http://www.bl.uk/aboutus/stratpolprog/digi/datasets/dataciteworkshops/
- DataCite Metadata Schema for the Publication and Citation of Research Data, Version 2.2, July 2011 http://dx.doi.org/10.5438/0005
- British Library, DataCite: Information for Potential Clients http://bit.ly/DataCiteFAQ
- ADMIRe Project blog http://admire.jiscinvolve.org/wp/
- Abrams, S., Kunze, J., Loy, D., “An Emergent Micro-Services Approach to Digital Curation Infrastructure”, International Journal of Digital Curation 5(1): 172-186, 2010 http://dx.doi.org/10.2218/ijdc.v5i1.151
- EPSRC, Expectations http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
- Open Exeter Project blog http://blogs.exeter.ac.uk/openexeterrdm/
- University of Exeter, Open Access Research and Research Data Management Policy for PGR Students http://hdl.handle.net/10036/4279
Alex Ball is a Research Officer working in the field of digital curation and research data management, and an Institutional Support Officer for the Digital Curation Centre. His interests include Engineering research data, Web technologies and preservation, scientific metadata, data citation and the intellectual property aspects of research data.