
Developing Infrastructure for Research Data Management at the University of Oxford

James A. J. Wilson, Michael A. Fraser, Luis Martinez-Uribe, Paul Jeffreys, Meriel Patrick, Asif Akram and Tahir Mansoori describe the approaches taken, findings, and issues encountered while developing research data management services and infrastructure at the University of Oxford.

The University of Oxford began to consider research data management infrastructure in earnest in 2008, with the ‘Scoping Digital Repository Services for Research Data’ Project [1]. Two further JISC (Joint Information Systems Committee)-funded pilot projects followed this initial study, and the approaches taken by these projects, and their findings, form the bulk of this article.

Oxford’s decision to do something about its data management infrastructure was timely. A subject that had previously attracted relatively little interest amongst senior decision makers within the UK university sector, let alone amongst the public at large, was about to acquire a new-found prominence. In November 2009, the email archives of researchers within the Climate Research Unit at the University of East Anglia were compromised and leaked to the Web by climate change sceptics, alongside commentary casting doubt upon the scientific integrity of the researchers involved. The story was seized upon by the national press and entered the public consciousness [2]. Although subsequent enquiries have exonerated the researchers from scientific malpractice, criticism has been levelled at their data management practices. The report of Lord Oxburgh’s scientific assessment panel in April 2010 observed that the researchers:

should have devoted more attention in the past to archiving data and algorithms and recording exactly what they did. At the time the work was done, they had no idea that these data would assume the importance they have today.

The report adds that ‘pressing ahead with new work has been at the expense of what was regarded as non-essential record keeping’ [3].

The case brings home the importance of managing one’s research data well. The sympathy many researchers undoubtedly felt for their colleagues at the Climate Research Unit is unlikely to stretch indefinitely into the future, especially if, as seems likely, funding bodies start to insist, rather than simply recommend, that research paid for by public money is open and accountable. But to what extent should researchers be expected to redirect their time from research to data management? Clearly there is a cost involved, and one which explains the reluctance to devote resources to what might look like ‘non-essential record keeping’. The value of researchers to their institutions is measured in publications, not orderly filing cabinets.

In practice, good data management requires the input of the data creators, as they are usually the only ones in a position to accurately document their methods and outputs. The researchers’ university can, however, play its part in ensuring that this is as easy as possible and that documentation is of an appropriate standard. At present, few universities do. At the very least, infrastructure needs to be available to enable researchers to meet the requirements laid down by funding bodies.

In some academic disciplines cross-institutional data repositories and standards have already been established. It is already common practice for data to be shared between groups working in astronomy or crystallography, for instance, despite the competitive pressures that exist between institutions [4]. In other fields, funding bodies have established subject repositories with data curation responsibilities beyond institutional boundaries. The UK Data Archive at the University of Essex is one example, whilst the Natural Environment Research Council (NERC) supports a network of data centres across its disciplinary areas [5].

But if researchers and funders can set up their own infrastructures for managing data, why should universities take on this responsibility? Firstly, especially in the UK since the demise of the Arts and Humanities Data Service (AHDS), there are questions about the long-term sustainability of such services. Universities have diverse funding sources and have a good track record of long-term information preservation (over 800 years in the case of Oxford), whereas discipline-specific data centres are less protected from the whims and politics of funding agencies and governments, or so the argument goes. Secondly, there are legal issues to consider. Universities such as Oxford take ownership of the data produced by their researchers (up to a point), and therefore have responsibilities for it. The exact nature of any given university’s intellectual property rights regarding research data varies, but at a purely practical level it is likely to prove embarrassing for a university if the data produced by its researchers cannot withstand reasonable scrutiny. Thirdly, there are issues relating to the research life-cycle. Researchers need to start considering data management questions from the inception of any new project, before even making a funding bid. Service providers within the university are arguably better placed to give advice and support at this very early stage than data centres, whose primary concern has traditionally been with curating and disseminating the data that they ingest later in the cycle.

Before going further, it is perhaps worth indicating what is meant here by ‘data management’. Data management means different things to different people. Data repositories tend to think of it primarily in terms of preservation and curation, but researchers are more likely to associate the term with structuring data in a database, or with the organisation of files and folders. At its broadest, research data management involves all the processes that information from research inputs undergoes as it is manipulated and analysed en route to becoming a research output. Research inputs may include everything from journal articles, through 15th-century manuscripts, to images generated by electron microscopes. Research notes, whether hand-written or electronic, are also an essential part of this research process.

It is this broadest definition of research data management that informs our approach to data management infrastructure. The data filing system on the desktop or in the Web 2.0 ‘cloud’ should be regarded as a continuation of the data filing system on the desk, the study floor, or occasionally the cabinet. Part of the aim of instilling good practice in data management is to expose the connections between heterogeneous data sources that more often than not exist only in the mind of the researcher. To this end, research data management has much in common with the management of private archives, except that we would hope that management takes place at the beginning of the research career, rather than towards the end (or indeed post-mortem).

Institutional Strategy

The overall approach taken by the University of Oxford to the development of infrastructure, policy, and business planning to support research data management is one that is consistent with the University’s adherence to the principle of subsidiarity: that decisions should be taken at the lowest appropriate level, which, in the context of research, would tend to be by the researcher, research group or academic department, rather than being imposed by central service providers or decision-making bodies.

Given that Oxford has a highly federated organisational structure, however, it is recognised that an element of central coordination is essential to the development of institutional strategy. Within Oxford the lead has been taken by the Computing Services (including the Office of the Director of IT). At other institutions, it may be the library or research administration services. All are likely to bring their own emphases (libraries focusing on preservation and open access; IT services caring about infrastructure to support day-to-day working; research services instinctively supporting funding body compliance).

The centrally-coordinated activities at Oxford have recognised these different priorities by trying to ensure that support for research data management is a distributed, ‘multi-agency’ activity, including, of course, researchers and research groups. Indeed, each project has encompassed a significant element of end-user requirements gathering, either working with specific research groups, an academic division, or across many disciplines.

The overall activity is embedded within the IT governance structure, reporting to the ICT Subcommittee of the Policy and Resources Allocation Committee (PRAC ICT or ‘PICT’) and the Research Information Management Sub-Committee of the University’s Research Committee. Both sub-committees are chaired by pro-vice chancellors. In 2009 PICT approved a statement of institutional commitment to developing services for the management of research data:

The University of Oxford is committed to supporting researchers in appropriate curation and preservation of their research data, and where applicable in accordance with the research funders’ requirements. It recognises that this must be achieved through the deployment of a federated institutional data repository. This repository has to be supported by a suitable business model, and where possible funded through full economic cost recovery, so that the University can guarantee that the data deposited there will be managed over the long term. The data repository will be a cross-agency activity developed and supported by a number of departments within the University and will build, as far as possible, on existing services, including the Oxford University Research Archive (ORA). It will be overseen by a Steering Group which reports to the University Research Committee. The management and curation of research data will be addressed in cooperation with specialist agencies, research funders and other institutions in the UK and internationally. Oxford is committed to playing a significant role within the foreseen UK Research Data Service Pathfinder activities [6]

JISC funding has played an important role in assisting Oxford in making concrete the above vision, and it will also help ensure that the outputs of the projects, including software, training materials, and knowledge generated, will be shared with other institutions.

The development of research infrastructure takes time, and it will still be several years (funding permitting) before the University is able to claim an integrated suite of services to support research data management. This article describes the approach taken so far, the progress made, and the lessons learned.

Scoping Digital Repository Services for Research Data Management

The University of Oxford’s initial ‘Scoping Digital Repository Services for Research Data’ Project was an intra-institutional collaboration between members of the Office of the Director of IT, Computing Services, Library Services and the Oxford e-Research Centre. Throughout 2008 the study embarked on a range of activities to scope the requirements for services to manage and curate research data generated in Oxford.

Thirty-eight interviews were conducted, covering a cross-section of departments and researcher roles, and exploring data management practices and researchers’ needs at Oxford. In addition to the interviews, a data management workshop was staged to further explore and confirm requirements. The interviews and the workshop revealed a wide variety of data management practices, ranging from the consistent and proficient to rather more ad hoc activities conducted by researchers with little support and few resources [7]. The extraordinarily good response from researchers suggested that the issues were of great interest to many scholars.

These initial requirements-gathering exercises demonstrated that whilst aspects of data management are common across academic divisions, there are also requirements specific to particular research domains. Researchers in the life sciences, for instance, can generate large amounts of digital data from instruments, which in many cases have a short lifespan. Humanities data, on the other hand, can be of a very different nature, typically having a much longer lifespan but without the sheer size issues. Requirements for institutional data services can likewise differ: many humanities researchers require access to sustainable database provisioning systems to create, analyse, and share their data, whilst life sciences researchers place more emphasis on secure storage for large volumes of data and the capacity to transport the data quickly and reliably for analysis, visualisation, and sharing. Obviously not all research activities within those broad disciplinary areas fit these characterisations; there is a wealth of rich detail in the data practices of every research activity. Of the data management requirements that were common across disciplines, the most evident was the need for support in the form of advice and training.

A similar exercise was undertaken with service providers in Oxford. First, a group of these central and divisional level agencies within the University were consulted to validate the researchers’ requirements and to determine the services on offer. A workshop was then organised that brought together Oxford services and data service providers from other institutions to discuss institutional roles in supporting data management. The data management and curation services framework shown below was devised to engage service units in the discussions, understand services on offer, and start drawing the division of responsibilities amongst them.

[Figure: the data management and curation services framework, mapping four layers of provision (support; infrastructure and tools; policy; business model) against data management and curation activities: data management and sharing plans; legal and ethical issues; best formats and best practice; secure storage; metadata; access and discovery; computation, analysis and visualisation; restricted sharing; data cleaning; publication; assessing value; preservation; and adding value.]

Data management and curation services framework

Many service units participated in this consultation, including Library Services, Computing Services, Research Services, the e-Research Centre and the post-doctoral training centres [8]. All had complementary expertise in many of the areas of the above framework. However, the exercise also revealed that data management support is generally provided on an ad hoc basis, and that many of the services in the framework are not offered fully or at all. Many of the service units pointed out the importance of developing institutional policy to promote research integrity, and of having business models to outline the economic and sustainability aspects of research data services. These were perceived to be key components of the data management infrastructure required at Oxford.

These requirements also formed the basis of the Oxford input to the UK Research Data Service (UKRDS) feasibility study [9], which investigated the demand for, and feasibility of, a UK-wide data management service. The scoping study also served to test the Data Audit Framework (DAF) methodology [10] through participation in the DISC-UK DataShare Project [11].

The scoping study helped to raise awareness, identify high-level priorities (including gaps in existing provision), and gain senior management support across service units. Furthermore, it helped identify and engage with a number of researchers who use and generate data, many of whom were willing to participate in further activities where their particular issues could be addressed.

Embedding Institutional Curation Services in Research

Taking the priorities identified by the initial scoping study, a project proposal was developed that combined the embedding of infrastructure services in research with policy and sustainability development. The proposal was submitted to the JISC as a preservation exemplar project.

As a result of the proposal, in Autumn 2008, research groups from the medical and life sciences (which had participated in the initial study), together with the service units, brought their complementary areas of expertise to the service of a new data curation activity at Oxford. The resulting project, Embedding Institutional Data Curation Services in Research (EIDCSR) [12], was funded by the JISC under the Information Environment Programme.

Research Groups’ Workflows and Requirements

This new data activity began with a requirements analysis phase [13]. This involved interviewing researchers (using the DAF methodology [14]) to learn about their specific data practices and workflows. The results of this exercise would feed into the development of policy and cost-price models as well as the implementation of data management tools.

The three research groups participating in EIDCSR (the Computational Biology Group, the Cardio Mechano-Electric Feedback Group, and the Department of Cardiovascular Medicine) were collaborating on an existing project informally entitled the ‘3D Heart Project’. The objective of this project was to create three-dimensional computer models of hearts upon which in-silico experiments could be conducted. In order to create these models, the researchers involved need to combine a variety of techniques, ranging from histology and magnetic resonance imaging (MRI) to image processing and computer simulation [15]. The 3D Heart Project was funded by the Biotechnology and Biological Sciences Research Council (BBSRC), whose policy requires the researchers to preserve their data and ensure it remains available and accessible for ten years after the completion of the project.

The research process begins with image acquisition within the laboratory, where histology data and two types of MRIs are generated. The histology data for each subject represents a stack of high-resolution images, each stack being up to 1.6 terabytes in size. The MRIs are not as large, and they complement the histology data by providing more detailed structural information. These images are processed to generate 3D volumes, which then go through a process of segmentation and mesh generation so that simulations of cardiac activity can finally be run.

The data management requirements from these research groups include: the capacity to move large quantities of data across the network for analysis and sharing; having access to secure storage with back-up and off-site copies; and provision of tools and support to record provenance information such as laboratory protocols, parameters and methodologies.

Data Archiving and Rediscovery

Given that the EIDCSR Project is attempting both to meet the needs of the particular research groups with whom we are working, and to establish infrastructure intended to be more broadly useful for future researchers working on projects as yet unknown, a flexible metadata schema that could accommodate the varying needs of different research groups and domains was deemed essential. It also became apparent during the early stages of the project that the metadata would need to be able to evolve during the lifespan of the archived objects if it was to remain accurate.

The data archiving and discovery side of the EIDCSR Project has two distinct software components: one that enables the project team to archive data (and any annotations added to that data); and another that allows the querying of the metadata.

The EIDCSR archiving client consists of a Graphical User Interface (GUI) via which researchers can archive datasets from their networked storage to the Computing Service’s IBM Tivoli Hierarchical File System (HFS). The GUI is a C/C++ application, using C for the business logic and Qt for the user interface. The EIDCSR client works by parsing the research project’s computer folders for data files and corresponding metadata information, recorded as ‘readme’ files. The metadata, if available, is then stored independently in a repository supported by Oxford University Library Services, whilst the datasets themselves are archived to the HFS.

The essential task of the EIDCSR client is to assign a unique identifier to individual image datasets and the corresponding metadata so that the two remain linked. It can also be used to create and add metadata relating to the project and other types of research data.

If the readme file for any given set of images is missing, or lacks core information, the client alerts the archiver and displays a metadata editing form so that the missing fields can be completed. A small minimum set of metadata needs to be added before the data can be archived. It is possible to archive the data without completing many of the metadata fields but this is likely to impact on future discoverability and therefore limit reuse. Projects that have already archived their data to HFS can also use the EIDCSR client to add or edit existing metadata.
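To make the workflow concrete, the following minimal sketch illustrates the general pattern in Python (the actual EIDCSR client is a C/C++ and Qt application, and the metadata fields and file layout shown here are hypothetical): parse a dataset folder for a ‘readme’ file, check a minimum set of core fields, and mint an identifier that keeps the dataset and its metadata linked once the two are stored separately.

    # Illustrative sketch only: the real EIDCSR client is written in
    # C/C++ with Qt; the field names and file layout are hypothetical.
    import os
    import uuid

    # A hypothetical minimum metadata set required before archiving
    REQUIRED_FIELDS = {"project", "creator", "date", "description"}

    def read_readme(dataset_dir):
        """Parse 'key: value' lines from a readme file, if present."""
        metadata = {}
        path = os.path.join(dataset_dir, "readme.txt")
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if ":" in line:
                        key, _, value = line.partition(":")
                        metadata[key.strip().lower()] = value.strip()
        return metadata

    def prepare_for_archive(dataset_dir):
        """Check the core fields and mint an identifier linking the
        dataset (bound for the HFS) to its metadata record."""
        metadata = read_readme(dataset_dir)
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            # The client alerts the archiver and displays an editing
            # form at this point; a sketch can only report the gap
            raise ValueError("missing core fields: %s" % sorted(missing))
        metadata["identifier"] = str(uuid.uuid4())
        return metadata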

The second software component of EIDCSR is an online search interface for querying the metadata repository to discover archived data. This will enable searching by various fields, such as project description, team members, experimental details, or even publications arising from the data (assuming the relevant metadata has been included). Once a relevant project or dataset has been located, the searcher may then send a request to that project’s principal investigator or data manager to retrieve the archived data. Provided that the request is approved and there are no embargoes or other access restrictions on the data, it may then be retrieved from the HFS to an appropriate accessible location via the EIDCSR or standard HFS client.

EIDCSR is at present working with researchers and library staff to develop a ‘core’ set of metadata fields that is likely to be common across many research projects, whilst also catering for the addition of custom metadata fields that can store information particularly relevant to individual research projects. This will take the form of an extensible XML schema.
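As a rough illustration of the ‘core plus custom’ approach, a record might wrap a fixed set of core elements alongside arbitrarily named project-specific fields. The sketch below shows the idea in Python; the element and field names are invented for the purpose and are not those of the schema under development.

    # Sketch of a 'core plus custom fields' record; the element names
    # are hypothetical, not the project's actual schema.
    import xml.etree.ElementTree as ET

    def build_record(core, custom):
        """Combine fixed core fields with arbitrary project-specific
        fields in one extensible XML record."""
        record = ET.Element("dataset")
        core_el = ET.SubElement(record, "core")
        for name, value in core.items():
            ET.SubElement(core_el, name).text = value
        custom_el = ET.SubElement(record, "custom")
        for name, value in custom.items():
            field = ET.SubElement(custom_el, "field", {"name": name})
            field.text = value
        return ET.tostring(record, encoding="unicode")

    print(build_record(
        {"project": "3D Heart Project", "creator": "A. Researcher"},
        {"subject_species": "rabbit",
         "mri_spacing_um": "26.4 x 26.4 x 24.4"},
    ))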

Visualisation and Annotation Tools for Large Image Datasets

Whilst the EIDCSR archiving and metadata-editing tools address some of the identified researcher requirements, the sheer size of some image data poses other problems. In the 3D Heart Project, the MRI and histology images are generated by different research groups in different locations. The images they produce then need to be sent to the researchers at the Computational Biology Group, who must rapidly evaluate their suitability for computational processing and give feedback for improving image quality. It is difficult for research collaborators to access full or partial image datasets immediately without downloading terabytes of data, which takes significant time and bandwidth. Furthermore, all three groups need to look at multiple datasets to identify areas of interest and particular physiological features.

To address this issue, the EIDCSR Project has been developing visualisation software that enables users to browse raw image datasets quickly via a Web-based system.

The initial testing of the system was conducted using the datasets generated from the 3D Heart Project. The largest dataset, which consists of 4 GB of 3D MRI (26.4 x 26.4 x 24.4 μm spacing with 1024 x 1024 x 2048 voxels) and a corresponding 1.6 TB of histology (1.1 x 1.1 x 10 μm spacing with 35000 x 20000 x 1850 voxels) obtained from a rabbit heart, was used as a test case.

The datasets, when uploaded through a browser, are partitioned and stored as quad-tree pyramids. For quad-tree pyramid generation, each high-resolution image is recursively divided into four regions until a tile size of 256 x 256 is obtained. At each step the full image is also reduced by a factor of two in each dimension to obtain the next level of the quad-tree. This process is repeated for each image until the whole pyramid is generated. A pyramid compressed with 50%-quality lossy JPEG consumes only 4% of the storage space of the uncompressed raw image volume, whilst keeping the visual quality of the compressed dataset indistinguishable from the original. The compressed dataset provides an efficient representation for displaying large images over the Internet, significantly reducing the bandwidth requirements for visualisation. For the particular test dataset, the quad-tree consists of 16 million nodes representing the whole 3D heart.
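The pyramid construction can be sketched as follows (in Python with the Pillow imaging library, purely for illustration: the tile naming is hypothetical, an 8-bit image is assumed, and the grid-based tiling here is equivalent to the recursive four-way split for power-of-two image sizes). Each level is cut into 256 x 256 tiles, and the image is halved in both dimensions between levels.

    # Illustrative sketch of quad-tree pyramid generation for one
    # image slice; assumes Pillow and an 8-bit greyscale image.
    from PIL import Image

    TILE = 256  # tile edge length in pixels

    def build_pyramid(path, out_dir):
        """Cut an image into 256 x 256 JPEG tiles at each level,
        halving the image between levels until it fits one tile."""
        img = Image.open(path).convert("L")
        level = 0  # level 0 = full resolution; higher = coarser
        while True:
            cols = (img.width + TILE - 1) // TILE
            rows = (img.height + TILE - 1) // TILE
            for row in range(rows):
                for col in range(cols):
                    box = (col * TILE, row * TILE,
                           min((col + 1) * TILE, img.width),
                           min((row + 1) * TILE, img.height))
                    # 50%-quality JPEG: the article reports roughly 4%
                    # of the raw volume with no visible quality loss
                    img.crop(box).save(
                        "%s/level%d_r%d_c%d.jpg"
                        % (out_dir, level, row, col),
                        quality=50)
            if cols == 1 and rows == 1:
                break
            img = img.resize((max(1, img.width // 2),
                              max(1, img.height // 2)))
            level += 1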

This partitioned dataset is made available through a Flash-enabled browser using a zoomable interface, similar to Google Maps, for visualisation of 2D data. The visualisation of 3D data is provided through a multi-axis view showing axial, coronal and sagittal planes. Measurement, image analysis, and administration tools are integrated into the Web application.

Policy and Cost-price Models Development

A significant component of the EIDCSR Project was to develop and implement institutional policy on data management as part of a wider programme to promote research integrity [16]. This policy would also look to ensure compliance with funders’ requirements.

The University of Melbourne began developing a similar policy in 2005, so Oxford sought to learn from their experiences, seconding Dr. Paul Taylor to assist with creating the draft. This work included interviews with academics and administrative staff, an assessment of current policy and guidance in the UK, and a second workshop attracting national participation. Outputs included the planned draft policy document and supporting recommendations, one of which was for the Research Services Office to develop a Web portal to assist academics with discovering existing services (local and national) [17] and policy for managing research data. The draft policy document is currently undergoing a consultation process.

Participation in another JISC-funded project, Keeping Research Data Safe 2 (KRDS2) [18], provided the opportunity to look into the economics of data management. Detailed cost information was gathered about the creation and local management of data by the 3D Heart Project, as well as the curatorial activities undertaken by EIDCSR. In this case study, the results showed that the cost of creating the actual research datasets, including staff time as well as the use and acquisition of lab equipment, was proportionally high, representing 73% of the total cost [19]. The costs of the admittedly limited local data management activities undertaken by the researchers on the other hand were modest, representing only 1% of the total. The curatorial activities undertaken as part of the EIDCSR Project constituted 24%. A large proportion of these curatorial costs were however derived from ‘start up’ costs: the general analysis work to capture requirements, metadata management research, the technical implementation of tools, and the policy development work, which would not be significant in future research data management activities. The remaining 2% was devoted to the already existing back-up and long-term file store service provided by Oxford University Computing Services. The costs of established curatorial services are low per dataset, and therefore it is anticipated that the cost of curation services will decrease over time and with economies of scale.

Supporting Data Management Infrastructure for the Humanities

The second JISC-funded project to arise from Oxford’s drive to develop its data management infrastructure is entitled ‘Supporting Data Management Infrastructure for the Humanities’ (SUDAMIH) [20]. SUDAMIH is complementary to EIDCSR and follows the same overall framework.

Whilst the EIDCSR Project works within a fairly narrow disciplinary area, SUDAMIH has a broader focus, considering the needs of the entire Humanities Division. The pre-proposal phase of requirements gathering identified two significant areas for development: the provision of training in the management of research data, and the development of an infrastructure service to support the creation and management of database applications in the humanities [21].

Work on the training strand involves piloting a range of materials (modules, introductory presentations, online support) with researchers at Oxford, in collaboration with the Digital Curation Centre (DCC) and related initiatives. The other strand involves the development of a ‘Database as a Service’ (DaaS) infrastructure, which will enable researchers to quickly and intuitively construct Web-based relational databases. As an online service, the DaaS will support collaboration and sharing, both within research groups and with a wider public. Whilst initially developed for humanities scholars, the aim is ultimately to develop infrastructure that can be expanded to meet the needs of other academic divisions.

Humanities Researchers’ Data and Requirements

There are over 1,500 researchers based in the Humanities Division at Oxford, working on a huge number of different projects [22]. Although there is at present a trend towards more collaborative working in the humanities, there are still many ‘lone researchers’, each working on their own areas of personal interest. It was therefore imperative for SUDAMIH to speak to a broad cross-section of the community to understand existing practices and requirements.

The project conducted thirty-two interviews in total, twenty-nine of which were with active researchers [23]. The researchers fell into two broad camps - those who clearly worked with structured data and recognised it as such, and those who did not consider themselves to be working with ‘data’ at all. Whilst all humanities researchers faced issues about structuring and storing their information in such a way as to ensure that it was at their fingertips when needed, many did not conceptualise this as working with data. From this it became apparent that the project would need to couch training in terms that would be familiar and attractive - emphasising the day-to-day problems that researchers encounter when dealing with their material, rather than talking about data management in generic terms.

Despite the huge variety of research being conducted in the humanities at Oxford, the interviews did identify a number of characteristics of humanities research data that would need to be considered when providing support. Principal among these was the fact that much humanities data has a very long life-span and that its value does not depreciate over time: a database of Roman cities will potentially be of as much use to researchers in fifty years’ time as it is today, provided it is not rendered obsolete through technological change. Humanities scholarship often aggregates to a ‘life’s work’ body of research, with any given researcher frequently wishing to go back to old datasets in order to find new information. It is therefore important that infrastructure for humanities data can guarantee preservation and curation potentially indefinitely. This of course places a huge sustainability burden on any centrally-provided infrastructure, although an infrastructure that can bear this responsibility offers a great improvement on the often ad hoc, local, and short-term preservation of data that is currently the norm.

Certain characteristics of humanities data make its re-use particularly difficult. The data is often messy and incomplete, being derived from diverse sources that were never intended to provide information in a regular or comparable format. Information is typically compiled from assorted primary texts, by diverse hands and covering different topics, with differing degrees of reliability, so a degree of interpretation is usually required in order to structure it in a format that allows analysis. Whilst the original compiler is aware of the procedure they used to do this, anyone else coming to the data later requires a significant degree of explanation or documentation to ensure they do not misinterpret it. Cleaning and documenting data is time-consuming, so any infrastructure that enables re-use must make these processes as easy and natural as possible.

The requirements-gathering exercise did identify several issues commonly encountered by humanities researchers where training and other aspects of infrastructural support could potentially make a difference.

In addition to these challenges, some researchers were keen to learn how to structure their information and data better as they felt it might have a positive effect on their ability to conceptualise their work. One philosophy lecturer commented that, ‘I do believe that our research could be enhanced by having better ways of storing information, because the way I store my thoughts makes a difference to how I use them when progressing in my thinking. I can see that improving the way I store them might help the actual thinking – apart from saving time, it might be a bit more substantial than that: having a clear view of what I’ve already done, or how my different projects interconnect, might just be heuristic in a sense.’

Data Management Training

Besides the interviews with researchers at Oxford, SUDAMIH arranged a workshop with contributions from national bodies such as the DCC and the Research Information Network in order to gain a wider perspective on data management training for the humanities [24]. The majority of the interviewees and workshop delegates agreed both that there was a need for data management training within the humanities and that there were various areas where current practices could be improved. The need for training was regarded as common to researchers at all levels, although it was felt to be particularly beneficial for graduate students. Although the 2002 Roberts’ Report [25] resulted in a substantial increase in the training offered to graduates, data management has not yet received as much attention as other areas.

The interviews indicated that the desired training fell into two broad categories: general data management issues, which include the organisation and storage of paper and electronic research materials for efficient access and re-use, and training in the use of specific data management tools, such as database packages and bibliographic software. The latter is to some extent already covered by the wide range of courses offered by Oxford’s IT Learning Programme, although there is room for expansion into some more specialised aspects (such as, for example, designing databases which can deal with complex data such as vague dates, variant spellings, and fragmentary texts). There is also substantial interest in the overlap between the general and more specific areas: many researchers would like to know more about the range of data management tools available to them, and how to select the appropriate tool(s) for a given project.

However, the interviews also revealed a significant barrier to improving data management practices: that of persuading academics to take time out of what are invariably very busy schedules for training. Data management tends to be regarded as a lower priority than more immediately beneficial skills, being important, but not necessarily urgent. In order to overcome this, advice about data management might usefully be embedded in compulsory research methods training for graduate students. It is also essential to highlight the benefits to the researcher of good data management, rather than simply presenting it as something imposed by funding bodies or university policy: efficient organisation and use of appropriate tools means less time spent hunting for information, which in turn means more time available for analysis and original thought.

Database as a Service

The DaaS system provides on-demand institutionally-hosted databases and a generic Web portal to access them. The Web portal essentially mirrors the underlying database structures, using Data Definition Language (tables, columns), and enabling data population and querying. An important rationale behind the DaaS is to allow researchers to concentrate on their own research interests rather than IT infrastructure. Managing geospatial data and multimedia content in a database can be a complicated process, and SUDAMIH shields users from such underlying complexities.

The DaaS caters for two basic routes to creating a database:

  1. Migration and porting: for researchers who have already developed a local, standalone database in one form or another and want to open it up to a wider audience.
  2. Design and creation: for researchers starting a new project who wish to straightforwardly design, implement and share a database.

Microsoft Access databases are particularly popular in the arts and humanities, due primarily to their ease of use, straightforward user interface, and the software’s ready availability on university computers. An Access database is arguably not the optimal solution for collaborative multi-user database environments, however, nor for serving a large number of simultaneous users. As research projects evolve, the limitations of Access tend to become apparent. This led to a strong requirement from users for SUDAMIH to support the migration of Access databases to other standards. The database management aspect of the DaaS supports the migration of Access databases into PostgreSQL, which enables additional functionality and brings advantages even outside the scope of the SUDAMIH Project.
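In the simplest case, the mechanics of such a migration might look something like the following sketch, which assumes the Access tables have first been exported to CSV and uses Python with the psycopg2 PostgreSQL driver; the connection details, table and columns are all hypothetical, and the DaaS itself derives the structure from an uploaded schema rather than hand-written statements.

    # Minimal sketch of loading an exported Access table into
    # PostgreSQL; connection details, table name and columns are
    # hypothetical illustrations only.
    import psycopg2

    conn = psycopg2.connect("dbname=sudamih_demo user=researcher")
    cur = conn.cursor()

    # Recreate the table structure in PostgreSQL
    cur.execute("""
        CREATE TABLE IF NOT EXISTS roman_cities (
            id       integer PRIMARY KEY,
            name     text NOT NULL,
            province text,
            founded  text  -- vague dates kept as text, not DATE
        )
    """)

    # Bulk-load the rows exported from Access as CSV
    with open("roman_cities.csv", encoding="utf-8") as f:
        cur.copy_expert(
            "COPY roman_cities FROM STDIN WITH (FORMAT csv, HEADER true)",
            f)

    conn.commit()
    conn.close()

Keeping ‘vague dates’ as text rather than a DATE column reflects the kind of humanities-specific design issue (variant spellings, fragmentary texts) raised in the training discussion above.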

The DaaS infrastructure consists of the following three main components:

  1. SUDAMIH Web Portal (Administration): for registering projects and project teams, enabling a principal investigator (who could be a lone researcher) to design a relational database via a simple Web interface, or to upload a database schema and populate it via a data dump (migration).
  2. Database Management (Business Logic): for implementing and populating databases as well as managing access based on user roles, facilitating collaboration with simultaneous multiple users.
  3. Project Web Portal (Public Interface): for providing a Web interface to each active database hosted on the DaaS.

Each registered project can have multiple databases (e.g. separate research, development, and/or production databases), but at any given time only one database can be active. SUDAMIH automatically creates a generic Web Portal for each project using the latest J2EE standards. The Project Web Portal allows data population and querying of the active database, with support for reasonably complex queries. It is the intention of SUDAMIH to develop the functionality of the Web Portal so that users can view geospatial data within the browser, and to support annotated multimedia content.
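The way a generic portal can ‘mirror the underlying database structures’ may be illustrated with a small sketch: PostgreSQL’s information_schema makes the tables and columns of the active database discoverable at runtime, from which entry forms and query interfaces can be generated. This is again a simplified illustration in Python with hypothetical credentials, not the project’s own J2EE code.

    # Sketch of schema introspection, the basis for auto-generating
    # a generic web interface; credentials are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=sudamih_demo user=researcher")
    cur = conn.cursor()

    # Discover the tables and columns of the active database
    cur.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)

    # A real portal would render forms and query builders from this;
    # the sketch simply prints the structure it would mirror
    for table, column, data_type in cur.fetchall():
        print("%s.%s: %s" % (table, column, data_type))

    conn.close()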

Conclusions

Partnerships

The University of Oxford’s institutional approach to research data management was built upon two principles: researchers being at the core of development; and the need for intra-institutional collaboration amongst service providers. It is perhaps obvious that understanding the requirements of researchers, and retaining their engagement, is critical to the longer-term success and sustainability of any institutional initiative to better support the management of research data (especially given that researchers are, to a large degree, free simply to ignore central infrastructure and go their own way if it does not meet their requirements). It may be less obvious, at least until the research process is fully understood, that the management of research data (through its notional lifecycle) requires a coordinated approach amongst service providers. It matters less which part of the organisation takes the lead on these activities than whether the relevant providers are engaged in the undertaking and have a reasonably clear sense not only of their current service provision and strengths, but also of their gaps and weaknesses.

Lessons Learnt

Throughout this article we have, with few qualifications, spoken of research ‘data’. However, it is clear both that this means different things to researchers in different fields, and that datasets can take very different forms. Understanding this is essential for engaging with research communities. The language used must be theirs, not that of the information or technology communities.

Requirements for research data management infrastructure can be specific to particular projects: some might focus on large storage resources for datasets in the range of terabytes, others might be concerned with keeping their confidential data secure, and still others with sharing their data with fellow researchers or the public at large. The experience of Oxford also shows, however, that in spite of the diversity of infrastructure requirements across different fields of research, there are aspects of data management infrastructure required by almost all. The need for data documentation (metadata), training and support, secure storage, and linking data to publications are common across disciplines.

At the institutional level, it is fundamental to have policies in place, to understand the role of the different service providers, and to arrive at a clear understanding of the costs and benefits associated with managing and curating data.

Next Steps

The EIDCSR Project is scheduled to end in December 2010, and SUDAMIH in March 2011. By that point we hope to have developed: institutional infrastructure for storing large volumes of data (based upon the existing back-up and archiving system); a basic yet extensible metadata schema to capture essential information about the data stored there; a visualisation platform that will enable researchers to access, share, and annotate very large images; a system that will enable researchers to create their own centrally-hosted databases (with contextual metadata); and a set of training materials to improve data management practices. The University’s Research Services Office will also have produced a draft institutional policy for the management of data, together with a data portal to provide guidance and point to existing data services in Oxford and elsewhere.

The priority for the institution will be to make ‘production-ready’ those deliverables from the projects for which there is an ongoing demonstrable requirement and for which there is a business plan. The IT infrastructure outputs, for example, have been developed with production in mind, embedding the software development within the relevant service teams. The development of the DaaS will also contribute to the overall plan to improve the Web delivery platform for individuals, groups and departments.

Throughout these projects Oxford has also contributed to the UKRDS planning, both the feasibility study and the subsequent ‘pathfinder’ phase [26]. Whilst funding for the UKRDS remains, at the time of writing, unclear, the contribution that Oxford has made has helped further define the institutional strategic priorities for the ongoing support of research data management activities. In particular, it has emphasised the importance of institutional leadership (comparable to pro vice-chancellor level) to ensure the full engagement of both senior management and influencers within the academic divisions. It has supported the need to implement aspects of institutional IT infrastructure, especially a federated, lightweight, extensible filestore offered and coordinated on a cost-recovery basis and interoperable with other research-support services such as SharePoint and the Digital Asset Management System. And finally, it has confirmed the need to build upon the work of SUDAMIH by extending research data management training to other divisions, especially through existing training facilitators.

There are various internal activities in progress to pursue each of these priorities so that the University might indeed fulfil its commitment ‘to supporting researchers in appropriate curation and preservation of their research data’.

References

  1. Scoping Digital Repository Services for Research Data Management
    http://www.ict.ox.ac.uk/odit/projects/digitalrepository/
  2. See for instance The Guardian’s “Hacked Climate Science Emails” feature:
    http://www.guardian.co.uk/environment/hacked-climate-science-emails
  3. Lord Oxburgh et al., “Report of the International Panel Set Up by the University of East Anglia to Examine the Research of the Climatic Research Unit”, April 2010, p. 3
    http://www.uea.ac.uk/mac/comm/media/press/CRUstatements/SAP
  4. See for instance the European Virtual Observatory http://www.euro-vo.org/pub/ and the related AstroGrid applications http://www.astrogrid.org/,
    the eCrystals repository at the University of Southampton http://ecrystals.chem.soton.ac.uk/
    and
    standards established by the International Union of Crystallography http://www.iucr.org/resources/cif/spec
  5. RCUK, “Delivering the UK’s E-infrastructure for Research and Innovation”, July 2010.
  6. Research data management within the University of Oxford
    http://www.ict.ox.ac.uk/odit/projects/datamanagement
  7. Martinez-Uribe, L., “Findings of the Scoping Study and Research Data Management Workshop”, July 2008
    http://ora.ouls.ox.ac.uk/objects/uuid%3A4e2b7e64-d941-4237-a17f-659fe8a12eb5
  8. Martinez-Uribe, L., “Research Data Management Services: Findings of the Consultation with Service Providers”, October 2008
    http://ora.ouls.ox.ac.uk/objects/uuid%3A0bcc3d57-8d20-42dd-82ac-55eab5cd682b
  9. Serco Consulting, “UK Research Data Service: Report and Recommendations to HEFCE”, December 2008 http://www.ukrds.ac.uk/resources/download/id/16
  10. Martinez-Uribe, L. “Using the Data Audit Framework: an Oxford Case Study”, March 2009
    http://www.disc-uk.org/docs/DAF-Oxford.pdf
  11. DISC-UK DataShare http://www.disc-uk.org/datashare.html
  12. Embedding Institutional Data Curation Services in Research http://eidcsr.oucs.ox.ac.uk/
  13. Martinez-Uribe, L. “EIDCSR Audit and Requirements Analysis Findings”, December 2009
    http://eidcsr.oucs.ox.ac.uk/docs/EIDCSR_AnalysisFindings_v3.1.pdf
  14. The Data Audit Framework http://www.data-audit.eu/
  15. Bishop, M. J., Plank, G., Burton, R. A. B., Schneider, J. E., Gavaghan, D. J., Grau, V., Kohl P., “Development of an Anatomically Detailed MRI-Derived Rabbit Ventricular Model and Assessment of its Impact on Simulations of Electrophysiological Function” Am. J. Physiol. Heart Circ. Physiol., February 1, 2010 298:H699-H718
    http://ajpheart.physiology.org/cgi/content/full/298/2/H699
  16. Dally, K., “Policy Preparations at Oxford”, March 2010, presentation at workshop Institutional Policy and Guidance for Research Data
    http://eidcsr.oucs.ox.ac.uk/policy_workshop.xml
  17. University of Oxford Research Data Management Portal http://www.admin.ox.ac.uk/rdm/
  18. Keeping Research Data Safe 2 http://www.beagrie.com/jisc.php
  19. Beagrie, N., Lavoie, B., Woollard, M., “Keeping Research Data Safe 2”, April 2010
    http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx#downloads
  20. SUDAMIH Project http://sudamih.oucs.ox.ac.uk/
  21. See the SUDAMIH Project Plan for more details:
    http://sudamih.oucs.ox.ac.uk/docs/sudamih_JISC_Data_Management_Oxford_dissemination.pdf
  22. As of 30th September, 2009. Figure combines faculty staff, other academic and research staff, and postgraduate students on research courses. Provided by University of Oxford Humanities Division.
  23. Wilson, J. A. J., Patrick, M., “Sudamih Researcher Requirements Report”, July 2010
    http://sudamih.oucs.ox.ac.uk/docs/Sudamih%20Researcher%20Requirements%20Report.pdf
  24. Presentations from the SUDAMIH Data Management Training For the Humanities workshop are available from the workshop Web page:
    http://sudamih.oucs.ox.ac.uk/training_workshop.xml
  25. Vitae Web page on the Roberts’ Report:
    http://www.vitae.ac.uk/policy-practice/1685/Roberts-recommendations.html
  26. UK Research Data Service (UKRDS) http://www.ukrds.ac.uk/

Author Details

James A. J. Wilson
EIDCSR and SUDAMIH Project Manager
Oxford University Computing Services

Email: james.wilson@oucs.ox.ac.uk

Michael A. Fraser
Head of Infrastructure Systems and Services Group
Oxford University Computing Services

Email: mike.fraser@oucs.ox.ac.uk

Luis Martinez-Uribe
Data Management and Curation Consultant
Oxford University Computing Services and Instituto Juan March

Email: luis.martinez-uribe@oucs.ox.ac.uk

Paul Jeffreys
Director of IT
Office of the Director of IT, University of Oxford

Email: paul.jeffreys@odit.ox.ac.uk

Meriel Patrick
SUDAMIH Analyst
Oxford University Computing Services

Email: meriel.patrick@oucs.ox.ac.uk

Asif Akram
EIDCSR and SUDAMIH Software Developer
Oxford University Computing Services

Email: asif.akram@oucs.ox.ac.uk

Tahir Mansoori
EIDCSR Visualization Developer
Oxford University Computing Laboratory

Email: tahir.mansoori@wolfson.ox.ac.uk
