This article presents some new digital library development activities which are predicated on the concept that research and learning processes are cyclical in nature, and that subsequent outputs which contribute to knowledge are based on the continuous use and reuse of data and information. We can start by examining the creation of original data, which may be, for example, numerical data generated by an experiment or a survey, or images captured as part of a clinical study. This initial process is usually followed by one or more further processes, such as aggregation of experimental data, selection of a particular data subset, repetition of a laboratory experiment, statistical analysis or modelling of a set of data, manipulation of a molecular structure, annotation of a diagram or editing of a digital image, which in turn generate modified datasets. This newly derived data is related to the original data and can be re-purposed through publication in a database, in a pre-print or in a peer-reviewed article. These secondary items may themselves be reused through a citation in a related paper, by a reference in a reading list, or as an element within modular materials which form part of an undergraduate or postgraduate course. Clearly it will not always be appropriate to re-purpose the original data from an experiment or study, but it is evident that much research activity is derivative in nature.
The impact of Grid technologies and the huge amounts of data generated by Grid-enabled applications suggest that, in the future, (e-)science will be increasingly data-intensive and collaborative. This is exemplified in the biosciences, where the growing outputs from genome sequencing work are stored in databases such as GenBank but require advanced computing tools for data mining and analysis. The UK Biotechnology and Biological Sciences Research Council (BBSRC) recently published a Ten Year Vision which describes this trend as "Towards predictive biology" and proposes that in the 21st century, biology is becoming a more data-rich and quantitative science. The trend has clear implications for data/information management and curation procedures, and we can examine these further by returning to the concept of a scholarly knowledge cycle.
A complete cycle may be traversed in either direction: for example, discrete research data could (ultimately) be explicitly referenced in some form of electronic learning and teaching materials. Alternatively, a student might wish to "roll back" to the original research data from a secondary information resource such as a published article, or from an element within an online course delivered via a Learning Management System. In order to achieve this, a number of assumptions must be made which relate largely to the discovery process but are also closely linked to the requirement for essential data curation procedures. The assumptions are:
The scholarly knowledge cycle is shown below in Figure 1.
We can take a closer look at some typical research information workflows which form a part of the cycle by focusing on a discrete domain, chemistry, and in particular draw on experimental detail from the Combechem Project, an e-Science Grid-enabled initiative in the area of combinatorial chemistry investigating molecular and crystal structures. The Combechem Project is an ideal research test-bed because it generates large quantities of varied data: electronic lab books, crystallography data and physical chemistry data, i.e. textual, numeric and 2/3D molecular structure datasets. Outputs from the project are published in fast-track "Letters" formats and as articles in peer-reviewed journals. These articles reference experimental data. In addition, "Letters" are referenced in postgraduate teaching modules.
Currently, once a research experiment in this area is finished, the initial dissemination may be via a letter or communication, followed later by a more detailed explanation in a full paper describing in-depth analysis and collating several related results. Reference data may be provided at this point, but there is unlikely to be any link back to the raw or processed data. The ability to publish data directly and the wider availability of e-prints suggest that these potentially valuable connections can be made. This linking is illustrated by an example from crystallography:
eBank UK is a new JISC-funded project which is part of the Semantic Grid Programme. The project is led by UKOLN in partnership with the Combechem project at the University of Southampton and the PSIgate Physical Sciences Information Gateway at the University of Manchester. This new initiative is set in the context of the JISC Information Environment, which aims to develop an information architecture for providing access to electronic resources for UK higher and further education. The eBank UK pilot service is a first step towards building the infrastructure which will enable the linking of research data with other derived information.
The project will build on the technical architecture currently being deployed within the context of the ePrints UK Project, which has been described in a recent Ariadne article. The architecture supports the harvesting of metadata from eprint archives in UK academic institutions and elsewhere using the OAI Protocol for Metadata Harvesting (OAI-PMH). The eBank UK Project will augment this work by also harvesting metadata about research data from institutional 'e-data repositories'. Initially this will encompass data made available by Combechem, but will include data from other sources in the longer term. Metadata records harvested from e-data repositories will be stored in the central database alongside the eprint metadata records gathered as part of the ePrints UK Project.
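To make the harvesting step concrete: OAI-PMH responses are plain XML, and the core task of a harvester is to pull identifiers and descriptive fields out of each ListRecords response. The following Python sketch parses a minimal oai_dc response of the kind such a harvester would process. The repository identifier, datestamp and record content here are invented for illustration and do not come from the eBank UK service itself.

```python
# A minimal sketch of parsing an OAI-PMH ListRecords response.
# The sample record below is hypothetical.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:ebank.example.ac.uk:dataset-42</identifier>
        <datestamp>2003-07-30</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Crystal structure determination of compound X</dc:title>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        # The OAI identifier lives in the record header;
        # the title is in the oai_dc metadata payload.
        identifier = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        records.append((identifier, title))
    return records

records = parse_records(SAMPLE_RESPONSE)
```

A real harvester would fetch each page of results over HTTP and follow resumption tokens, but the record-level parsing looks much like this regardless of the repository.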
The eprints.org software will be adapted to provide storage for, and metadata descriptions of, the research data outputs. The data will be described using a schema based on existing work in the Combechem Project, expressed both as a human-readable document and as machine-readable XML and RDF schemas. The XML schema is required so that metadata records conforming to the schema can be exchanged using the OAI-PMH; the RDF schema will support use in the context of the Semantic Web/Grid. Recommendations are also required for how e-prints should cite the research data on which they are based; this may be achieved using a URI based on the unique identifier that is assigned to the research data when it is deposited in the e-data archive.
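By way of illustration, a Dublin Core metadata record for a deposited dataset might take roughly the following shape. This fragment is purely indicative: the element choices, identifier scheme and URIs are invented for the example and are not taken from the eBank UK schema, which was still being drafted at the time of writing.

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Crystal structure of compound X</dc:title>
  <dc:type>Dataset</dc:type>
  <dc:creator>Combechem Project</dc:creator>
  <!-- The identifier assigned when the data is deposited in the
       e-data archive; an e-print citing this data would reference
       this URI (hypothetical) -->
  <dc:identifier>http://ebank.example.ac.uk/data/dataset-42</dc:identifier>
  <!-- Link to a derived e-print based on this data (hypothetical) -->
  <dc:relation>http://eprints.example.ac.uk/archive/00000123/</dc:relation>
</oai_dc:dc>
```

The key point is the pairing of a stable identifier with relation links, which is what makes navigation in both directions between data and publications possible.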
An enhanced end-user interface for eBank UK, targeted for delivery through the RDN PSIgate Hub, will be developed to offer navigation from eprint metadata records to research data metadata records and vice versa. The Web interface will be hosted on the RDN Web site and will form the basis of the interface that will be embedded into the PSIgate Web site. Embedding will be implemented using the CGI-based mechanism that was developed for RDN-Include, and by investigating development of a Web Service based on the Web Services for Remote Portlets (WSRP) specification.
eBank will also investigate the technical possibilities for inferring which subject classification terms may be associated with research data, based on knowledge of the terms that have been automatically assigned to the eprints which cite those data resources.
Figure 2 below outlines the eBank UK information architecture framework under development (diagram by Andy Powell).
There are currently many initiatives to promote open access to the research literature through new approaches to scholarly publishing. These are based on the author self-archiving principle and have focused on the creation of subject-based repositories of e-prints such as Cogprints and arXiv, institutional repositories, and activities to facilitate aggregation or federation such as the national pilot service ePrints UK.
In a similar manner, data repositories have been created on a national basis, on a subject basis, and even about particular species. More recently, Grid-enabled UK e-science projects such as AstroGrid and GriPhyN, which are highly data-centric, are generating petabytes of data and require advanced tools for data management and data mining. The move towards a distributed Virtual Observatory will also require new distributed search and query tools such as SkyQuery, and Web Services form a key component of the technical infrastructure underpinning this development.
From a scholarly perspective, the strategic importance of institutional repositories has been articulated by Clifford Lynch, who notes their potential for addressing data dissemination and preservation in addition to their role in managing other scholarly outputs. Digital asset management systems such as DSpace support the storage of datasets as well as other digital formats. New projects are underway in the JISC-funded Supporting Institutional Records Management Programme to investigate the issues around the preservation of institutional research data.
Whilst the case for institutional repositories has been well presented in a SPARC Position Paper, there is evidence that the authors themselves, i.e. academic researchers, appear to be reluctant to deposit and share information in this way. The cultural barriers and the issues of intellectual property rights and quality control that have been identified as concerns to academics are being investigated in the JISC FAIR Programme/CURL-funded TARDIS, RoMEO and SHERPA projects. A further key issue is research impact in the context of freely-available electronic publications, and this has been addressed in a recent paper by Stevan Harnad and colleagues. Will similar reservations be expressed by academics when asked to deposit their research data in institutional repositories? To further complicate the picture, in some disciplines such as biomedicine there are difficult issues of sensitivity, privacy and consent which act as barriers to the availability of electronic data and information. In summary, it is vital that the socio-economic and cultural barriers are addressed in parallel with the technological challenges.
Provenance is a well-established concept within the archives community and in the art world, where the lineage, pedigree or origins of an archival record or painting are critical to determining its authenticity and value. It is of equal importance in science, where the provenance or origin of a particular set of data is essential to determining the likely accuracy, currency and validity of derived information and of any assumptions, hypotheses or further work based on that information. Significant research has been carried out on describing the provenance of scientific data in the molecular genetics databases SWISS-PROT and OMIM, and in collaborative multi-scale chemistry initiatives. The topic has recently been explored in a workshop at the Global Grid Forum (GGF6) in relation to Grid data, and the relationship of provenance to the Semantic Web has been noted. The Open Archives Initiative has also carried out some work to describe the provenance of harvested metadata records, and the concept is included as an element in the administrative metadata which is part of the METS metadata standard.
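For harvested metadata, the Open Archives Initiative's provenance schema records where and when a record was originally harvested, so that an aggregator re-exposing records can declare their origin. A record's "about" container might carry an originDescription of roughly the following shape; the baseURL, identifier and dates here are invented for illustration:

```xml
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
  <originDescription harvestDate="2003-07-01T09:00:00Z" altered="false">
    <!-- Where this record was harvested from (hypothetical repository) -->
    <baseURL>http://ebank.example.ac.uk/perl/oai2</baseURL>
    <identifier>oai:ebank.example.ac.uk:dataset-42</identifier>
    <datestamp>2003-06-28</datestamp>
    <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
  </originDescription>
</provenance>
```

Nested originDescription elements can record a longer chain of harvests, which is precisely the kind of lineage information that the scientific data case also demands.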
eBank will be reviewing the body of work on provenance, describing the observed trends and directions, and will particularly focus on the relationship between the creation, curation and management of research data and its integration into published information resources which are contained in digital libraries.
A recent joint UKOLN/National e-Science Centre (NeSC) workshop, Schemas and Ontologies: Building a Semantic Infrastructure for the Grid and Digital Libraries, explored some generic challenges which creators of digital libraries and Grid data repositories need to address. More specifically, eBank will be considering a wealth of metadata issues relating to the perceived hierarchy of data and metadata from raw data up to "published" results. These include the need to identify common attributes of a dataset and relate them to domain-specific characteristics, managing legacy data, dealing with metadata created at source by laboratory equipment, and the relationship to wider data curation activities. The Combechem project will act as a discrete case study, and metadata from three sources (e-Lab book, crystallography data and physical chemistry data) will be used to inform the drafting of a schema for describing chemistry datasets.
It is hoped that the outcomes of the project will have the potential for very significant long-term impact on the whole scholarly communications process. The availability of original data, together with the ability to track its use in subsequent research work, scholarly publications or learning materials, will have outcomes in a number of areas:
Finally, it is clear that some historic scientific controversies would have been easier to resolve within an environment of assured data provenance, and we can cite the example of Rosalind Franklin. She was one of the leading molecular biologists of the mid-twentieth century, working in the early 1950s alongside James Watson, Francis Crick and Maurice Wilkins. To this day, the debate continues on the significance of her role in determining the structure of DNA, in the absence of persistent digital evidence!
The contributions of Andy Powell (UKOLN) and Jeremy Frey (Combechem Project, University of Southampton) are gratefully acknowledged.
Dr Liz Lyon
Article Title: "eBank UK: Building the links between research data, scholarly communication and learning" Author: Liz Lyon
Publication Date: 30-July-2003 Publication: Ariadne Issue 36
Originating URL: http://www.ariadne.ac.uk/issue36/lyon/