ePrints UK: Developing a National E-prints Archive

Ruth Martin describes the technical work of the ePrints UK project, and outlines the non-technical issues that must also be addressed if the project is to deliver a national e-prints service

ePrints UK [1] is a two-year JISC-funded project under the Focus on Access to Institutional Resources (FAIR) Programme [2]. It began in July 2002 and is due for completion in July 2004. The lead partner is UKOLN at the University of Bath. The aim of the project is to develop a national service provider repository of e-print records, based at the University of Bath, by harvesting metadata from institutional and subject-based e-prints archives using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [3]. The project also aims to provide access to these institutional assets through the eight Resource Discovery Network (RDN) faculty-level hubs [4] and the Education Portal based at the University of Leeds [5].

Three-stage development project

The development work of the project is divided into three stages. Firstly, we are developing a central database of e-print records, built with the ARC harvesting software and hosted on the RDN servers at UKOLN. We will be harvesting metadata (and full text where available) of e-print records from OAI-compliant repositories in the UK and abroad. At the time of writing, our prototype database has been developed and we have started harvesting records from repositories at the Universities of Nottingham, Southampton, Glasgow, and Bath.
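
To make the harvesting step concrete, here is a minimal sketch in Python of what an OAI-PMH harvester does: it issues a ListRecords request and reads simple Dublin Core records out of the XML response. This is not the ARC software or the project's own code. The base URL, the sample response, and the function names are assumptions made up for illustration; only the verb, the argument names, and the XML namespaces come from OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a repository (base_url is hypothetical)."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iter("{%s}record" % OAI_NS):
        records.append({
            "identifier": rec.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS)),
            "titles": [e.text for e in rec.iter("{%s}title" % DC_NS)],
            "creators": [e.text for e in rec.iter("{%s}creator" % DC_NS)],
        })
    token = root.findtext(".//{%s}resumptionToken" % OAI_NS)
    return records, (token or None)


# Invented sample response, standing in for what a repository might return.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
          <dc:creator>Smith, John P.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

records, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=next_token)
```

A real harvester would fetch each URL over HTTP and keep following resumption tokens until the repository stops returning one.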

Secondly, we are developing suitable SOAP (Simple Object Access Protocol) interfaces to pass the metadata (and full text) to external Web services for enhancement, augmentation, or validation of the metadata. Two of these Web services are based at the OCLC research centre at Dublin, Ohio [6]. The first is a subject classification service, which automatically assigns Dewey Decimal classmarks to the metadata. The second is a name authority service, which checks the author's name as it appears in the metadata against name authority files (thus the author entry "John Peter Smith" can be standardised to "John P. Smith" and his paper can be located alongside all other papers appearing under this form of his name). A third Web service, a citation analysis service, is offered by the Open Citation project team based at the University of Southampton [7]. This service will parse semi-structured citation information in the document text to form structured, machine-readable citations in the form of OpenURLs. The metadata, enhanced by the Web service applications, will be returned to the central ePrints UK database via the SOAP interfaces. The full-text material, which will be used to facilitate the subject classification service, will not be retained.
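
As a rough illustration of the citation analysis idea, the Python sketch below turns one semi-structured citation string into an OpenURL-style link. It is a toy under stated assumptions, not the Open Citation project's parser: the regular expression handles only a single invented citation layout, the resolver address is hypothetical, and the query keys (genre, aulast, auinit, date, atitle, title, volume, spage, epage) are the familiar ones from the original OpenURL 0.1 syntax.

```python
import re
from urllib.parse import urlencode

# Matches only one assumed layout:
#   "Surname, Initials (Year). Article title. Journal title, volume, start-end."
CITATION_RE = re.compile(
    r"^(?P<aulast>[^,]+),\s*(?P<auinit>[^()]+?)\s*\((?P<date>\d{4})\)\.\s*"
    r"(?P<atitle>[^.]+)\.\s*(?P<title>[^,]+),\s*(?P<volume>\d+),\s*"
    r"(?P<spage>\d+)-(?P<epage>\d+)\.?$"
)


def parse_citation(text):
    """Return a dict of citation fields, or None if the layout is not recognised."""
    match = CITATION_RE.match(text.strip())
    return match.groupdict() if match else None


def to_openurl(fields, resolver_base):
    """Encode parsed fields as an OpenURL-style query on a (hypothetical) resolver."""
    params = {"genre": "article"}
    params.update(fields)
    return resolver_base + "?" + urlencode(params)


# Invented citation and resolver address, for illustration only.
fields = parse_citation(
    "Smith, J. P. (2001). Harvesting metadata. Journal of Examples, 12, 34-56."
)
link = to_openurl(fields, "http://resolver.example.org/openurl")
```

Real reference lists vary far more than a single pattern can cope with, which is why a dedicated citation analysis service is needed.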

Finally, the ePrints UK service will be made available to end users in a number of ways. There will be a central website for the project, integrated with the current RDN website, providing a search interface to all the enhanced, harvested metadata. In addition, ePrints UK will offer shared, configurable discovery services that enable the RDN hubs, UK academic institutions and other organisations to embed ePrints UK easily within their own services. This functionality will be based on three approaches: a Z39.50 target supporting Functional Areas A and C of the Bath Profile; a SOAP interface allowing sophisticated integration of ePrints UK within other services; and a simpler approach based on JavaScript and HTTP linking, for those services not able to support SOAP. These approaches will be closely based on the RDN's existing RDN-Include and RDNi-Lite offerings [8].
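
For readers unfamiliar with SOAP, the following Python sketch shows the general shape of a SOAP 1.1 request that a partner service might send to a search interface of this kind. The operation name (search), its parameters (query, maxRecords) and the service namespace are all hypothetical, since the ePrints UK interface is not specified here; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "urn:example:eprints-search"  # hypothetical service namespace


def build_search_envelope(query, max_records=10):
    """Return a SOAP 1.1 envelope, as a string, wrapping a hypothetical search call."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    search = ET.SubElement(body, "{%s}search" % SERVICE_NS)
    ET.SubElement(search, "{%s}query" % SERVICE_NS).text = query
    ET.SubElement(search, "{%s}maxRecords" % SERVICE_NS).text = str(max_records)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_search_envelope("open access", max_records=5)
```

The envelope would be sent to the service in the body of an HTTP POST, and the matching records would come back wrapped in a similar envelope.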


Figure 1: The Project Architecture (Andy Powell, UKOLN)


A technical project?

ePrints UK, very clearly then, is a technically focused project with the ultimate aim of becoming a working service for the RDN and the wider HE/FE community. However, what is also clear to us is that the viability of an eventual ePrints UK service depends upon there being sufficient data provider repositories from which to harvest records, and sufficient records within those archives available to be harvested. This is less a technical problem than a managerial one. The familiar barriers inhibiting the proliferation of institutional e-print archives and the practice of self-archiving e-print papers (IPR concerns and publisher constraints, fears about the quality of pre-print material, control of metadata standards, and so on) are as much the concern of our service provider project as they are of anyone trying to develop a data provider service.

What's in a name? (1)

One solution to the shortage of UK-based data provider repositories is simply to look abroad and harvest records from the more established international archives. To decide to do this took a certain amount of heart-searching at our first project team meeting: we are called ePrints UK after all. What would the UK part of our name mean if we took this approach? In the end, we decided that harvesting from international archives was justified since the aim of these archives is to make their resources visible to international users, and we would be contributing to this by making them available specifically to the UK Higher and Further Education communities via the RDN hubs. Phew, the project name could stay.


Our second approach towards tackling the record-shortage problem is to work with the wider OAI community to advocate the virtues of e-print archives and self-archiving to authors, publishers, and potential repository managers. Firstly, UKOLN Research Officer Michael Day will be writing four supporting studies covering: the impact of e-print archives in supporting learning, teaching and research activities in the UK HE/FE sectors; collection description issues; business and IPR issues relating to e-print service providers; and the requirements of funding councils if usage statistics from the ePrints UK service were to be included in the work of the Research Assessment Exercise (RAE). We hope to publish the findings of these studies in future issues of Ariadne.

Secondly, we plan to work in collaboration with the other JISC FAIR Programme e-prints and e-theses projects with which we have become associated at "cluster group" level (mercifully known as "eFAIR" for short). These include SHERPA and DAEDALUS, about which articles have already appeared in Ariadne [9]. At an initial cluster group meeting we decided that we had a common interest in advocacy work, and we hope to plan this work in more detail at our second meeting in March 2003.

Metadata standards

As the eventual ePrints UK service aims to be a central service provider repository for the UK Higher and Further Education communities, we felt that, as a project, we were in a strong position to offer guidance on metadata standards to the newly emerging institutional repositories in the UK. To this end, project officers Andy Powell, Pete Cliff, and Michael Day have compiled recommendations on the use of simple Dublin Core metadata to describe e-prints in data provider archives. The aim is to facilitate more consistent results when searching and browsing records in the ePrints UK repository and other service provider archives. These guidelines can be found at the project website and have been distributed on the mailing lists of the Open Archives Initiative and the Open Archives Forum [10].
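
By way of illustration, the Python sketch below builds a simple Dublin Core description of an e-print in the oai_dc XML format that OAI-PMH repositories expose. It does not reproduce the project's recommendations: the sample values and the function name are invented, and the only rule it enforces is that each element must be one of the fifteen elements of simple Dublin Core.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# The fifteen elements of simple (unqualified) Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_simple_dc(fields):
    """Serialise (element, value) pairs as an oai_dc record; elements may repeat."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError("not a simple Dublin Core element: %r" % name)
        ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values, for illustration only.
record_xml = to_simple_dc([
    ("title", "An example e-print"),
    ("creator", "Smith, John P."),
    ("date", "2003-01-15"),
    ("identifier", "http://eprints.example.ac.uk/archive/00000001/"),
])
```

Consistency matters more than the mechanics shown here: if one archive records creators as "Smith, John P." and another as "J. P. Smith", a service provider has to reconcile them before browsing by author can work.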

What's in a name? (2)

You may have noticed the slightly idiosyncratic spelling of our project name as ePrints UK, whereas throughout this article I have referred to e-prints, spelt with a hyphen. The reason is that our project website has the URL www.rdn.ac.uk/projects/eprints-uk and, understandably, we were getting our hyphens muddled up. But I must admit to envy of the wonderfully named FAIR projects: DAEDALUS [11], SHERPA [12], TARDis [13], RoMEO [14] and the like. Instead of classical, Shakespearean or cult TV allusions, we have a project name which is destined always to be misspelt. But once our eventual service is up and running, we will at least, I hope, have a name that "does exactly what it says on the tin!"


