Setting up an Institutional E-Print Archive
This article outlines some of the main stages in setting up an institutional e-print archive. It is based on experiences at the universities of Edinburgh and Nottingham, which have both recently developed pilot e-print servers(1). It is not the intention here to present arguments in favour of open access e-print archives – this has been done elsewhere(2). Rather, it is hoped to give an account of some of the practical issues that arise in the early stages of establishing an archive in a higher education institution.
What are 'e-prints'?
‘E-prints’ are electronic copies of academic research papers. They may take the form of ‘pre-prints’ (papers before they have been refereed) or ‘post-prints’ (after they have been refereed). They may be journal articles, conference papers, book chapters or any other form of research output. An ‘e-print archive’ is simply an online repository of these materials. Typically, an e-print archive is made freely available on the web with the aim of ensuring the widest possible dissemination of its contents.
There are a number of successful open access e-print archives already in existence. Perhaps the best known is arXiv(3), a service for high energy physics, maths and computer sciences. Another example is CogPrints which covers cognitive science(4). Both of these are centralised subject-based services. They are single e-print repositories based in single institutions – Cornell and Southampton universities respectively. Authors from any institution are required to submit their papers to the archive remotely by email or using the self-archiving procedure online.
Centralised subject-based archives work, but so far they have only been taken up by a limited number of subject communities. Because of this an alternative model is being suggested by advocates of e-prints: institutional e-print archives(5). Institutions, it is argued, have the resources to subsidise archive start-up; they also have the organisational and technical infrastructures to support ongoing archive provision. In addition, they have a direct interest in exposing their research output to others, as this would promote the institution's standing in the research community.
So far, there are few examples of established institutional e-print archives. The archives set up by Edinburgh and Nottingham are attempts to experiment with the possibilities to see if the distributed institutional model works(6). If it does, one of the key factors responsible for its success will undoubtedly be the Open Archives Initiative.
What is the Open Archives Initiative?
The Open Archives Initiative (OAI)(7) “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.”(8) At the centre of this work is the OAI Metadata Harvesting Protocol. This creates the potential for interoperability between e-print archives by enabling metadata from a number of archives to be harvested and collected together in a searchable database. The metadata harvested is in the form of Dublin Core and normally includes information such as author, title, subject, abstract, and date.
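As a rough illustration of what a harvester does, the sketch below builds a request for an archive's records and extracts the Dublin Core fields from a response. It assumes the request parameters and XML namespaces of version 2.0 of the protocol (OAI-PMH), which may differ from the version a given archive runs. The base URL and the sample record are invented for illustration, and no network access is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would request to list an archive's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per record, mapping Dublin Core element names to lists of values."""
    root = ET.fromstring(xml_text)
    records = []
    for dc in root.iterfind(".//oai_dc:dc", NS):
        fields = {}
        for element in dc:
            name = element.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records


# Hypothetical response from a hypothetical archive, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:creator>Author, A.</dc:creator>
          <dc:date>2002-01-15</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

records = parse_records(SAMPLE_RESPONSE)
```

A Service Provider would repeat this for each registered archive and load the resulting dictionaries into a single searchable database.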
The OAI distinguishes between ‘Data Providers’ and ‘Service Providers’. OAI Data Providers are archives which expose metadata to harvesters. Service Providers collect the metadata and create services with it, such as facilities which allow users to search the metadata. This article concentrates on the first of these roles. The distinction between Data Provider and Service Provider is conceptually very important but the terminology used may be a little unpopular with archive providers. Many archive providers regard their role as providing a service themselves. At Edinburgh and Nottingham we certainly think of what we are doing as providing a service directly to users as well as providing data for automated harvesters.
Be that as it may, the most important point to make about OAI is that it creates the conditions for making distributed archives interoperable. There is the potential for a cross-searchable global virtual research archive in which papers will be easily retrievable wherever they are located.
Setting up the server
At both Edinburgh and Nottingham we have set up our e-print archives using eprints.org software(9). This software has been created at the University of Southampton and is made freely available for anyone to use. The major advantage of this software is that it comes already OAI-compliant. Once it is installed, it is automatically ready to generate metadata in a form which can be picked up by OAI harvesters. We have used version 1 of the software and our experiences of this are described here. Version 2 has recently been made available and so new users will want to install this.
The installation at Nottingham is described here to illustrate a few points. At Nottingham, the web development platform we are using is a standard Intel PC (800MHz; 256MB RAM; 20GB IDE disk) running Linux (SuSE 7.2) with the Apache web server and MySQL database. This provides a simple and inexpensive system on which to trial new applications but we intend to migrate to a more substantial system before offering a full service.
We found the eprints.org software relatively straightforward to install, but installation was not without problems and there are still some modules that we have not got working. Installation of both versions 1 and 2 requires knowledge of Perl and MySQL.
One problem which arose in the installation of version 1 was with the automatic user registration via email (users have to register in order to self-archive their material). We never got this working satisfactorily and currently have to register users manually. There is a new web-based registration system in eprints.org version 2 which we assume is easier to configure. However, if we were to roll self-archiving out to the whole institution, then we would definitely want to integrate with our current user registration methods (such as via LDAP to our NDS), rather than create another set of usernames and passwords for users.
Once the software is installed it is necessary to configure the metadata formats (such as the subject hierarchy and file formats) and customise the user interface. It is simple to design your own subject hierarchy and load this into the database. However it is far more complex to alter this once you have started to upload documents, so it is important to get this right before uploading too many papers. It is also easy to change the list of accepted file formats and to change the design for the static home and information pages. Similarly it is straightforward to design a custom header and footer and apply this to all the dynamic and static pages. It is only slightly more complex to change the dynamic pages such as the document abstract pages (for example, a couple of lines of Perl adds in the document status).
On the whole, eprints.org software is an impressive and workable piece of software. It allows institutions to create an instant framework for an OAI-compliant repository without having to do their own technical development.
Once the software has been installed, the server needs to be registered with the OAI. The OAI maintains a list of OAI-compliant archives for OAI Service Providers to visit. Before registering the archive, the OAI will point a harvester at it and carry out a number of tests to check that it is fully OAI-compliant. When this is completed, they will confirm by email and the archive will be added to the public list. Nottingham has already registered its archive. The process is taking longer at Edinburgh, where the appropriate committee of the University has yet to ratify the Edinburgh Research Archive (ERA) going public.
Document types and formats
The OAI protocol provides the functionality for metadata interoperability but it is not a specification for archive content. In addition to OAI compliance, it is essential to develop an e-print archive collection policy which specifies various aspects of collection development and management.
One key element of this is document type. What sort of document will be accepted by an archive? A crucial question here is whether the archive will accept pre-prints as well as post-prints. What other criteria would be used for papers? Are conference papers or technical reports acceptable?
Next there is the question of file format. The default on the eprints.org software is to accept
Archive managers may want to add to or take away from these formats. Possible additions may be specialised document preparation formats such as TeX or LaTeX, used by mathematicians and physicists, or common formats, such as Rich Text Format. There are open-source utility programs available to convert from non-supported to supported formats. Conversion from LaTeX to either Postscript or PDF can be achieved by using one of these programs.
Consideration may also be given as to whether any of these default formats should be switched off. HTML for example is a very fluid standard which is difficult to validate easily. It may be thought advisable to not accept documents in this format.
Related to the question of document format is the question of digital preservation. One of the concerns frequently raised by institutions in response to the idea of the development of a ‘free corpus’ of research publications based upon the OAI is the very question of ‘archiving’ itself. The ‘Archive’ of the Open Archives Initiative refers primarily to the process of depositing articles, rather than to the process of preserving them. A project with a similar acronym, OAIS (the ‘Open Archival Information System’)(10), addresses the question of archiving for long-term preservation. In the medium to long term e-print archive managers may well want to apply its principles in running their archives. As Peter Hirtle has stated, “An OAI system that complied with the OAIS reference model, and which offered assurances of long-term accessibility, reliability, and integrity, would be a real benefit to scholarship.”(11) For this reason it is good to at least be aware of the potential of OAIS now.
OAIS is a model which is based on the premise that digital objects must be converted into bitstreams which can then be preserved indefinitely. This is achieved by a two-stage process known as ‘ingest’, in which data is separated from medium into an underlying abstract form. The underlying abstract form is then mapped into a bitstream, which is preserved. This model, by operating at a high level of logical abstraction, very successfully describes a system for rendering a digital resource into a format for preservation which can then be regenerated by reversing the steps.
Having created the form of the document for preservation, there are essentially three strategies for long-term digital preservation.
- 1. Migration: data is stored in a software-independent format and migrated through successive hardware regimes
- 2. Technology preservation: data is stored together with the hardware and software required to make use of it
- 3. Emulation: the look, feel and behaviour of a resource are emulated over time on a succession of hardware and software configurations. This is most appropriate for resources produced in non-standard or proprietary formats. Emulation is really a form of ‘virtual’ technology preservation, although it also requires migration of the emulator software in the medium and longer terms.
With the first and third of these strategies, one important aspect of preservation is that of location. Eprints.org software automatically assigns a unique URL to each paper, but if in future another piece of software is used for an archive, these URLs would probably change. The California Institute of Technology (CalTech) has addressed this issue by creating a system where a perpetual URL can be assigned for each paper (12).
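The principle behind a persistent URL can be sketched as a lookup table that maps a stable identifier to wherever the paper currently lives. This is only an illustration of the idea, not a description of how the CalTech system works; the class, identifiers and hostnames below are all hypothetical.

```python
class PersistentResolver:
    """Maps stable identifiers to the current location of each paper."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record (or update) where a paper currently lives."""
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current URL for an identifier, or None if it is unknown."""
        return self._locations.get(identifier)


resolver = PersistentResolver()
resolver.register("eprint:2002.001", "http://eprints.example.ac.uk/archive/00000001/")

# After a move to different archive software only the table entry changes;
# the identifier that readers have cited stays the same.
resolver.register("eprint:2002.001", "http://repository.example.ac.uk/papers/1")
```

In practice the table would sit behind a web address that redirects to the resolved location, so that the cited link continues to work after a change of software.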
Once the question of document types and formats has been addressed, the next collection development issue is submission procedure. The e-print movement has traditionally been associated with so-called ‘self-archiving’, where the authors themselves format their documents and submit them. This works well in established archives. The eprints.org software has a self-archiving facility but our experience of this is that it is rather long-winded and requires a certain amount of IT literacy. Some users may well be put off.
In view of this some advocates of e-prints are suggesting that submission to an institutional archive should be mediated. At least at the beginning the library (or whoever is managing the archive) should deposit the items on behalf of users. We have found this more or less the only thing that works. It may also involve the additional job of file formatting and conversion for users. For example, many users may not have the facilities to convert a word-processed file into a PDF. There is an argument which says that the archive administrator should take this on, at least at the beginning.
Metadata standards and quality
Another role in which the archive administrator should be active is metadata creation. Since OAI is based on the exchange of metadata, getting the metadata right is fundamentally important. The OAI protocol harvests unqualified Dublin Core metadata but in practice this can mean pretty much anything. It is crucial to have some kind of metadata quality threshold to ensure that it is accurate and sufficiently detailed. This is particularly true in the context of self-archiving. One of the potential problems of self-archiving is self-created metadata, with all of the inaccuracies that implies. In a self-archiving environment it is important to have some kind of approval process where metadata can be checked and if necessary enhanced before the record is made public. The eprints.org software builds this into the workflow. An item has to be approved by the system administrator before it goes live. The administrator can accept, edit or bounce a submission at that stage.
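A quality threshold of this kind could be expressed as a simple check run before a record is approved. The sketch below is illustrative only and is not part of the eprints.org software; the list of required elements and the accepted date layouts are assumptions that each archive would set for itself.

```python
import re

# Assumed minimum set of Dublin Core elements for a record to be approved.
REQUIRED_FIELDS = ("title", "creator", "date", "description")

# Accept YYYY, YYYY-MM or YYYY-MM-DD (an assumed house style).
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def check_record(record):
    """Return a list of problems found in a record; an empty list means it passes.

    `record` maps Dublin Core element names to lists of string values.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        values = [v for v in record.get(field, []) if v.strip()]
        if not values:
            problems.append(f"missing {field}")
    for value in record.get("date", []):
        if value.strip() and not DATE_PATTERN.match(value.strip()):
            problems.append(f"unrecognised date format: {value}")
    return problems
```

An administrator could accept a submission when the list is empty and otherwise edit it or bounce it back to the author with the reported problems.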
Metadata format and quality is obviously crucial to OAI Service Providers who are harvesting it and creating search facilities. The creators of the ARC service (an experimental Service Provider) report a number of problems associated with metadata diversity(13). These range from simple things like different variant spellings or date formats to more complex problems like different subject descriptors. Normalising this kind of metadata to - say - create a meaningful browse index is very difficult. They suggest the use of controlled vocabularies, but how realistic this is remains to be seen.
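To give a feel for the normalisation problem, the sketch below reduces dates written in several layouts to a year and builds a small browse index from them. The list of layouts is an assumption, not taken from the ARC report: month names are assumed to be in English, and day-first order is assumed for slash-separated dates, which is exactly the kind of guess that makes harvested metadata hard to reconcile.

```python
from datetime import datetime

# Date layouts assumed to occur in harvested records, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%B %Y", "%Y")


def normalise_date(raw):
    """Return the year of a date string as an int, or None if no known layout fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).year
        except ValueError:
            continue
    return None


def browse_by_year(records):
    """Group record titles by normalised year; unparseable dates go under None."""
    index = {}
    for record in records:
        index.setdefault(normalise_date(record["date"]), []).append(record["title"])
    return index
```

Dates are among the easier cases. Subject descriptors drawn from different schemes have no equivalent of a list of layouts to try, which is why controlled vocabularies are suggested.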
One other possible weakness of the OAI protocol relates to metadata. This is the fact that the OAI metadata is not picked up by conventional search engines. We have found that search engines actually do pick up some of the HTML from our browse pages in the Nottingham archive. But this is not very efficient. New software tools such as DP9(14), which can translate OAI-compliant metadata into search engine-friendly data, may be important in addressing this problem for OAI archive managers.
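One way to picture ‘search engine-friendly data’ is a static HTML page generated from each record, with the Dublin Core values repeated as meta tags. The sketch below follows the common convention of naming such tags ‘DC.’ plus the element name. It is not a description of how DP9 works, and the function name is hypothetical.

```python
from html import escape


def record_to_html(record):
    """Render a Dublin Core record (element name -> list of values) as a static HTML page."""
    meta_tags = []
    for name, values in record.items():
        for value in values:
            # escape() also escapes quotes, so values are safe inside attributes.
            meta_tags.append(
                f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
            )
    # Fall back to a placeholder when the record has no title.
    title = escape((record.get("title") or ["Untitled"])[0])
    head = "\n".join(meta_tags)
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n{head}\n"
        "</head>\n"
        f"<body><h1>{title}</h1></body>\n</html>"
    )
```

A crawler that cannot speak the OAI protocol can still follow links to pages like this and index their contents.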
The installation of an OAI-compliant e-prints server from scratch is not a costly business in terms of hardware and software. Most of the costs are in staff time. A rough guide to the staff time required to carry out the installation is:
- Software installation: one to two days
- Web interface customisation: three days
Added to this are the hardware costs of a server. This cost does depend on whether or not a dedicated server is purchased and installed for the e-print archive.
The costs of installation, however, are insignificant compared with ongoing costs of managing the archive and in particular in encouraging participation by researchers. This is something we are only just beginning to do at Edinburgh and Nottingham but it is already clear that this is the biggest challenge.
Encouraging user participation
What kind of participation?
Setting an archive up is one thing; getting users to participate in its ongoing development is quite another. The participation of users is required in two main ways to make an e-prints archive work: first they need to contribute content, secondly they need to use it. Left to itself, this is a chicken and egg situation (they won’t use it until there is content; they won’t contribute content until they are using it) so some kind of initial effort is required from the archive managers to get things moving. The most important (and most difficult) thing is getting content in place.
There are two phases to getting content. The initial (short-term) phase is getting enough content in place to set up a demonstrator. The second (medium to long-term) phase is getting a critical mass of content in place to provide a useful service. The first is important in achieving the second. Based on the principle that ‘demonstration is better than description’, it is much easier to discuss the possibilities if you are looking at a demonstrator. Users understandably like to see that the thing works before they will want to contribute to it.
In the short term
In setting up a demonstrator database it is important to get some ‘real’ content in place. We found that it was easiest to include publications already in the open access public domain. At both Nottingham and Edinburgh, we discovered some of these on the institutional web site, either on personal or departmental pages. Academics pointed us at articles on their departmental servers in some cases. We located others in existing e-print archives, such as arXiv. In all cases we contacted the staff concerned by email and asked them if we could include the items in our e-prints demonstrator. In all cases they agreed and some even sent back a couple of other papers as well. This approach enabled both institutions to get 50 or so papers in place relatively quickly for demonstration purposes.
Arguing for e-prints
One of the key ways of getting content for an e-print archive is to talk to academic colleagues more generally about scholarly communication issues. An e-print archive is after all not an isolated development. Rather it is a response to a number of structural problems in the academic publishing industry. Describing these problems and showing how e-prints are a possible solution is crucial to any institutional advocacy campaign.
A word of warning here. Academics are not normally interested in the ‘serials crisis’ per se. To simply observe that serial prices have been rising at unreasonable rates since the 1980s is not normally convincing in itself that anything needs to be done by researchers. This can easily be written off as ‘the library’s problem’. Rather it is important to marshal the arguments from the point of view of a researcher, as a contributor to and reader of the literature.
So, what is in it for the researcher? There are a number of possible approaches to answering this question:
- Lowering “impact barriers”. E-print archives make papers more visible. Papers are freely available for others to consult and cite. Evidence is beginning to emerge that work which is freely available is in fact cited more(15).
- Ease of access. This is the other side of the coin of lowering impact barriers. It means that access to the literature should be freed up, in contrast to the current system where most of the research literature is not easily available to most researchers.
- Rapid dissemination. Depending on what document types are accepted in the archive (pre-prints or post-prints) online repositories can really speed up the process of dissemination of research findings. In certain fast-moving disciplines this can be an attractive prospect. It was one of the main motivating factors behind the set up of the arXiv service.
- OAI functionality. OAI-compliant archives can be cross-searched. The potential for a global cross-searchable research archive needs to be emphasised. Showing users the potential of using services like ARC may help them to see how that cross-searching can work in practice. We have found that we need to spell out this advantage very clearly. Academics are not familiar with terminology such as ‘interoperability’ or even ‘cross-searching’. If they are left with the impression that searchers will need to log in to lots of different institutional servers – at Nottingham or Edinburgh or any other university – they will regard the initiative as doomed to failure (as indeed it would be).
- Value-added services. These might include presenting authors with details of hits on their papers, or enabling them to create publications lists on their own pages from e-print archive data. Developing services such as Citebase may create the potential in the future to give users citation analyses of their work(16).
As well as persuading authors on the ground it is important to persuade institutional managers and policy makers. The question is: what is in it for the institution?
- Raising the profile of the institution. Ensuring that the research output of the institution is widely disseminated is in the interests of the institution as a whole. This helps to enhance its reputation and thus its ability to attract high quality researchers and further research funds.
- RAE management. E-print archives may be helpful in managing submissions for any future RAEs. They ensure that a good number of papers (and bibliographical data about them) are easily available in advance.
- Long term cost savings. These savings will result in reducing outlays for periodical subscriptions.
This last point is not one to push too much. There is a danger that it may cause managers to reduce the money allocated to libraries prematurely. It needs to be emphasised this is a long-term potential gain if the investment is put in now in setting up and populating these archives. The idea that e-print archives will lead to immediate gains as part of wider policies to manage the institution’s informational assets is one that sometimes seems to strike a chord with senior managers.
As well as putting the positive case for e-prints it is also important to address concerns that might be raised in the minds of academics and managers. In our experience there are a number of major concerns that academics seem to raise on a regular basis:
- Intellectual property rights (and particularly copyright)
- Quality control (and particularly peer review)
- Workload (theirs!)
- Undermining the ‘tried and tested’ publishing status quo (on which academic reputations and promotions rest)
The question of IPR and copyright is an interesting and complex one. Who owns copyright of research output? The custom and practice in most HEIs is that academic authors are permitted to claim or dispose of copyright themselves. Many research journals require them to sign over copyright before publication. However, there is an argument in law that the copyright of research output is actually owned by the employer (the HEI) rather than the individual. But we have found that raising this question with academics can be rather sensitive. In many cases, rather than attempt to fight this one out with researchers, it is best to assume that the author owns copyright and take it from there.
The most important thing is that authors should be discouraged from thoughtlessly signing away their copyright to publishers. Authors should be encouraged where possible to retain their copyright by either submitting to journals that do not require sign-over or altering the copyright agreement to retain their copyright (or at least e-distribution rights). If this fails, staff can be encouraged to invoke the ‘Harnad-Oppenheim’ strategy where a pre-print version of an article plus corrections (made as a result of referee comments) are deposited in an e-print server(17). All of this may require someone with a good knowledge of copyright being available to help authors. At Edinburgh and Nottingham we are certainly considering ways of giving academic authors easy access to copyright advice when they need it.
The crucial message is that authors do not need to stop submitting their work to high-impact traditional journals. They should carry on doing so but also place a copy of their work in the e-print archive. It is not an ‘either/or’ situation.
This argument may also help to allay fears over quality control. Authors often fear that self-archiving is the same as self-publishing and that it undermines peer review. The important point here is that for the immediate term at least authors should still submit their papers to journals in order to get the peer review ‘kite mark’. In physics, researchers still submit their work to journals even though they put it on arXiv.
Physicists do however also deposit pre-refereed versions of papers (pre-prints) on arXiv. Physics has a well established pre-print culture. But this is not the case with other disciplines. In our experience, many academics from other disciplines strongly dislike the idea of publicly available pre-prints. We have found it useful in these cases to downplay the pre-print idea and encourage authors to contribute post-prints only to the e-print archive. Discipline differences are such that some institutions might consider having a number of different e-print archives for different subject areas each with different policies on document type.
It is important whatever happens that e-print archives are run in such a way that they address the needs and working patterns of researchers. Things should be made as easy as possible for them to contribute. At the beginning, emphasising that ‘the library will do the work’ may be the only way to get content. Academics do not want additional bureaucratic burdens nor do they want to have to learn new IT skills. Allowing them to email a paper to an archive administrator who will then do the format conversion and e-print submission will encourage them to provide content.
They will also be encouraged to provide content if they do not think that the e-prints movement will undermine the ‘tried and tested’ norms of scholarly communication. Some academics are reasonably content with the existing systems. They have built up their reputations using them. They do not view commercial publishers in the negative light that some librarians do since they are often shielded from the economic realities of the journal industry. A few who are editors might even receive some form of payment from publishers. It is often a good idea therefore to picture self-archiving as complementary to ‘tried and tested’ journals, which is actually the case. Once again the fundamental message is ‘do not stop submitting papers to peer reviewed journals - but please deposit them in the e-print archive as well’.
At Edinburgh and Nottingham we are only just getting under way with our advocacy so we do not have too much experience, but we are already finding it a challenge. We have used a number of different dissemination methods:
- Setting up a project web site (linked to from the archive itself). This can act as a focus for developments and news(18).
- Producing a briefing paper. This is useful for presenting to committees. It should include specific recommendations for action and should be no more than two sides of A4.
- Distributing literature, such as the SPARC Create change leaflet(19).
- Using university magazines, including the Library user newsletter.
- Presenting at departmental meetings and university committees.
- Organising special advocacy events for university staff.
Various staff can be involved in these activities. For senior university committees senior library managers should be involved. Their commitment is often a good way to ensure the project retains momentum. At other levels, subject librarians are often ideally placed to spread the word and encourage participation.
As with all library development projects, trying to identify ‘champions’ in academic departments who can encourage colleagues to take part is often the most valuable approach. It may be possible to try to take a departmental approach in which several members of staff from a single department are encouraged to contribute. Gaining the support of a senior member of staff may be crucial here. Of course, it is important to pick the right champions. It is crucial their ideas are not too radical. There are academics, for example, who are very opposed to traditional peer review practices and who may conceivably view open archiving as a banner under which to promote their views. ‘Champions’ of this sort may do more harm than good.
Whatever methods are used, our limited experience shows that it is a slog. There is no magic bullet. The message has to be put across using different media and fora on repeated occasions. It takes time for it to penetrate.
The institutional e-print model still needs testing but it certainly has potential. What we need now are more examples of institutional e-print archives to explore implementation issues. We also need more OAI Service Providers to see whether search facilities and other value-added services can be provided in a way which is useful to researchers. It is hoped that in the UK the JISC-funded FAIR (Focus on Access to Institutional Resources) programme will help to promote the use of OAI-compliant implementations, including e-print archives(20). Whether there is external funding available for implementers or not, it needs to be recognised that OAI-compliant e-print archives are a real opportunity to improve access to the research literature and enhance the scholarly communication process. Library and information professionals should have the vision to take the lead on these important developments.
- (1) Nottingham’s is currently publicly available at: http://www.nottingham.ac.uk/library/eprints/ (http://www-db.library.nottingham.ac.uk/eprints/). For other experimental institutional archives see, for example, Glasgow’s available at http://eprints.lib.gla.ac.uk:333/.
- (2) See for example Stevan Harnad, ‘For whom the gate tolls? How and why to free the refereed research literature online through author/institution self-archiving, now’. Available at http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm.
- (3) http://www.arxiv.org/
- (4) http://cogprints.soton.ac.uk/
- (5) See Stevan Harnad, ‘The self-archiving initiative’. Nature: webdebates. Available at http://www.nature.com/nature/debates/e-access/Articles/harnad.html.
- (6) Some of these issues are discussed in Stephen Pinfield, ‘How do physicists use an e-print archive? Implications for institutional e-print services’. D-Lib Magazine, 7, 12, December 2001. Available at http://www.dlib.org/dlib/december01/pinfield/12pinfield.html. UK mirror site: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/december01/pinfield/12pinfield.html.
- (7) http://www.openarchives.org/
- (8) http://www.openarchives.org/documents/FAQ.html#What is the mission of the Open Archives Initiative.
- (9) http://www.eprints.org/
- (10) http://ssdoo.gsfc.nasa.gov/nost/isoas/. See Peter Hirtle, ‘Editorial: OAI and OAIS: what’s in a name?’ D-Lib Magazine 7, 4, April 2001. Available at http://www.dlib.org/dlib/april01/04editorial.html. UK mirror site: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april01/04editorial.html.
- (11) Peter Hirtle, ‘Editorial: OAI and OAIS: what’s in a name?’ D-Lib Magazine 7, 4, April 2001. Available at http://www.dlib.org/dlib/april01/04editorial.html. UK mirror site: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april01/04editorial.html.
- (12) Ed Sponsler ‘PURR - The Persistent URL Resource Resolver’, October 2001. Available at http://resolver.library.caltech.edu/caltechLIB:2001.003.
- (13) Xiaoming Liu et al ‘Arc - An OAI service provider for digital library federation’. D-Lib Magazine, 7, 4, April 2001. Available at http://www.dlib.org/dlib/april01/liu/04liu.html. UK mirror site: http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april01/liu/04liu.html.
- (14) http://arc.cs.odu.edu:8080/dp9/index.jsp
- (15) Steve Lawrence ‘Free online availability substantially increases a paper's impact’. Nature: webdebates. Available at http://www.nature.com/nature/debates/e-access/Articles/lawrence.html.
- (16) http://citebase.eprints.org/
- (17) Stevan Harnad, ‘For whom the gate tolls? How and why to free the refereed research literature online through author/institution self-archiving, now’, Section 6. Available at http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm#Harnad/Oppenheim.
- (18) Examples at Nottingham, http://www-db.library.nottingham.ac.uk/ep1/information.html, and Glasgow, http://www.gla.ac.uk/createchange/.
- (19) For online equivalent see http://www.gla.ac.uk/createchange/
- (20) http://www.jisc.ac.uk/pub02/c01_02.html
Mike is Web Support Officer in Library Services at the University of Nottingham
Academic Services Librarian at the University of Nottingham
John is the Director of the SELLIC project at the University of Edinburgh (Science and Engineering Library)