The Tapir: Adding E-Theses Functionality to DSpace

Richard Jones

Richard Jones demonstrates how the Theses Alive Plugin for Institutional Repositories (Tapir) has provided E-Theses functionality for DSpace.

The Theses Alive Plugin for Institutional Repositories (Tapir) [1] has been developed at Edinburgh University Library [2] to help provide an E-Thesis service within an institution using DSpace [3]. It has been developed as part of the Theses Alive! [4] Project under funding from the Joint Information Systems Committee (JISC) [5], as part of the Focus on Access to Institutional Resources (FAIR) [6] Programme.

This article looks at DSpace, the repository system initially developed by Hewlett-Packard and MIT and subsequently made available as a community-owned package. We discuss how this community-driven open-source development method can work when third-party tools such as the Tapir are also involved, and what issues arise. One of the primary objectives is to introduce the Tapir in detail, explaining what has been developed and what relevance this has to E-Theses. There is also a very brief introduction to the UK-recommended E-Theses metadata set.

As a use case, we will look at the recently released Edinburgh Research Archive (ERA) [7], and will examine DSpace and the Tapir working together to provide Edinburgh's Institutional Repository.

Finally we look at how development of tools such as the Tapir can be sustained and what issues were encountered during the recent upgrade from DSpace 1.1.1 to DSpace 1.2. The future of the package is considered and some recommendations for the future are made.

What is DSpace?

DSpace was initially developed by Hewlett-Packard and MIT in collaboration [8]. The objective was to create a package that could provide an institutional repository which addressed the problem of digital preservation as a central theme. Since then there have been considerable changes to the process of development, and these changes are continuing. A number of individuals from institutions using DSpace have taken on the role of developers, and a community of interested parties has evolved who have started to feed code back into the core. This is part of the open-source development model into which DSpace is moving, and will be discussed further in the following section.

The application itself provides ways of capturing, storing, indexing, preserving and disseminating digital objects. In its primary role as an institutional repository package, these objects are the intellectual output of the institution when such output is in digital form. This can include, but is not limited to, research papers, conference papers, book chapters, datasets, learning objects and, of course, E-Theses.

Capture happens primarily through a Web interface which collects some metadata elements and manages file uploads; as a secondary means of capturing items there is also a bulk upload feature. Storage currently occurs as a combination of information in a relational database and a traditional file system, although plans to change this to enhance the preservation aspects are currently in motion. Lucene [9], a third-party Java search engine, provides the indexing and searching facilities that DSpace utilises. Exposure is provided in two ways: first, there is the traditional Web interface, fed from search and hierarchical browse facilities; second, the system exposes its metadata via the OAI-PMH [10].

Open Source Development, DSpace and the Tapir

There are various ways that open-source development can be achieved, and here we will briefly examine the model to which DSpace is attempting to adhere.

The source code is maintained in a version-controlled repository [11] on a publicly accessible server (there are a number of these available, but DSpace is hosted by SourceForge [12]). From this repository anyone may obtain a copy of the current source code in whatever state it is in. A number of individuals have administrative control over the versioning system, and these administrators are referred to as committers; their role is to action any changes to the core code. Only trusted developers should be given this access (by other trusted developers), and they will vet code submissions from contributors to see if they are ready to go into the source. Periodically a version of the source code will be declared as stable and it can be packaged up into an official release which can be downloaded and used by people not interested in working with the development copy.

At time of writing DSpace has yet fully to mature into this form of development, but the first steps have been taken.

Third-party developments such as the Tapir fit into this as follows. First, you must be willing to work from the bleeding-edge version of the source code; this often means having some space on a development machine of your own where you can regularly build and work with the most recent version.

It is also necessary to make a decision as to the developmental model that you will use for your own software. Will you write patches to existing source code and commit the changes to the versioning system, or will you write and maintain your own software pack that can be installed onto DSpace? At Edinburgh University Library we have chosen the latter approach for a number of reasons:

It was not necessarily anticipated that our developments would be of interest to the whole of the DSpace community.
The developmental model of DSpace at the time the project began was not as open as it is now.
The Tapir was primarily being developed for UK E-Theses sites, and development was not expected to move at the same speed as DSpace development.

For these reasons we have created our own SourceForge project [1] and maintain our own source code. The question then is whether we should be using the same developmental model as DSpace. Currently we are not, since the Tapir is a Theses Alive! project outcome and we have some interest in controlling the direction of development. This is not a problem at the moment since there is little code being contributed from other sources, but we are prepared to open development of the software further if there is a demand for it, and the licence allows for the code base to be forked.

If you choose to commit software directly back to the DSpace source code then you must either have committer status with the version control system or be in direct contact with a committer. There are a number of mailing lists that you can join to make this possible [13].

To create the Tapir we chose not only to build separately from the DSpace core but also to create our own Java archive (JAR) which would ultimately be installed in the DSpace library directory alongside other third-party tools such as Lucene or the JavaMail package. The advantage of this over integrating into the DSpace source is easier installation and maintenance.

At this point a brief note on licensing is also warranted: DSpace has been released under a BSD-style licence [14], which is a common standard open-source licence. The Tapir has also been released under a similar licence, with agreement from the University's legal advisors, in part to ensure that future integration with DSpace is feasible.

The Tapir

The objective of the Tapir was, as already mentioned, to provide E-Theses functionality to DSpace. We had previously performed a comparative evaluation of two packages to see which was most suitable for our E-Theses management system: see 'DSpace vs. ETD-db: Choosing software to manage electronic theses and dissertations' in Ariadne issue 38 [15]. Consequently some of the shortcomings of DSpace relative to ETD-db [16] were adopted as desired features for the Tapir.

The general feature list that we had in mind was as follows:

1. Allow supervisors to observe the work of students, to make changes, suggestions or comments prior to submission.
2. Collect the relevant metadata for an E-Thesis.
3. Allow E-Thesis metadata collection and item submission to be done separately to other research material's metadata collection.
4. Allow for easy identification of the type of content in the institutional repository (e.g. E-Thesis, E-Print etc).
5. Provide a metadata export facility for services not using OAI-PMH.

This is, of course, in addition to the administrative procedures that were to be developed along with the other project aims from Theses Alive!

At time of writing features 1 - 4 are well developed and 5 is in the pipeline. We now examine the way in which the first four of these features were developed in order to provide some insight into both the nature of the Tapir and the developmental methods that were employed. Later we will look at how this could have been improved both from the point of view of Tapir development and the DSpace architecture.

Supervision Orders

This section discusses the way in which we designed and built the supervision orders facilities, and is the longest section because it introduces most of the issues surrounding developing for DSpace, and tackles them where relevant.

The full range of functionality provided by the supervision order system is as follows:

1. Collaborative workspace in which items that are in the process of being submitted appear in the supervisor's private workspace. This is good for integrating E-Theses into a traditional institutional repository because the supervisor can also be simultaneously authoring other documents in this workspace and be supervising more than one thesis (see Figure 8).
2. Tools for supervisors with insufficient privileges to edit a student's submission to be able to observe the ongoing work. This is the sort of functionality that might be required by external advisors.
3. Tools to allow online, recorded, communication between students and supervisors. At a later time it should be possible to decide whether these notes form an interesting part of the submission or can be discarded.
4. A system to administer supervision orders that provide the above functionality. This includes an authorisation tool to provide different types of supervision.

To integrate this into the initial DSpace system it is necessary to understand where this functionality can be inserted. For feature 1: in DSpace there is a 'My DSpace' section from which logged-in users can manage their digital items, and this is the obvious place to insert the user interface (UI) for the functionality. In fact, in 'My DSpace' there already exists an 'In Progress Submissions' area in which items part way through submission reside while the author is logged out of the system. This is effectively exactly what an item in our desired collaborative workspace will be, so we can see straight away the point at which we will insert our own functionality.

The underlying object structure out of which DSpace is built models items that exist in the 'In Progress Submissions' section as 'Workspace Items' (org.dspace.content.WorkspaceItem) which behave slightly differently to normal items. Nonetheless, the Workspace Item is a type of Item as far as many of its properties are concerned, and this means that some operations which can be applied to items can also be applied to workspace items; for example, and important in this case, the application of authorisation policies.

Since there is nothing to prevent us making any Workspace Item visible to anyone using the application, all we need to do is provide UI tools for the collaborative workspace, then decide who is allowed to look at any particular Workspace Item via the authorisation policies. We therefore create a linking database table where we join EPerson Groups (collections of authenticated system users) to specific Workspace Items (See Figure 1). Manipulation of this table will be the goal of feature 4.

Figure 1: Basic relationship between EPerson Groups and Workspace Items

Now we can easily obtain, for any single EPerson's 'My DSpace', all Workspace Items with which that EPerson is associated through group membership.
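As a rough illustration of the lookup just described, the sketch below models the linking table in memory. It is not the Tapir's code or schema: the class name SupervisionLinks, the integer ids and the method names are all assumptions made for the example, and a real deployment would hold these rows in the DSpace relational database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for the EPerson Group to Workspace Item linking table. */
final class SupervisionLinks {

    /** One row of the linking table: a group id joined to a workspace item id. */
    static final class Link {
        final int groupId;
        final int workspaceItemId;

        Link(int groupId, int workspaceItemId) {
            this.groupId = groupId;
            this.workspaceItemId = workspaceItemId;
        }
    }

    private final List<Link> rows = new ArrayList<>();

    /** Adds a row joining a group to a workspace item. */
    void link(int groupId, int workspaceItemId) {
        rows.add(new Link(groupId, workspaceItemId));
    }

    /**
     * Returns the ids of every workspace item linked to at least one of the
     * groups the EPerson belongs to, without duplicates.
     */
    Set<Integer> workspaceItemsFor(Set<Integer> groupIdsOfEPerson) {
        Set<Integer> result = new LinkedHashSet<>();
        for (Link row : rows) {
            if (groupIdsOfEPerson.contains(row.groupId)) {
                result.add(row.workspaceItemId);
            }
        }
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) {
        SupervisionLinks links = new SupervisionLinks();
        links.link(10, 501);   // group 10 supervises items 501 and 502
        links.link(10, 502);
        links.link(20, 502);   // group 20 also supervises item 502
        links.link(30, 777);   // a group this user is not in

        Set<Integer> myGroups = new LinkedHashSet<>();
        myGroups.add(10);
        myGroups.add(20);
        System.out.println(links.workspaceItemsFor(myGroups));
    }
}
```

Because an item can be reached through more than one group, the result is collected into a set so that it is listed only once in the supervisor's workspace.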

We authorise the supervisors to open the item by simply applying policies using the tool developed for feature 4 above to the item from which the Workspace Item is derived. The policies that we define are as follows:

None - The supervisors have no authorisations concerning the workspace item. This is only useful if you intend to configure your own policies manually.
Editor - The supervisors have full editorial control over the item. This gives them precisely the same authorisations as the owner of the item. This would also form the basis of a truly collaborative workspace.
Observer - The supervisors may only observe the metadata and files of the item, but cannot make changes. This is the driving force behind feature 2.
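The three policy types above can be pictured as sets of permitted actions. The following is only a sketch under stated assumptions: the Action values, the PolicyType enum and the viewFor helper are hypothetical names invented for this example and are not the DSpace authorisation API, and the "no-access" outcome for None is a simplification (a site may configure its own policies manually in that case).

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the supervision policy types and the actions each grants. */
final class SupervisionPolicies {

    /** Illustrative actions only; DSpace's real authorisation constants are not reproduced here. */
    enum Action { READ, WRITE }

    enum PolicyType {
        NONE(EnumSet.noneOf(Action.class)),      // nothing granted automatically
        EDITOR(EnumSet.allOf(Action.class)),     // same rights as the item's owner
        OBSERVER(EnumSet.of(Action.READ));       // may look but not change

        private final Set<Action> granted;

        PolicyType(Set<Action> granted) {
            this.granted = Collections.unmodifiableSet(granted);
        }

        boolean permits(Action action) {
            return granted.contains(action);
        }
    }

    /** Picks which kind of interface a supervisor would be offered for a workspace item. */
    static String viewFor(PolicyType type) {
        if (type.permits(Action.WRITE)) {
            return "edit";
        }
        if (type.permits(Action.READ)) {
            return "read-only";
        }
        return "no-access"; // assumption: with NONE nothing is offered unless policies are set by hand
    }

    public static void main(String[] args) {
        for (PolicyType type : PolicyType.values()) {
            System.out.println(type + " -> " + viewFor(type));
        }
    }
}
```

The read-only branch is the case that needs a separate viewing interface in the workspace, since the normal editing screens cannot be offered to an Observer.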

We can extend Figure 1 to show how all this is related inside the system, as depicted in Figure 2.

Figure 2: Advanced relationship between Groups and Workspace Items

The fact that we can now prevent groups from editing items, whilst still giving them permission to read items, means that we must make allowances for a different sort of interface in the workspace. The obvious candidate for this is the standard item viewer in DSpace, but the problem with it is that it requires the item to have a handle [17], which our workspace items do not yet possess.

In order to maintain our development model it is necessary, unfortunately, for us to duplicate a large body of DSpace code in the Tapir in order to modify some of the functionality. We make a modified copy of the item viewer that can see workspace items, but still fulfils the requirements to function within the workspace.

So we provide a basic interface for feature 2, but we must provide an entirely new area of the system for feature 3. This is relatively straightforward as we can easily deploy new servlets [18], which are the crux of the DSpace users' interaction with the core system. In order to explain how these servlets work the notes system (Tapir v0.4) is diagrammatically explained in Figure 3:

Figure 3: Basic representation of a specific servlet as an example of how servlets work

The NotesServlet looks at the request type that the client has made, and maps it onto one of a number of operations that it can perform, runs the relevant procedures and returns the final result. This is also the general form of servlet behaviour.
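The dispatch pattern described here can be sketched without the Servlet API at all. The example below is an assumption-laden illustration rather than the NotesServlet itself: the class NotesDispatcher, the request types "add" and "list", and the "text" parameter are invented names, and a real servlet would read them from the HTTP request and forward to a JSP instead of returning a string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Servlet-style dispatcher sketch; it deliberately does not use the real Servlet API. */
final class NotesDispatcher {

    /** Request type mapped to the operation that handles it. */
    private final Map<String, Function<Map<String, String>, String>> operations = new LinkedHashMap<>();

    /** Notes recorded so far (a stand-in for persistent storage). */
    private final List<String> notes = new ArrayList<>();

    NotesDispatcher() {
        operations.put("add", params -> {
            notes.add(params.get("text"));
            return "added note " + notes.size();
        });
        operations.put("list", params -> String.join("; ", notes));
    }

    /** Looks at the request type, runs the matching operation and returns its result. */
    String handle(String requestType, Map<String, String> params) {
        Function<Map<String, String>, String> operation = operations.get(requestType);
        if (operation == null) {
            return "error: unknown request type '" + requestType + "'";
        }
        return operation.apply(params);
    }

    public static void main(String[] args) {
        NotesDispatcher dispatcher = new NotesDispatcher();
        Map<String, String> params = new HashMap<>();
        params.put("text", "Chapter 2 needs more references");
        System.out.println(dispatcher.handle("add", params));
        System.out.println(dispatcher.handle("list", new HashMap<String, String>()));
    }
}
```

Keeping the mapping in a table means a new operation can be supported by registering one more entry, without touching the dispatch logic.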

The final thing that really needs explanation is how we embed these extra facilities into the DSpace application as a whole. We are fortunate to have our own set of core classes which are easily included in the DSpace library, and we have a lot of our own custom user interface components (JSPs [19]) which stand by themselves, and are therefore also easily included in the DSpace Web application.

The core of our problem is that we wish to include access via 'My DSpace', which is part of the DSpace user interface package, and which consists of a large number of components that deal with various parts of the system. The information for the interface is pre-prepared by its servlet (org.dspace.app.webui.servlet.MyDSpaceServlet) before the UI components see any of the information. This, in itself, is not a bad thing; generally speaking it is essential that the user interface and the core logic of any application remain separate [20]. Problems arise when you wish to plug additional components into your UI without modifying any of the core code of the system.

We took the decision to maintain our code separately from the DSpace code for reasons previously discussed. The down-side is that it is then impossible to leverage core functionality in the native user interface without resorting to more obscure and less desirable methods. For example, the Tapir has to wait until DSpace has completed its business when loading the 'My DSpace' page before making a call from the Java Server Page (JSP) to a class which it needs to run before it can output its own interface sections. This is an unpleasant fix which can only be tolerated due to the fact that any other solution is contrary to our general development method and introduces potentially worse maintenance issues. Other solutions are currently being considered.

Toolkits such as Apache Cocoon exist to alleviate these problems in general, and future versions of DSpace will employ that sort of technology to make modular user interface components easier to incorporate.

E-Thesis Metadata

The UK recommended E-Theses metadata set [21] has formed the basis for the Tapir's theses submission system's metadata collection section. This metadata set was developed in collaboration with the Robert Gordon University (Electronic Theses Project [22]), the University of Glasgow (DAEDALUS [23]) and the British Library [24]. Devising a new submission system for DSpace is straightforward enough, although there is a strong case to be made for redesigning the entire submission system to cater for customisable metadata (at time of writing some development has begun in this area at MIT).

The metadata set will not be discussed in detail here, but some of the more interesting recommendations are summarised in Table 1.

Field name	Element and Qualifier	Populated by	Repeatable	Required
Author	contributor.author	Student	No	Yes
Supervisor/ Advisor	contributor.advisor	Student	Yes	No
Institution, College, School	publisher	Default maintained by institution	Yes	Yes
Type, Qualification Level, Qualification Name	type	Student	No	Yes

Table 1: Some metadata elements used in the Tapir
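To make the 'Repeatable' and 'Required' columns of Table 1 concrete, the sketch below checks a set of submitted values against those four rows. It is an illustration only: the class ThesisMetadataCheck and its FieldRule type are hypothetical, the element names are taken from the table, and the Tapir's actual submission code does not necessarily validate in this way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical validator for the four metadata elements listed in Table 1. */
final class ThesisMetadataCheck {

    /** One row of Table 1: element name plus its repeatable and required flags. */
    static final class FieldRule {
        final String element;
        final boolean repeatable;
        final boolean required;

        FieldRule(String element, boolean repeatable, boolean required) {
            this.element = element;
            this.repeatable = repeatable;
            this.required = required;
        }
    }

    static final List<FieldRule> RULES = Collections.unmodifiableList(Arrays.asList(
            new FieldRule("contributor.author", false, true),
            new FieldRule("contributor.advisor", true, false),
            new FieldRule("publisher", true, true),
            new FieldRule("type", false, true)));

    /** Returns a list of problems; an empty list means the values satisfy Table 1. */
    static List<String> validate(Map<String, List<String>> values) {
        List<String> problems = new ArrayList<>();
        for (FieldRule rule : RULES) {
            List<String> supplied = values.getOrDefault(rule.element, Collections.<String>emptyList());
            if (rule.required && supplied.isEmpty()) {
                problems.add(rule.element + " is required");
            }
            if (!rule.repeatable && supplied.size() > 1) {
                problems.add(rule.element + " is not repeatable");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, List<String>> values = new HashMap<>();
        values.put("contributor.author", Arrays.asList("A. Student", "B. Student"));
        values.put("publisher", Arrays.asList("University of Edinburgh"));
        System.out.println(validate(values));
    }
}
```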

Of particular note is the 'Institution, College, School' element where we have provided the facilities for authority controlled values which cannot be edited by the submitter as well as free-text values that can (See Figure 4). The purpose of this is to ensure that collections can enforce the correct values for their own purposes but if awards have been obtained jointly with other institutions these can be added by the submitter.

Figure 4: The Institution, College and School of the submitter
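One way to picture a field that mixes authority-controlled and free-text values is sketched below. This is a minimal sketch, assuming a hypothetical PublisherField class; the way the Tapir actually stores the collection defaults is not described in this article, and the institution names in the example are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical field holding locked collection defaults plus submitter-supplied values. */
final class PublisherField {

    /** Values set by the collection; the submitter cannot change these. */
    private final List<String> controlled;

    /** Values typed in by the submitter, for example a jointly awarding institution. */
    private final List<String> freeText = new ArrayList<>();

    PublisherField(List<String> collectionDefaults) {
        this.controlled = Collections.unmodifiableList(new ArrayList<>(collectionDefaults));
    }

    /** Adds a free-text value, ignoring blanks and values already present. */
    void addFreeText(String value) {
        String trimmed = value == null ? "" : value.trim();
        if (!trimmed.isEmpty() && !controlled.contains(trimmed) && !freeText.contains(trimmed)) {
            freeText.add(trimmed);
        }
    }

    /** Removes a free-text value; controlled values are never removed. Returns true if something was removed. */
    boolean removeFreeText(String value) {
        return freeText.remove(value);
    }

    /** All values in display order: controlled first, then free text. */
    List<String> allValues() {
        List<String> all = new ArrayList<>(controlled);
        all.addAll(freeText);
        return Collections.unmodifiableList(all);
    }

    public static void main(String[] args) {
        PublisherField field = new PublisherField(Arrays.asList("University of Edinburgh", "Example School"));
        field.addFreeText("Another University");
        field.removeFreeText("University of Edinburgh"); // has no effect: value is controlled
        System.out.println(field.allValues());
    }
}
```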

E-Theses Submission System

The E-Theses submission system has a number of objectives to fulfil:

1. Sit alongside one or more other submission systems
2. Collect the metadata discussed in the previous section
3. Apply a multi-part licence to the item
4. Apply 'physical' restrictions to item access where necessary

Feature 2 is relatively straightforward once you are familiar with the DSpace API (Application Programming Interface) for reading and writing metadata. All that is required is a modification of the user interface and the underlying servlet. Things only need to become more complicated if you wish to have more sophisticated elements such as our authority-controlled and simultaneously user-editable 'Institution, College, School' field.

Feature 1 is effectively a meta-feature for this section. We need to provide a choice between one or more submission systems. Currently all requests to start a submission go to DSpace's SubmitServlet (org.dspace.app.webui.servlet.SubmitServlet), so the obvious course of action is to replace this with something which provides all of the required interaction as well as the facility to choose between submission servlets. This includes dealing with the possibility that users may wish to start a submission from within a collection and that they may wish to suspend submission for a short period of time (where the item rests in the workspace), then resume later on. The system behaves as shown in Figure 5.

Figure 5: Choosing and deploying a submission interface

So our submit servlet (ac.ed.dspace.app.webui.SelectSubmitServlet) replaces the current DSpace submit servlet, takes all the same requests as the original, but processes them in a different way. It also provides additional facilities to choose between different submission engines. In this way it is possible to have arbitrarily many submission engines deployed with minor modifications of the servlet.
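The selection step can be sketched as a small registry of submission engines. Everything named here is an assumption for illustration: the SubmissionSelector class, the engine keys and the servlet paths are hypothetical, and remembering the chosen engine against the workspace item is just one plausible way to support suspending and resuming a submission, not necessarily how the Tapir does it.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical registry that routes a submission to one of several submission engines. */
final class SubmissionSelector {

    /** Engine key mapped to the (hypothetical) path of the servlet implementing that engine. */
    private final Map<String, String> engines = new LinkedHashMap<>();

    /** Workspace item id mapped to the engine chosen for it, so a suspended submission can resume. */
    private final Map<Integer, String> inProgress = new HashMap<>();

    /** Deploying another engine only requires another registration. */
    void register(String engineKey, String servletPath) {
        engines.put(engineKey, servletPath);
    }

    /** The engines a user may choose between when starting a submission. */
    Set<String> choices() {
        return engines.keySet();
    }

    /** Starts a submission with the chosen engine and returns where to send the user. */
    String start(int workspaceItemId, String engineKey) {
        String path = engines.get(engineKey);
        if (path == null) {
            throw new IllegalArgumentException("no such submission engine: " + engineKey);
        }
        inProgress.put(workspaceItemId, engineKey);
        return path;
    }

    /** Resumes a suspended submission in the engine it was started with. */
    String resume(int workspaceItemId) {
        String engineKey = inProgress.get(workspaceItemId);
        if (engineKey == null) {
            throw new IllegalStateException("no submission in progress for item " + workspaceItemId);
        }
        return engines.get(engineKey);
    }

    public static void main(String[] args) {
        SubmissionSelector selector = new SubmissionSelector();
        selector.register("e-thesis", "/submit-thesis");
        selector.register("e-print", "/submit-eprint");
        System.out.println(selector.choices());
        System.out.println(selector.start(42, "e-thesis"));
        System.out.println(selector.resume(42));
    }
}
```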

Substituting our servlet for the DSpace servlet is relatively straightforward, as a single configuration file maintains the mapping from requested URLs to servlets. This mapping is in a basic XML file and we replace org.dspace.app.webui.servlet.SubmitServlet with ac.ed.dspace.app.webui.servlet.SelectSubmitServlet to substitute in all of the above behaviour.

Lastly we move onto solving the licensing and restrictions features 3 and 4, which are intrinsically linked. There are two important concepts that need to be considered when dealing with this: first, that there are three parties involved in licensing (the submitter, the institution and the end-user); second, that restrictions are not necessarily absolute (they may have time or domain dependencies). We consider both of these when designing the Tapir, and the following is a description of how the multi-part licences we use work with our restriction system.

First we define three licences: a deposit licence, which gives the repository administrators the rights they need to hold and maintain the item; a use licence, which gives the end-users the rights they need to use the item in a reasonable manner; a restriction licence, which gives the submitter some control over the availability of the item. Implementation of these licences is site-specific but the Tapir ships with the defaults that we use at Edinburgh University Library. Next we define six restrictions that are available to depositors:

None - no restriction on access
Domain, 1 year - restricted to institutional domain for 1 year
Domain, 2 years - restricted to institutional domain for 2 years
Withheld, 1 year - restricted from all for 1 year
Withheld, 2 years - restricted from all for 2 years
Withheld permanently - restricted from all forever

The process of building the licence is described in Figure 6. All items have a deposit licence. Time-dependent licences are then applied to domain-restricted and non-permanently restricted items, which also have a restriction licence appended. For non-restricted items the simple end-user (use) licence is appended instead.

Figure 6: Procedure flow for constructing multi-part licences
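The assembly rules can be sketched in code as follows. This is a simplified reading of the procedure, not the Tapir's implementation: the LicenceBuilder class, the Restriction enum and the part labels are hypothetical, and the handling of the permanently withheld case (restriction licence plus deposit licence) is an assumption, since the text does not spell it out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical assembler for multi-part licences, following the six restriction options. */
final class LicenceBuilder {

    enum Restriction {
        NONE(0, false, false),
        DOMAIN_1_YEAR(1, true, false),
        DOMAIN_2_YEARS(2, true, false),
        WITHHELD_1_YEAR(1, false, false),
        WITHHELD_2_YEARS(2, false, false),
        WITHHELD_PERMANENTLY(0, false, true);

        final int years;           // length of the time-limited restriction, 0 if none
        final boolean domainOnly;  // true if access is limited to the institutional domain
        final boolean permanent;   // true if the item is withheld forever

        Restriction(int years, boolean domainOnly, boolean permanent) {
            this.years = years;
            this.domainOnly = domainOnly;
            this.permanent = permanent;
        }

        boolean timeLimited() {
            return years > 0;
        }
    }

    /** Returns the labelled parts of the licence in the order they would appear in the file. */
    static List<String> parts(Restriction restriction, LocalDate deposited) {
        List<String> parts = new ArrayList<>();
        if (restriction.timeLimited()) {
            LocalDate expires = deposited.plusYears(restriction.years);
            String scope = restriction.domainOnly ? "institutional domain only" : "no users";
            parts.add("Time Dependencies: Permanent Licence limited to " + scope + " until " + expires);
            parts.add("Permanent Licence: use licence");
            parts.add("Temporary Licence: restriction licence");
        } else if (restriction.permanent) {
            parts.add("Restriction Licence"); // assumption: no use licence when withheld forever
        } else {
            parts.add("Use Licence");
        }
        parts.add("Site Licence: deposit licence"); // every item carries the deposit licence
        return parts;
    }

    public static void main(String[] args) {
        for (String part : parts(Restriction.DOMAIN_2_YEARS, LocalDate.of(2004, 3, 30))) {
            System.out.println(part);
        }
    }
}
```

For a two-year domain restriction deposited on 30 March 2004, the expiry date computed here is 30 March 2006.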

The following is a stripped down example of a full multi-part licence submitted on 30 March 2004:

=========================================================
Time Dependencies on this Licence
=========================================================

The 'Permanent Licence' contained within this file is limited in scope for 2 years. Until 30th March 2006 the licence applies only to users who exist within the University of Edinburgh domain

For all users outside ed.ac.uk (Edinburgh University) the 'Temporary Licence' contained within this file applies.
=========================================================

=========================================================
| Permanent Licence |
=========================================================

Standard Use Licence (e.g. Creative Commons)

=========================================================

=========================================================
| Temporary Licence |
=========================================================

Standard Restriction Licence

=========================================================

=========================================================
| Site Licence |
=========================================================

Licence Required by site to hold item
=========================================================

The choice of restriction also affects the way that 'physical' restrictions apply. As of Tapir v0.4 restrictions are applied automatically for all options other than domain restriction (the issues surrounding domain restriction are still under investigation at time of writing). To provide the restrictions required we simply 'withdraw' the item from the repository, allowing it to exist without being available to any users other than administrators.

Item Type Identification

It was identified fairly early on that academics were interested in maintaining at least some distance between theses and research papers, suggesting that in some situations theses were 'research training' and not necessarily research quality. Without commenting on this, we have chosen to make some minor modifications to the DSpace interface to ensure there is a quick and easy method of identifying all research types that enter the repository. In addition we have removed thesis supervisors from the author listings for the item, preferring to list them on the item metadata page instead.

In order to achieve the desired customisations it was necessary for us to replace the DSpace item listing and item metadata classes (org.dspace.app.webui.jsptag.ItemListTag and org.dspace.app.webui.jsptag.ItemTag respectively) with custom local files; these are compiled Java classes with display code built in. This is not ideal as it makes localisation more complex and harder to maintain. On the other hand, the tag library approach makes it easy to replace one tag with another in a single XML config file.

Our new item listing, then, simply contains an extra column displaying the item type alongside the title and authors (authors' names have been removed), as shown in Figure 7.

Figure 7: An extract from our new item listing page

Correspondingly the new item metadata page copes with the E-Theses metadata set correctly, and divides up contributors into their relevant groups (authors, supervisors, advisors).

The Edinburgh Research Archive

To show the Tapir in a typical setting we introduce the Edinburgh Research Archive (ERA), which is an institutional repository service run by Edinburgh University Library. The primary aims of ERA are to hold the outcomes of the Theses Alive! and SHERPA [25] projects, being E-Theses and E-Prints respectively. In addition to this, evaluation of DSpace, and thus ERA, is being conducted in contexts as diverse as learning objects and conference posters.

Currently ERA is DSpace 1.1.1 using Tapir 0.3 (along with some additional local customisations), although an upgrade to the most recent versions of these two pieces of software is imminent. Figure 8 shows how we have integrated DSpace into the general 'look and feel' of the Library Web site, and you will notice also the 'WorkSpace', which is a product of the Tapir.

Figure 8: ERA User's homepage

Notice that the user is both authoring and supervising items in the same workspace, and has the same options for both items even though they are only responsible for originating one of them.

What follows is a set of screenshots of ERA, highlighting the functionality that the Tapir has added to DSpace. Figure 9 shows the various options available to the user in their collaborative workspace. Figure 10 shows an extract from the licence page, where users select their restriction level. Figure 11 shows part of the administrative interface for configuring supervision orders.

Figure 9: Collaborative workspace options

Figure 10: The licence restriction options available to the submitter

Figure 11: Administrative tools for applying supervision orders

Cross-Version Code Maintenance

Maintaining software that is dependent on other software, such as the Tapir, always brings with it maintenance issues. Sometimes problems arise when API changes in the host software require code changes in the agent. Other problems arise when the agent has supplanted host functionality with its own, but then host functionality has been upgraded and the agent would like to take advantage of the improvements.

Examples of both these situations can be found when considering the upgrade of DSpace from v1.1.1 to v1.2 and ensuring that the Tapir is compatible with the latter. During the development of DSpace 1.2 it was decided that upload of files to the server required additional information, and thus the operation called to achieve this was modified to use another parameter. The effect of this is that requests to upload files by the Tapir's submission system were no longer compatible with the storage layer in DSpace. On another occasion, significant functionality had been added to the default DSpace submission system to deal with HTML files and their relationships. This functionality was desirable in the Tapir, and so had to be included in the custom E-Theses and E-Prints submission systems.

Long-term solutions to these sorts of problems are difficult, although a more modular architecture would certainly be an asset for third-party developers. In addition it could be argued that some elements of DSpace are too large and not specific enough. Breaking down some of the current 'modules' would be an advantage for people only wishing to replace small areas of functionality. For example, the submission system could be split into sections: metadata, file management, licensing, verification and commitment, each of which could interoperate with the other modules. In this case, then, the Tapir would only need to override the metadata and licensing sections of the submission system.
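The kind of modular submission system suggested here might look something like the following sketch. It is purely illustrative and describes neither DSpace 1.2 nor the proposed 2.x architecture: the ModularSubmission class, the Step interface and the string results are hypothetical, with the step names taken from the sections listed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical submission pipeline in which individual sections can be replaced. */
final class ModularSubmission {

    /** One section of the submission process. */
    interface Step {
        String run(String itemTitle);
    }

    /** Sections in the order they run; a LinkedHashMap keeps a key's position when its value is replaced. */
    private final Map<String, Step> steps = new LinkedHashMap<>();

    ModularSubmission() {
        String[] names = {"metadata", "file management", "licensing", "verification", "commitment"};
        for (String name : names) {
            steps.put(name, title -> "default " + name + " for " + title);
        }
    }

    /** Replaces one section, leaving every other section as the host system supplied it. */
    void override(String name, Step replacement) {
        if (!steps.containsKey(name)) {
            throw new IllegalArgumentException("unknown step: " + name);
        }
        steps.put(name, replacement);
    }

    /** Runs every section in order and collects the results. */
    List<String> submit(String itemTitle) {
        List<String> log = new ArrayList<>();
        for (Step step : steps.values()) {
            log.add(step.run(itemTitle));
        }
        return log;
    }

    public static void main(String[] args) {
        ModularSubmission submission = new ModularSubmission();
        // A third-party plugin overrides only the two sections it cares about.
        submission.override("metadata", title -> "plugin metadata for " + title);
        submission.override("licensing", title -> "plugin licensing for " + title);
        for (String line : submission.submit("My Thesis")) {
            System.out.println(line);
        }
    }
}
```

With this shape, an improvement to the host's file management section would reach the plugin automatically, because the plugin never replaced that section.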

DSpace development is proceeding towards such a modular configuration, with the ultimate goal being the 2.x architecture [26]. The path from the current state to the target state, though, is long and we can expect to see software such as the Tapir adapting regularly to remain current.

Conclusion

The Tapir has addressed most of the objectives for which it was originally specified, and the hope is that much of the code which drives it will eventually make its way into the DSpace source code. Now that DSpace 1.2 has been released, and 1.2.1 is imminent, the community-driven development effort seems to be gaining some momentum, and there may well be room for developments such as the Tapir or other work such as DSpace@Cambridge [27], North Carolina SILS-ETD [28], or SIMILE [29] to name but a few (information on other projects is available on the DSpace Wiki [30]), to become part of the main source code.

The Theses Alive! project, meanwhile, draws to a close, but Edinburgh University Library has already made the decision to continue to work with DSpace, and there is the possibility that additional developments will be made along the way as uses for the software are investigated and implemented.

References

[1] Tapir SourceForge page: http://sourceforge.net/projects/tapir-eul/
[2] Edinburgh University Library: http://www.lib.ed.ac.uk/
[3] DSpace: http://www.dspace.org/
[4] Theses Alive! project home page: http://www.thesesalive.ac.uk/
[5] Joint Information Systems Committee (JISC) home page: http://www.jisc.ac.uk/
[6] JISC FAIR Programme: http://www.jisc.ac.uk/index.cfm?name=programme_fair
[7] Edinburgh Research Archive (ERA): http://www.era.lib.ed.ac.uk/
[8] The HP-MIT Alliance: http://www.hpl.hp.com/mit/
[9] Lucene Java search engine: http://jakarta.apache.org/lucene/
[10] Open Archives Initiative, originators of the OAI-PMH: http://www.openarchives.org/
[11] Some example versioning systems: Concurrent Versions System: https://www.cvshome.org/; Subversion: http://subversion.tigris.org/
[12] SourceForge, open-source development support: http://sourceforge.net/
[13] DSpace mailing lists for technical information: General Technical List: dspace-tech@lists.sourceforge.net; Developers List: dspace-devel@lists.sourceforge.net
[14] The Berkeley Software Distribution (BSD) licence, examples of which are available: http://www.opensource.org/licenses/bsd-license.php
[15] DSpace vs. ETD-db: Choosing software to manage electronic theses and dissertations, January 2004, Ariadne, Issue 38: http://www.ariadne.ac.uk/issue38/jones/
[16] Virginia Tech ETD-db system: http://scholar.lib.vt.edu/ETD-db/
[17] CNRI Handle System: http://www.handle.net/
[18] Java Servlet Technology: http://java.sun.com/products/servlet/
[19] Java Server Pages (JSP) Technology: http://java.sun.com/products/jsp/
[20] Model-View-Controller pattern: http://java.sun.com/blueprints/patterns/MVC-detailed.html
[21] UK Recommended E-Theses metadata set: http://www2.rgu.ac.uk/library/guidelines/metadata.html
[22] Electronic Theses project at the Robert Gordon University: http://www2.rgu.ac.uk/library/e-theses.htm
[23] DAEDALUS Project, University of Glasgow: http://www.lib.gla.ac.uk/daedalus/
[24] The British Library Theses/Dissertations: http://www.bl.uk/services/document/theses.html
[25] SHERPA Project: http://www.sherpa.ac.uk/
[26] DSpace 2.x proposed architecture: http://www.dspace.org/conference/agenda.html and http://www.dspace.org/conference/presentations/architecture.ppt
[27] DSpace@Cambridge: http://www.lib.cam.ac.uk/dspace/
[28] Hemminger, B., Fox, J., Ni, M. (2004). "Improving the ETD submission process through automated author self contribution using DSpace", ETD 2004, Lexington, KY: http://ils.unc.edu/bmh/pubs/ETD%202004%20paper.pdf Also see: http://etd.ils.unc.edu/
[29] Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE): http://simile.mit.edu/
[30] DSpace Wiki: http://wiki.dspace.org/

Author Details

Richard Jones
Systems Developer
Edinburgh University Library

Email: r.d.jones@ed.ac.uk
Web site: http://www.thesesalive.ac.uk/
