Web Magazine for Information Professionals

DSpace Vs. ETD-db: Choosing Software to Manage Electronic Theses and Dissertations

Richard Jones examines the similarities and differences between DSpace and ETD-db to determine their applicability in a modern E-theses service.

The Theses Alive! [1] Project, based at Edinburgh University Library and funded under the JISC Fair Programme [2], is aiming to produce, among other things, a software solution for institutions in the UK to implement their own E-theses or Electronic Theses and Dissertations (ETD) online submission system and repository. In order to achieve this it has been necessary to examine existing packages that may provide all or part of the solution we desire before considering what extra development we may need to do.

We evaluated two open source packages to deliver E-theses functionality via a Web-based interface: ETD-db [3] by Virginia Tech, and DSpace [4] written in partnership between Hewlett-Packard (HP) and the Massachusetts Institute of Technology (MIT). A direct comparison is hard as each package is driven by different motivations: ETD-db is specifically designed for E-theses, containing a 'workspace' for supervised authoring of documents, and a thesis-specific metadata set; DSpace has been developed to aid the creation of institutional repositories, with the emphasis more on post-submission workflows and potential digital preservation for a variety of document types.

The DAEDALUS [5] Project at the University of Glasgow has provided us with a round-up of initial experiences with DSpace and another open-source institutional repository package called EPrints.org [6] and notes that: 'They have much in common and the choice of which, or both, or neither, will hinge on a range of local factors' [7]. This study therefore considers, in part, whether one of these popular institutional repository packages is better suited to E-theses than software written specifically for the job.

This comparison will look at some of the common elements between these packages and draw conclusions on which is the better in each area. In addition, it will look at how difficult it will be to modify each of the packages to provide an E-theses service for the UK. This analysis will be considered alongside the medium-term future of each package as it is developed, as well as the scope for expansion that each package has within the library and the university itself. We will spend most time considering elements particularly relevant to E-theses such as the metadata elements and submission flow, as well as essential areas such as security and administration.

An Introduction to the Software

ETD-db

ETD-db has been developed by one or two developers at Virginia Tech, and endorsed by the Networked Digital Library of Theses and Dissertations (NDLTD) [8]. As of February 2002 development of the official release of this package paused at version 1.7c, but it is still used as their ETD submission, archive and search tool; there are suggestions now that the public development of this product will resume in the near future. Currently this is the most widespread E-theses package in use, in part due to the support it has from the NDLTD. Despite this, there is currently little directional development, with some institutions choosing to install the "out-of-the-box" version, while others make their own changes to the system, which are not easily available to the general community.

ETD-db depends upon the Perl [9] programming language and the MySQL [10] open source database system. Perl is native to most Linux and Unix installs, and MySQL is also very common. In addition to the standard Perl installation, it is also necessary to install additional 'Perl Modules' which enhance the functionality of the language. It requires a reasonably experienced systems administrator to do the prerequisite installation.

DSpace

DSpace has been developed in partnership between HP and MIT. Development is still very much in progress, but as institutional repository software DSpace is making its mark, with an increasing number of institutions around the globe installing, evaluating and using the package.

Currently, the original developers undertake most of the core development, but a growing technical user base is generating suggestions for future releases as well as producing some add-on modules. In addition the DSpace Federation [11] is guiding the transition to a more community-wide open source development model, although this has yet to be finalised. The future of this package seems stable in the medium-term, although it is difficult to predict what the outcome of the federated approach will be.

DSpace depends upon the Java [12] programming language and the PostgreSQL [13] open source database system. It also requires a number of additional Java-based elements to be installed: Tomcat [14], which is a Java Web server; a number of Java code libraries; and Ant [15], a Java build tool. It is recommended that DSpace be installed on a Linux or a Unix machine. It requires an experienced systems administrator to do the prerequisite installation.

Comments

Based on the factors above, the choice that we will need to make can be broken down as follows:

  1. Having a relatively stable, basic package designed specifically for E-theses, but which requires a commitment on our part to patching and possibly supporting.
  2. Having a powerful, developmental package not specifically for E-theses, but which looks like it will be part of a global community for some time.

The remainder of this article will consider the properties of each package that may affect which of the above routes we consider to be the most appropriate.

Submission Procedures

Here we are mainly concerned with what metadata can be collected during submission, although it will also be valuable to see a submission procedure that is well laid out and has a logical flow. This comparison will also take into account how files are added to the repository, and the ease with which procedures can be customised.

Table 1 shows the main elements that the default submission procedures in each package collect. These are compared where possible and the quality of the comparison is commented on.

| ETD-db | DSpace | Comments |
|---|---|---|
| Abstract | Abstract | |
| Degree | Type | Qualified Dublin Core does not have an obvious place for the degree information. |
| Document Type | Type | This refers to PhD or Masters etc, and as above there is no obvious place to represent it. |
| Title | Title | |
| Keywords | Keywords | |
| Name | Author | DSpace allows for multiple authors. |
| Copyright Agreement | Licence | |
| Availability | | Security is done by directory in ETD-db and authorisation policy in DSpace. |
| Name of Committee Member | | ETD-db requests committee members' names, which is not UK-specific. |
| Title of Committee Member | | ETD-db requests committee members' titles, which is not UK-specific. |
| Email of Committee Member | | ETD-db requests committee members' emails, which is not UK-specific. |
| Defence/Viva Date | | Required for Viva date, and missing in DSpace. |
| Date | Date of Issue | DSpace does not specifically allow for the date of award, but date of issue should be applicable. |
| Department | Publisher | The pending recommended schema for UK E-theses will suggest that the degree awarder can be referred to as the publisher. |
| | Description | No obvious use for E-theses. |
| | Alternative Title | DSpace allows for multiple titles on submissions. |
| | Series/Report | No obvious use for E-theses. |
| | Identifier | Some standard identifier |
| | Language | ISO language reference |
| | File Format | Document Type/Format may be determined automatically in DSpace. |
| | Sponsors | The names of sponsors and/or funding codes associated with the submission. |
| | Citation | Useful for PhD by research publication. |

Table 1: Comparison of submission elements

Comparison

The question that we must answer is whether each package collects enough information for E-theses, and whether the data that is collected is extensible or flexible in any way. Table 1 shows which fields are analogous in each system and where discrepancies arise, with brief comments on how each system deals with the difference. For example, ETD-db collects the 'department' the student has studied in, whilst we feel that 'publisher' could be used to do the same job in DSpace.

ETD-db is designed specifically for theses management, so it collects the Defence or Viva date for each thesis - no such analogue exists within DSpace. Meanwhile, where ETD-db requests an availability level for the thesis, DSpace offers a far more sophisticated approach using an authorisation policy system built into the administration area. ETD-db also collects information regarding 'committee members', who are equivalent to examiners in the UK - DSpace provides no obvious analogue to this. In terms of document management, DSpace has a registry of file formats which it recognises and will store as part of the metadata. This allows the administrator to indicate to the user which file formats are supported by the repository.
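The idea of a file format registry can be illustrated with a minimal sketch. The registry entries, the support level names and the function names below are assumptions made for illustration only; they are not taken from the DSpace code or its default registry.

```python
from pathlib import PurePath

# Hypothetical registry: file extension -> (format name, support level).
# The entries and the level names are illustrative assumptions.
FORMAT_REGISTRY = {
    ".pdf": ("Adobe PDF", "supported"),
    ".xml": ("XML", "supported"),
    ".doc": ("Microsoft Word", "known"),
}


def identify_format(filename):
    """Return (format name, support level) for an uploaded file name."""
    extension = PurePath(filename).suffix.lower()
    return FORMAT_REGISTRY.get(extension, ("Unknown", "unsupported"))


def describe_upload(filename):
    """Build the message a submitter might be shown after uploading a file."""
    name, level = identify_format(filename)
    return f"{filename}: {name} ({level})"
```

Calling `describe_upload("chapter1.doc")` would report the format as recognised but not fully supported, which is the kind of feedback the registry makes possible at submission time.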

Both take file uploads via the Web interface, but only ETD-db provides the administrators with the option to give FTP access to users for file upload. DSpace does not support FTP because of the way that it stores files internally. Both systems allow multiple files, although DSpace adds the option to attach descriptive text to each file that is uploaded when there is more than one. When DSpace stores the copyright licence, it does so by including it with the files so that it may also be preserved. ETD-db's analogue is to store the copyright notice in the database with the metadata. The subtle difference in these two approaches may be important.

Table 2 and Table 3 illustrate the differences in the way that DSpace and ETD-db store their data in the database (they are not exact representations of the database structure). DSpace uses Qualified Dublin Core [16] to identify the stored data. This is a well-established basic metadata standard, and it is possible to represent the basic information of many types of digital object, especially when combined with some DSpace specific qualifiers. This overcomes a number of extensibility and flexibility problems that can arise when storing defined data. ETD-db, conversely, has a number of pre-determined database fields (guided by the ETD-MS [17] standard recommended by NDLTD), which require programmer intervention to alter, although the metadata schema that it uses is good for E-theses, being similar to an extended version of Qualified Dublin Core.

| Item ID | Element | Qualifier | Value |
|---|---|---|---|
| 1 | description | abstract | This is my abstract |
| 1 | title | null | This is my title |
| 1 | identifier | citation | This is my citation |
| 2 | description | abstract | This is another abstract |
| 2 | title | null | This is another title |

Table 2: DSpace data storage structure

| Item ID | abstract | title |
|---|---|---|
| 1 | This is my abstract | This is my title |
| 2 | This is another abstract | This is another title |

Table 3: ETD-db data storage structure
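The practical difference between the two layouts can be sketched in a few lines of Python. This is an illustration of the structures in Table 2 and Table 3 only, not of either package's real schema; the `CROSSWALK` mapping and the function names are hypothetical.

```python
# Fixed-column layout (as in Table 3): one pre-determined column per field.
fixed_rows = [
    {"item_id": 1, "abstract": "This is my abstract", "title": "This is my title"},
    {"item_id": 2, "abstract": "This is another abstract", "title": "This is another title"},
]

# Hypothetical crosswalk from fixed column names to (element, qualifier) pairs.
CROSSWALK = {
    "abstract": ("description", "abstract"),
    "title": ("title", None),
}


def to_qualified_rows(rows):
    """Convert fixed-column records into (item, element, qualifier, value) rows."""
    out = []
    for row in rows:
        for column, (element, qualifier) in CROSSWALK.items():
            out.append((row["item_id"], element, qualifier, row[column]))
    return out


# Row-per-value layout (as in Table 2).
qualified_rows = to_qualified_rows(fixed_rows)

# A new kind of metadata is just another row; no column has to be added.
qualified_rows.append((1, "identifier", "citation", "This is my citation"))


def values_for(rows, item_id, element, qualifier=None):
    """Look up every value stored for one element/qualifier of one item."""
    return [v for (i, e, q, v) in rows if (i, e, q) == (item_id, element, qualifier)]
```

In the fixed-column layout, recording a citation would mean altering the table definition and the code that reads it; in the row-per-value layout it is simply one more row, at the cost of slightly more involved lookups.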

It is worth noting that some of the fields DSpace collects allow for a significant array of submission types, which are not necessarily relevant for E-theses (e.g. Series/Report). These fields should not affect our opinion on which package is superior in this section. DSpace allows for multiple authors per submission also, and ETD-db does not allow for more than one submission per user simultaneously - but neither should be an issue since it is unlikely that students will be submitting more than one thesis at the same time, or that more than one student will be submitting the same thesis!

Conclusion

It is clear that DSpace has a more comprehensive metadata collection process (partly due to the collection of excess elements) and that it stores this metadata in a more flexible manner. Due to the customisable nature of the Dublin Core registry within DSpace and the option to modify the submission interface (although this is a job for a programmer), DSpace will take any data that can be represented within the qualified Dublin Core. ETD-db has no such flexibility and future changes in metadata schemas could cause significant problems, although the submission interface is no harder to modify than that of DSpace.

Archiving and Access

Both packages are, at some level, designed to make the archiving of certain types of digital resource quicker and easier, but these are not the only requirements. We wish to make the archive available via the Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH) [18], whereby 'data providers' (such as the institutional repository) expose their metadata to 'service providers' who harvest it to be used in cross-searching multiple repositories. Additionally, we would like to see an archive that is preservable, stable and secure. In this section, we will be looking briefly to see how each package addresses these issues.
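To make the harvesting model concrete, the sketch below builds the kind of request a service provider sends to a data provider under OAI-PMH v2.0. The verb `ListRecords` and the arguments `metadataPrefix`, `from` and `set` are defined by the protocol; the base URL and the function name are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a repository's OAI-PMH interface.
BASE_URL = "http://repository.example.ac.uk/oai/request"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    'oai_dc' (unqualified Dublin Core) is the metadata format that every
    OAI-PMH v2.0 data provider is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one set, if sets are offered
    return f"{base_url}?{urlencode(params)}"
```

A harvester would fetch this URL and parse the XML response, repeating the request with a resumption token if the data provider returns the records in batches.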

ETD-db

ETD-db undertakes storage in a straightforward manner - all files are held in a basic directory structure; the area that they are in determines the security level applied to the item. If the security settings are changed, then the item is physically moved to another directory. The metadata associated with the item is maintained within the database for as long as the item remains in the repository.
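The directory-based approach can be sketched as follows. ETD-db itself is written in Perl; this Python sketch only illustrates the technique, and the directory names and function name are assumptions rather than ETD-db's actual layout.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed directory names, one per security level.
STATUSES = ("available", "withheld")


def change_status(root, item_name, new_status):
    """Move an item's directory into the area for new_status; return its new path.

    The Web server's access rules are attached to the status directories,
    so moving the item is what changes who can read it.
    """
    if new_status not in STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    root = Path(root)
    for status in STATUSES:
        source = root / status / item_name
        if source.is_dir():
            target = root / new_status / item_name
            if source != target:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(source), str(target))
            return target
    raise FileNotFoundError(item_name)
```

The simplicity is evident, but so is the dependence on a physical move of the files every time the security level of an item changes.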

ETD-db comes with the facility to expose its data via the OAI-PMH, but v1.7c (the last public release) supports only v0.9 of the protocol. Since the current version of the protocol at the time of writing is v2.0, a major upgrade to this area of the system is necessary; the producers of OAI-PMH currently recommend at least v1.1, with a view to all users moving to v2.0 in the not too distant future.

DSpace

All items in the DSpace archive have a kind of 'wrapper' in which the parts of the relevant data are stored. This includes all the individual files and the copyright licence. The metadata is maintained in qualified Dublin Core format in the database for as long as the item remains in the repository.

Security settings for the repository are handled via the authorisation policy tool and the security of the archive depends upon the way that the DSpace Administrator configures the policies for each community, collection, and item.

DSpace also comes with OAI-PMH v2.0 fully supported, allowing for immediate compatibility with the more advanced features of this standard. We would expect that future versions of DSpace would have the most up-to-date version of this protocol, (although any significant evolution is not expected [19]).

Conclusion

The DSpace archive is perhaps more geared toward digital preservation, although this issue is still very much in debate and a solution to the problem may be a long time in coming. It may be that digital preservation is an issue which is never 'solved' but which requires constant attention by those wishing to preserve, and it may not necessarily have anything to do with the software package in question. For this reason it is hard for us to be sure which package is going down the correct route, or even whether such a route exists.

Moving files around may be a weak spot within ETD-db, since the more you physically move files, the more chance there is of them being lost or corrupted. The purpose of moving files within ETD-db is to apply the Web server's directory security settings to everything in that directory and all sub-directories within. For a complex system this method of providing security is possibly not the best way, although it is much simpler to use and implement than the DSpace approach. Storing the files in a standard directory structure, as advocated by ETD-db, makes the files far easier to access without using the Web interface. DSpace requires you to use import and export facilities in order to move files in and out of its internal storage structure.

Administration and Security

Both systems require submitters to have their own user account before depositing any items. We are primarily interested here in how secure the current sign-up procedures are, and whether they can be replaced with institutional authentication systems.

In addition each package provides administrative features for service providers and administrative staff. These include some workflow facilities that allow certain users to perform tasks on a submitter's item, as well as user administration tools. The most important administrative options will be discussed in the sections below. In addition we will see how the security in each package functions at this level, and consider the best way of addressing any security issues that arise.

ETD-db

To register for an ETD-db account a desired username and password are requested on the registration page, and this is sufficient to open an account. The user is then moved on directly to create a new "Main Record", in which email address, name, and department are requested. The registration page does not validate the user's identity in any way, and anyone who can see the registration page is capable of creating an account.

Like other Web-based systems, it can be run over the Secure Sockets Layer (SSL), which gives very good protection to data being transferred to and from the submitter's machine. The user's password is stored in the database in hashed form using 'crypt', the native Unix password-hashing function, which is sufficiently strong for our purposes. There is no specific location into which administrators might plug an institutionally based authentication system. Instead it would be necessary to write any additional software interface that is required.

In order to access ETD-db's three administrative areas (Review Submitted ETDs, Manage Available ETDs and Manage Withheld ETDs), it is necessary to have the specific login details for each area, which are initially defined at installation. In this system there is only one username-password pair for each area, which means that it is impossible to give an administrator access to a subset of the options available within an area.

Review Submitted ETDs gives the administrator the option to browse the list of all E-theses currently in progress and to perform all of the actions that the author can perform on their item. Effectively this provides a "workspace" where a student and a supervisor can collaborate and communicate on the thesis. From here, the administrator may then also approve the thesis for inclusion into the repository in either "available" or "withheld" status.

Manage Available ETDs provides the facilities to administer the E-theses that are available to be viewed by the public. Primarily, at this stage this allows the administrator to remove the item or move it into the "withheld" section of the system.

Manage Withheld ETDs provides similar functionality to that of Manage Available ETDs, but with the option to move items into a status of "available".

The main drawbacks of this system are that there is no way of providing a single supervisor to a single submission exclusively, and that all supervisors with the permission to access the Review Submitted ETDs section can see the theses of all students who are currently submitting. The list of theses in progress is potentially quite large, so the supervisor may be presented with a list of hundreds upon login. It is also assumed that it will be the supervisor who will eventually agree that the metadata for the item is correct and that the thesis is complete, marked and ready to enter the repository - in general this will not be the case; instead we would expect a qualified librarian to make that decision.

The advantage of this administrative system is that it applies security via the Apache Web server to the directories that are restricted. This method of securing directories is well known and reliable, ensuring that content is sufficiently secure. The basic structure of a sensible administration system is here, but there are a number of bugs and security holes as well as a deficit in desirable functionality. For example, a policy system within the Review Submitted ETDs section would be a valuable addition.

DSpace

To register for a DSpace account requires just a valid email address. The user is emailed with an authentication token, which is then presented back to the system in order to activate the account. When this is done the system requests the user's name and telephone number and desired password. This reduces the chance of multiple accounts for one user, and also prevents people being signed up for an account in error. There is no specific validation, though, as to who can sign up for an account, and anyone who can see the registration page can register.
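The email-token registration flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DSpace's implementation: the class and method names are hypothetical, the token is returned rather than emailed, and password handling is left out.

```python
import secrets


class Registration:
    """Sketch of a two-step sign-up: request a token, then activate with it."""

    def __init__(self):
        self.pending = {}   # token -> email address awaiting activation
        self.accounts = {}  # email address -> account details

    def request_account(self, email):
        """Step 1: issue a one-time token for the given address.

        A real system would email the token to the address; returning it
        here keeps the sketch self-contained.
        """
        token = secrets.token_hex(16)
        self.pending[token] = email
        return token

    def activate(self, token, name, phone):
        """Step 2: activate the account if the token is valid and unused."""
        email = self.pending.pop(token, None)
        if email is None:
            return False  # unknown or already-used token
        if email in self.accounts:
            return False  # one account per email address
        self.accounts[email] = {"name": name, "phone": phone}
        return True
```

Because the token only ever reaches the owner of the mailbox, an account cannot be activated on behalf of someone else, and a second account for the same address is refused.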

Again, SSL is available and recommended to make the data being transferred secure. The user's password is stored in the database as an MD5 [20] hash, which is sufficiently secure. There is also the option to include a customised site authentication system (which must be written by the local administrator), or to authenticate users automatically using a Web certificate.

DSpace has many administrative options, and splits its facilities into two parts: Workflow users and DSpace administrators. The fundamental difference between these two sections is that Workflow users may only perform their administrative actions within the constraints of the workflow system (the permissions assigned to their group and location in the workflow define their available actions). These duties include reviewing, modifying and approving submissions after, and only after, the author has submitted them for consideration. DSpace has three well-defined workflow steps to which groups of users can be assigned in order to perform their duties. DSpace administrators, on the other hand, have access to a large set of tools located in a different area of the system, allowing them to administer user accounts and user groupings, create and configure communities and collections, manage support levels for file types, create and modify system policies and so on.

Login for Workflow users and DSpace administrators is provided through precisely the same system as login for all other users, and the differences in the behaviour of the accounts are purely down to the policies applied to the user account (so DSpace can have multiple administrator accounts, for example).
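The principle of a single type of user whose abilities are determined entirely by policy can be sketched as below. The group names, actions, targets and the `is_allowed` function are all hypothetical; this is not the DSpace authorisation API.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Every account is the same kind of object; only group membership differs."""
    email: str
    groups: set = field(default_factory=set)


# Hypothetical policies, each granting one action on one target to one group.
POLICIES = {
    ("administrators", "admin", "site"),
    ("theses-reviewers", "review", "collection:theses"),
    ("anonymous", "read", "collection:theses"),
}


def is_allowed(user, action, target):
    """Check whether any of the user's groups has a policy for this action."""
    groups = user.groups | {"anonymous"}  # everyone is implicitly anonymous too
    return any((group, action, target) in POLICIES for group in groups)
```

Granting a member of staff reviewing rights is then a matter of adding them to a group, and several accounts can hold administrator rights at once, in contrast to a single shared username-password pair per administrative area.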

This method has no real organisational drawbacks, employs a consistent method of system design and has no obvious security holes. Its advantages lie in the fact that there is only one type of user and that each user's properties can be modified, even over time if necessary. The main problem from which DSpace suffers here is that the current implementation of policies is difficult to use, and some refinements to the interface would be welcome.

Conclusion

The level of configuration available within the DSpace administrative area puts it far ahead of ETD-db in this category. Although this is not quite as sophisticated as we might want, it is only necessary to delve into the code itself in a few, more advanced, cases - but this is true of ETD-db also. The rigidity of the DSpace workflow, though, could make it hard for institutions to create steps that fit their current working methods.

ETD-db is designed to allow the easy authoring and supervision of E-theses, and the tools that it provides for this purpose are straightforward and relatively effective. DSpace provides none of this functionality and would need to have it added before a service could be provided. We have also seen that when withholding items, ETD-db provides far simpler - although potentially flawed - functionality than DSpace.

Overall the methodology employed by DSpace is superior to that of ETD-db, and many of the shortcomings of the DSpace system can be reasonably solved. Conversely, the work required to bring the ETD-db up to the same standard in all other respects is fairly extensive and may require rewriting of much of the software.

Overall Conclusions

In the majority of comparative areas that we have investigated we find that for our needs DSpace is clearly ahead of ETD-db. It is a well-supported package with a future that is being planned now, while ETD-db has been dormant for some time and its future is uncertain. DSpace is far more functional with regards to essential features such as security and administration and this sort of infrastructure is important for any piece of software of this nature, no matter what additional features are available.

The DSpace approach to user accounts and administration is more common than the ETD-db approach. The main shortcomings of the ETD-db method lie in the security issues that currently exist rather than in its limit of one submission per user, since it is likely that only one submission per user is required in an E-theses system (see Submission Procedures, Comparison). DSpace does not suffer from such severe issues, and any bugs found in the system should be fixed in the development process.

There are also areas where there is no great distinction between the packages. With the uncertainty of the evolution of digital preservation, their archiving methods could be difficult to choose between, although the fact that DSpace is already considering these issues perhaps makes it the favourite in this respect. Likewise, their submission interfaces are adequate, and similarly difficult to modify. Although we have not looked at them here, both have similar browse and search facilities at the moment, although DSpace's facilities are based upon a third-party search engine (Lucene [21]) which is capable of being employed in more powerful ways.

It is worth considering that ETD-db is designed specifically for E-theses, whilst DSpace's support in this regard is fairly generic. The questions that must then be answered are as follows:

  1. How hard would it be to add E-theses support, as we require it, to DSpace?
  2. How hard would it be to bring ETD-db up to the standard that we would require for an E-theses service?

During product evaluation both questions have been considered. The results indicate that bringing ETD-db up to standard would require extensive bug fixing as well as major feature upgrades to improve data structuring, security, and overall behaviour of much of the system. Creating E-theses-specific functionality in DSpace, however, is not only one of the possible features in the development plan, but requires mostly minor modifications to the system, with some software engineering to provide additional functionality. The estimates of both the ease of doing this and the long-term support of modifications suggest that DSpace would provide a better core system for an E-theses repository than ETD-db.

There is a strong argument to suggest that since an E-thesis is a piece of university research output, it belongs alongside other forms of research output, such as E-prints. ETD-db does not have the functionality to deal with these other forms of electronic document, because of its metadata schema, its user account types (and the constraint to one submission per user at any one time), and the structure of its administration system. DSpace has been developed with these other forms of research in mind, and adding E-theses support is a logical progression.

The future of E-theses and of archiving and searching in general depends on institutions being able to deliver top quality services, with a high degree of interoperability. This means, among other things, that systems must continue to be developed and they must be able to handle many different types of digital object. We believe that DSpace will fulfil these requirements to a higher degree than ETD-db and will continue to improve in this way in the future.

(Editor's note: Readers may also be interested to read DAEDALUS: Initial experiences with EPrints and DSpace at the University of Glasgow by William Nixon in issue 37).

References

  1. Theses Alive! Project at Edinburgh University Library: http://www.thesesalive.ac.uk/
  2. Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/ and Focus on Access to Institutional Resources (FAIR) Programme: http://www.jisc.ac.uk/index.cfm?name=programme_fair
  3. ETD-db home page: http://scholar.lib.vt.edu/ETD-db/ and Virginia Tech's E-theses repository: http://etd.vt.edu/
  4. DSpace home page: http://www.dspace.org/ and MIT's institutional repository: http://dspace.mit.edu/
  5. DAEDALUS Project, Glasgow University Library: http://www.lib.gla.ac.uk/daedalus/
  6. EPrints.org home page: http://www.eprints.org/ and the University of Southampton e-Prints Service: http://eprints.soton.ac.uk/
  7. Nixon, W. 2003. DAEDALUS: Initial experiences with EPrints and DSpace at the University of Glasgow. Ariadne. Issue 37. http://www.ariadne.ac.uk/issue37/nixon/
  8. The Networked Digital Library of Theses and Dissertations (NDLTD): http://www.ndltd.org/
  9. Perl scripting language: http://www.perl.org/
  10. MySQL Database: http://www.mysql.com/
  11. DSpace Federation: http://www.dspace.org/federation/
  12. Java at Sun Microsystems: http://java.sun.com/
  13. PostgreSQL Database: http://www.postgresql.com/
  14. Tomcat, Apache Jakarta project: http://jakarta.apache.org/tomcat/
  15. Ant Java build tool from Apache: http://ant.apache.org/
  16. Dublin Core: http://dublincore.org/ , or more specifically http://dublincore.org/documents/dcmi-terms/
  17. Electronic Theses and Dissertations - Metadata Standard (ETD-MS): http://www.ndltd.org/standards/metadata/current.html . Also for additional interest, ETD-ML Document Type Definition: http://etd.vt.edu/etd-ml/
  18. Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH): http://www.openarchives.org/
  19. Lagoze, C. 2003. Open Archives Initiative: where are we, where are we going. OA Forum, Bath. http://www.oaforum.org/documents/wspres.php
  20. MD5 at RSA Security: http://www.rsasecurity.com/rslabs/faq/3-6-6.html
  21. Apache Lucene Search Engine: http://jakarta.apache.org/lucene/

Author Details

Richard D Jones
Richard is the Systems Developer for Theses Alive! at Edinburgh University Library.

Email: r.d.jones@ed.ac.uk
Web site: http://www.thesesalive.ac.uk/
