Web Magazine for Information Professionals

The Wellcome Library, Digital

Christy Henshaw and Robert Kiley describe how the Wellcome Library has transformed its information systems to support mass digitisation of historic collections.

Online access is now the norm for many spheres of discovery and learning. What benefits bricks-and-mortar libraries have to offer in this digital age is a subject of much debate and concern, and will continue to be so as learning resources and environments shift ever more from the physical to the virtual. In order to maintain a place in this dual environment, most research libraries strive to replicate their traditional offerings in the digital world.

Over the past three years the Wellcome Library has followed a transformation strategy to make the Library digital [1]. The ambition was not to create an online shop window, but to break permanently the bonds imposed by a physical library and provide full access to our collections in new and innovative ways. We aimed to create an entirely new digital presence based on the Wellcome Library’s historic foundations and modern personality.

At the heart of this transformation strategy are two key elements: mass digitisation of unique collections, and the creation of an online experience that transcends the generic offerings of mass distribution and offers unique value. Although this work is still in progress, we have so far integrated the digital content into the Library’s existing discovery mechanisms and created new discovery routes via interpretive content and tools such as tailored subject browse, full-text search within items, and an engaging timeline application. Plans to incorporate full-text search across digitised collections, and to create an engaging online environment that promotes exploration of the collections by the ‘curious public’, will go still further.

The overarching aims of the technological development of the digital library were to create a modular framework that integrated existing and new systems, provide a foundation for future growth and sustainability, and accommodate policy decisions around licensing and access.

This article describes how we set about engineering this transformation, how we tackled technological and policy barriers and paved the way for future development and growth, and some of the key lessons learnt so far.

Strategy

The digitisation strategy [2] at the Wellcome Library has been through multiple iterations over the years*. In 2010 the Wellcome Trust Board of Governors approved a multimillion-pound digitisation programme that set the stage for an ongoing programme of mass digitisation. The content for the first ‘pilot’ phase was largely focused on the theme of genetics, ‘Codebreakers: Makers of Modern Genetics’ [3], which included our own collections complemented by archives held by other institutions. Mindful of the need to respond to emerging partnership opportunities, we added further projects to the mix, including a project with ProQuest to digitise books for the Early European Books [4] initiative and a project part-funded by Jisc to digitise the Medical Officer of Health reports for Greater London [5].

Although the main focus of the Wellcome Library’s collection is on the history of medicine and providing access to researchers in this field, the Library’s resources also provide value to many other historical and social scientific fields. These were the key audiences for the digital library. However, the Library also aims to increase its potential to improve public understanding of medicine and science. The long-term goal is to work with the Wellcome Collection [6] in bringing digital resources to the wider ‘curious public’ in conjunction with exhibitions both physical and online. The digital library infrastructure can provide a technological framework for this.

A key element of the pilot programme was to build a sustainable and expandable mechanism for creating, storing and delivering data – the basis for a new digital library. In developing this infrastructure, however, the Library was highly conscious that it already had some key systems in place – a library management system (Sierra), a discovery application (Encore), and a digital asset management system (Safety Deposit Box, or SDB) – which would need to be integrated into the digital library infrastructure. Furthermore, it was agreed that the digital content had to be fully discoverable alongside the Library’s physical holdings. This required a single search and discovery experience that would allow users to find all relevant content, irrespective of whether it had been digitised or not.

Principles for Technological Change

The Wellcome Library has been digitising image-based content for many years, and making this available through an image library - the Wellcome Images portal [7]. However, although an image library is useful for finding individual images, it is not the right tool for providing users with the ability to browse, read and download entire texts or archive collections. It lacks a number of features that are required for conducting research on Library materials and does not have the necessary technological framework in place to expand beyond single-image, fixed-resolution access.

Consequently, when the Library embarked upon its mass digitisation programme, it was self-evident that we would need to build a new framework through which this content could be accessed in innovative and engaging ways. However, the digital library was never intended to be a stand-alone system. There was a desire to ensure that effort was not duplicated, and that system administration did not become overly complicated. Therefore, although the system was to be modular – to ensure that individual elements could be independently upgraded or replaced as required – there was to be little to no duplication of pre-existing functionality.

There are three main areas where convergence is not always achieved in the standard digital library due to policy decisions or simple technological complexity:

The Wellcome Library has demonstrated that such convergence is not only desirable, but entirely possible; even if there are still areas that need further work. It is early days in terms of receiving significant user feedback on our results, but from a technical point of view, much has been achieved in this area.

Prototyping and Procuring the Digital Delivery System

The feasibility of the Wellcome Library’s digital delivery plans was tested through a series of studies carried out in 2010 [8]. This feasibility phase included:

The feasibility phase was the first step towards an integrated programme of work on a series of system components. The crux was a new application, which became known as the Digital Delivery System (DDS), that would interoperate with SDB and the Library catalogue and incorporate an entirely new digital content ‘Player’.

In summer 2011, the DDS project formally began with establishing the business requirements. The key requirements were to:

Faced with this somewhat daunting set of pre-conditions, the Wellcome Library issued an Invitation to Tender (ITT) to develop a technical specification for the DDS. The output from this work would be a written report laying out the specifications for a DDS, not working code.

After assessing the responses, two suppliers were awarded the contract to develop this specification.  Although this approach did incur additional costs – as the Library was ultimately only going to contract with one supplier to actually build the DDS – it did provide an opportunity to truly assess the capabilities of two suppliers (especially in terms of demonstrating how their proposed solution would integrate with the Library’s existing systems) before contracting with them to build the DDS solution.

Digirati, a digital agency that provides digital strategy, design, integration and engineering consultancy [14], was subsequently awarded the contract to build the DDS and the Player. They also implemented a migration of the entire Library Web site to a new Web site content management system (CMS), SDL Alterian Content Manager.

In order to facilitate a creative, iterative approach, the Library followed an informal, agile project management methodology. In practice, this entailed working according to 2-week ‘sprints’ – or work-packages – where each work-package constituted a separate contract with the supplier. This approach reduced the risk inherent in creating inflexible specifications at the start of a project that cannot easily adapt to changing requirements or influences.

This approach, however, increased the risk of scope creep and made it difficult to forecast the total spend. To mitigate these risks, an overall development timescale (with major milestones) and a budget envelope were developed and fixed. This, coupled with fortnightly meetings to review the deliverables of each work-package and assess priorities for the next one, helped to keep the project on time and to budget.

Overall, this approach has proved highly effective in allowing the Wellcome Library to achieve its original goals, and to incorporate new ideas in a well-defined and resourced manner as the project took shape.

Alongside procurement of a development partner, the Wellcome Library also commissioned a Web site design company to analyse and redesign the information architecture for the whole Wellcome Library Web site, and to design a new ‘Codebreakers’ microsite to showcase to researchers the online resource of the genetics-related content created by the Library’s digitisation projects.

This procurement followed a more traditional pattern, with a call for proposals and presentations by a number of relevant companies. The Library appointed Clearleft [15], a user-experience-based design company which worked on the project from first principles, from defining audiences and interviewing actual users through to creating much of the HTML code that Digirati staff could use as they built the new Web site.

The design brief Clearleft worked to also changed and grew as the project progressed and the Library became more experienced with and aware of how deeply user experience informs the creation of an online resource.

Metadata and METS

The functionality of the Library’s DDS is predicated on metadata. Metadata are often classed as ‘descriptive’ or ‘administrative’. However, the boundaries between these classes are not always clear to anyone other than a metadata expert, so here the metadata are described according to what they achieve.

 

| Metadata type | Stored in | Used by | Used for |
| --- | --- | --- | --- |
| Bibliographic, item descriptions | Library catalogues | Player, via API | Populates ‘More info’ panel [16] |
| Access level (open, restricted, etc.) | METS | Player | Informs access restrictions where required and prompts a login box where appropriate [17] |
| Usage rights (CC-BY-NC, private use only, etc. stored as a code “A”, “B” etc.) | METS; Player configuration file | Player | Displays correct message found in configuration file according to code included in the METS [18] |
| Logical description (pagination, detail of sections such as “cover” or “table of contents”) | METS | Player | Allows navigation by page number; builds index in ‘Contents’ pane etc. [19] |
| File information (image format and size, filename, file type, unique ID of file, list of ALTO file IDs, etc.) | METS | Image server; Player | To retrieve digital content from SDB or the cache and display it correctly; find ALTO files for search term highlighting in images [20] |

Table 1: Metadata used by the Digital Delivery System (DDS)

It will be noted that much of the metadata necessary for the Player to render content on the Web is stored in METS files [13]. METS is an XML schema for describing digital objects; in the Wellcome Library’s case, it holds the information needed to deliver digital content. METS can also hold preservation metadata, but in this case all preservation metadata is held in SDB.

Figure 1: The Player, with the 'More Information' and 'Index' panes expanded

METS files as used here are subject to change (for example, if an access level changes) and therefore are not static representations of objects in the Library’s collections and are not archived.

METS files are created by the Library’s workflow system, Goobi, which automates the ingest of content into SDB and aggregates data from the Library catalogues, SDB and user-generated data into METS [21]. The DDS transforms the METS metadata to JSON so that the Player can read it and render content appropriately.
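
The METS-to-JSON step can be illustrated with a minimal sketch. It assumes a heavily trimmed, hypothetical METS document and a made-up JSON shape (`pages`, `label`, `file_id`, `mime_type`); the element and attribute names (`mets:file`, `mets:div`, `mets:fptr`, `ORDER`, `ORDERLABEL`, `FILEID`) come from the METS schema, but nothing here is taken from the DDS itself.

```python
import json
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily trimmed METS document; real files carry far more detail.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OBJECTS">
      <mets:file ID="FILE_0001" MIMETYPE="image/jp2"/>
      <mets:file ID="FILE_0002" MIMETYPE="image/jp2"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ORDER="1" ORDERLABEL="i">
        <mets:fptr FILEID="FILE_0001"/>
      </mets:div>
      <mets:div TYPE="page" ORDER="2" ORDERLABEL="1">
        <mets:fptr FILEID="FILE_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def mets_to_json(mets_xml: str) -> str:
    """Flatten the page sequence of a METS document into a JSON string."""
    root = ET.fromstring(mets_xml)

    # Index the file section so each page can look up its file's MIME type.
    mime_by_id = {
        f.get("ID"): f.get("MIMETYPE")
        for f in root.iterfind(".//mets:file", METS_NS)
    }

    pages = []
    for div in root.iterfind(".//mets:div[@TYPE='page']", METS_NS):
        file_id = div.find("mets:fptr", METS_NS).get("FILEID")
        pages.append({
            "order": int(div.get("ORDER")),
            "label": div.get("ORDERLABEL"),  # printed page number, e.g. "i"
            "file_id": file_id,
            "mime_type": mime_by_id.get(file_id),
        })

    pages.sort(key=lambda page: page["order"])
    return json.dumps({"pages": pages}, indent=2)
```

Calling `mets_to_json(SAMPLE_METS)` yields a small JSON document that a browser-based viewer could consume without parsing XML, which is the general motivation for a transformation of this kind.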

Metadata such as pagination is added to each item manually, whilst access levels and licence codes are assigned by default according to specific projects, and then adjusted manually by exception, or manually assigned to specific sections within an item (see below for further details of authentication functionality).
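
The 'default by project, adjusted by exception' rule for licence codes can be sketched as follows. The project name, the mapping of codes "A" and "B" to particular messages, and the message wording are all assumptions made for illustration; the article states only that codes are stored in METS and that the matching message lives in the Player configuration file.

```python
from typing import Optional

# Hypothetical stand-in for the Player configuration file: code -> message.
LICENCE_MESSAGES = {
    "A": "Available under a CC-BY-NC licence.",
    "B": "For private use only.",
}

# Hypothetical project-level defaults.
PROJECT_DEFAULTS = {"codebreakers": "A"}


def effective_code(
    project: str,
    item_override: Optional[str] = None,
    section_override: Optional[str] = None,
) -> str:
    """Return the most specific licence code: section, then item, then project."""
    for code in (section_override, item_override):
        if code is not None:
            return code
    return PROJECT_DEFAULTS[project]


def conditions_of_use(code: str) -> str:
    """Look up the message to display for a licence code."""
    return LICENCE_MESSAGES[code]
```

Keeping only a short code in the per-item metadata means the wording of a licence message can be changed in one place without touching any item records.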

Figure 2: Conditions of Use message

Content is held together with a persistent unique ID based on the system number allocated by Sierra, the library management system used by the Wellcome Library. Each catalogue record contains this system number and it is carried into SDB and stored in the METS files. As the system ID is already present in the catalogue record, it facilitates automated link-creation by Encore (URL path + system ID), the Wellcome Library’s discovery platform.  The connecting ‘cycle’ from clicking a link to a digital item, to opening the Player, to retrieving the content, to linking back to the catalogue record, rests on the passing forward of this unique ID.
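
The 'URL path + system ID' pattern is simple enough to sketch. The base path below is taken from the Player links in the reference list; the function names and the `.xml` extension for the METS filename are assumptions for illustration only.

```python
from urllib.parse import urlparse

# Base path as it appears in the Player links in the reference list.
PLAYER_BASE = "http://wellcomelibrary.org/player/"


def player_url(system_id: str) -> str:
    """Build a Player link from a catalogue system ID (URL path + system ID)."""
    return PLAYER_BASE + system_id


def system_id_from_url(url: str) -> str:
    """Recover the system ID from a Player link: the last path segment."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


def mets_filename(system_id: str) -> str:
    """Assumed naming convention: the METS file is named after the system ID."""
    return f"{system_id}.xml"
```

For example, `player_url("b1803469x")` gives the link cited in reference 16, and `system_id_from_url` reverses it, which is all that is needed to link back to the catalogue record.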

Figure 3: Highlighting of search terms and search-term navigation for text-based content

SDB assigns its own ID to each object and to each file. For example, a book of 100 pages will have 101 unique SDB IDs. These are all stored in the METS file so the Player can retrieve individual files from SDB when that book is requested by a user.

The Player and JPEG 2000

The Wellcome Library Player enables a variety of content to be viewed via Web browsers on desktop or mobile platforms.

The Player draws on various technologies to achieve this, specifically:

The Player was custom built by Digirati based on designs user-tested and wireframed by Clearleft, and aims to provide all the basic functionality that today’s users expect from an online viewing experience.

Basic functionality for audio-visual material, including play, stop, pause, fast-forward, volume control and transcript download, is common to all online players and was not particularly difficult to define (in future, as our requirements for A/V content become more complex, this may change) [22].

Image-based content, however, was more complicated due to the different types of content being represented – from grey literature to oil paintings.  Particularly difficult was incorporating printed-book-style functionality such as page navigation with an elegant, smooth and zoomable image viewer. After months of research and development, the end result brought together the elegant image navigation of Seadragon [23] and the Library’s own user research-based navigation and page design ideas to provide a seamless, engaging and useful all-in-one Player [24].

The Library’s archival image format is JPEG 2000, chosen chiefly because its intelligent compression algorithm reduces storage costs by up to 80% [25]. JPEG 2000 files cannot be displayed in Web browsers, so any JPEG 2000 solution needs to convert images to JPEG (or another Web-friendly format) for display.

The DDS incorporates an image server application called IIPImage [26]. IIPImage can quickly and efficiently convert JPEG 2000 files on-the-fly to JPEG. It does this tile by tile, so as you zoom and pan across an image it can serve up JPEG tiles according to tile requests made by the Player. When an item is opened for the first time in the Player, the DDS will ensure that the JP2 file for any visible page is copied from SDB to the DDS’ local cache, so that the IIPImage image server can begin to generate tiles in response to requests from the Player. A JP2 file will only be scavenged from the cache if it is not being used, meaning that over time the cache will hold the most popular content. The generated tiles are also cached, but for shorter periods, and depending on access conditions.
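
The caching rule described here, where a file is only scavenged when it is not in use, can be sketched as below. This is not the DDS implementation: the class and method names are hypothetical, and evicting the least recently used idle file first is an assumption chosen because it tends to keep popular content in the cache.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class Jp2Cache:
    """Local file cache that never evicts a file while it is in use."""

    def __init__(self, capacity: int, fetch_from_store: Callable[[str], bytes]) -> None:
        self._capacity = capacity
        self._fetch = fetch_from_store  # e.g. a call that copies a file from the repository
        self._files: "OrderedDict[str, bytes]" = OrderedDict()  # oldest first
        self._in_use: Dict[str, int] = {}

    def acquire(self, file_id: str) -> bytes:
        """Return the file, fetching it on a miss, and mark it as in use."""
        if file_id in self._files:
            self._files.move_to_end(file_id)  # record recent use
        else:
            self._files[file_id] = self._fetch(file_id)
        self._in_use[file_id] = self._in_use.get(file_id, 0) + 1
        self._scavenge()
        return self._files[file_id]

    def release(self, file_id: str) -> None:
        """Mark one use of the file as finished."""
        if self._in_use.get(file_id, 0) > 0:
            self._in_use[file_id] -= 1

    def cached_ids(self) -> List[str]:
        return list(self._files)

    def _scavenge(self) -> None:
        """Evict idle files, least recently used first, until within capacity."""
        for candidate in list(self._files):
            if len(self._files) <= self._capacity:
                break
            if self._in_use.get(candidate, 0) == 0:
                del self._files[candidate]
                self._in_use.pop(candidate, None)
```

If every cached file is in use, the sketch lets the cache exceed its capacity for a while instead of failing the request; a real system would need its own policy for that case.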

The Player is a ‘client’ of the DDS and therefore acts as an API to the DDS (and the digital content). This API allows the Player to be embedded into any Web site or blog, where it will look almost, and act exactly, the same as it does on the Wellcome Library Web site (the user can define the size of the Player, although at any size there is a full-screen option).

Access and Licensing

Policy Developments

In addition to creating content and building a digital repository, the Wellcome Library had to reframe policies around sensitivity assessment, copyright clearance and licensing of content for online use [27]. The implications of these policies had a significant impact on the development of the technical solution.

Access levels for digital content include:

End-user licences had to be developed to include a variety of conditions of use. For out-of-copyright content or orphan works that are openly available, a Creative Commons licence is used (usually CC-BY-NC) [28]. Depending on permissions granted by copyright holders, licences will vary and may include a restriction on download options, particularly downloading entire books as a PDF or high-resolution versions of individual images.  Archival content such as personal or organisational papers is typically made available under an attribution non-commercial licence with the additional condition that sensitive or personal data not be misused. Conditions of Use statements are displayed at point of use in the ‘More info’ tab of the Player.

Developing Authentication Functionality

Because sensitive items are included in our digital repository, appropriate security measures for the digital content (and metadata related to it) had to be factored in when building the digital library. Developing authentication functionality to allow or restrict access to certain classes of material meant that an authentication check had to apply to all media types (including all image tiles) and to administrative metadata, all without affecting performance.

Figure 4: Library online registration forms

In keeping with the approach of using existing systems wherever possible, the Player uses the Sierra authentication system – the same one Library members use when checking their loans or requesting items from the closed stacks. To do this, the Player calls the Sierra authentication API to check whether the entered username and password are valid.

If a user does not have a Library card (or does not wish to join the Library) they can still log in using a social media account (Twitter, Facebook etc.). The first time they do this for a ‘Registration required’ item, the system presents the Library’s Terms and Conditions, which they simply need to accept before content can be viewed. For a ‘Clinical’ item, they must instead request permission from the Wellcome Library, after which they can log in and view clinical items.

Figure 5: View of archival content by unauthorised user

In addition to this relatively simple authentication layer, the Player has been developed to restrict specific images within a multi-image object, such as a file of letters.  This image-level approach to authentication has allowed the Library to maximise the amount of content it can make available online.  Under this model if an object of, say, 100 images, includes 5 images that are highly sensitive, the Player can still make the other 95 images available.  If the Library had simply taken an object-based approach for authentication then, in this example, all 100 images would have been restricted or closed (and not available online).

In order for the Player to do this, it must refer to the access information contained in the JSON file for each item, and for each file within that item. The Player can then match the access code against the user account details and present the appropriate display (allow the user to view the content, ask the user to register, or prevent the user from accessing content that would never be permissible to view online).
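
A minimal sketch of image-level access checking follows. The access labels (`open`, `registration_required`, `clinical`, `closed`), the `User` fields, the outcome names and the JSON shape are hypothetical; the only behaviour taken from the article is that each file can carry its own access level and that the outcome is to show content, prompt the user, or withhold it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class User:
    logged_in: bool = False
    accepted_terms: bool = False     # agreed to the Terms and Conditions
    clinical_approved: bool = False  # permission granted for clinical items


def decide(access: str, user: User) -> str:
    """Return 'show', 'ask_login', 'request_permission' or 'deny'."""
    if access == "open":
        return "show"
    if access == "registration_required":
        return "show" if user.logged_in and user.accepted_terms else "ask_login"
    if access == "clinical":
        if not user.logged_in:
            return "ask_login"
        return "show" if user.clinical_approved else "request_permission"
    # 'closed' and any unrecognised code fail closed.
    return "deny"


def plan_display(item: dict, user: User) -> Dict[str, str]:
    """Decide per file; a file without its own level inherits the item's."""
    default = item["access"]
    return {
        f["id"]: decide(f.get("access", default), user)
        for f in item["files"]
    }
```

With an item of 100 images of which 5 are marked `closed`, `plan_display` returns `show` for the other 95 for an anonymous user, which is the outcome the image-level approach is meant to achieve.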

Figure 6: Player log-in screen

Other security measures are in place to prevent access to data held in the METS files that may pertain to sensitive images, prevent re-engineering URLs or image requests, and more. This enables the Library to manage and preserve even highly sensitive archives in SDB.

Seven Simple Steps of DDS Operation

Step 1. The user discovers a link to the digital content in the Library catalogue or anywhere else on the Web.

Step 2. Clicking this link passes a command to the Player to open and to find a METS file with the filename corresponding to the system ID in the URL.

Step 3. The Player opens the METS file and transforms the metadata to JSON.

Step 4. The Player reads the SDB file IDs in the JSON file and requests individual files as appropriate from SDB.

Step 5. For JPEG 2000 files, the IIPImage server caches the JPEG 2000 files retrieved from SDB and creates and caches JPEG tile derivatives.

Step 6. The Player checks the authentication status of the user against the access level of the item and its individual files.

Step 7. The Player displays the content, or displays the appropriate login box/message.
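
The seven steps can be read as a single request-handling function. The sketch below follows the order of the steps as listed and injects every external system as a plain callable; all names (`open_item`, `load_manifest`, `fetch_file`, `may_view`) and the manifest shape are hypothetical, and a real system might well check access before retrieving anything.

```python
from typing import Callable, Dict, List


def open_item(
    url: str,
    load_manifest: Callable[[str], dict],  # steps 2-3: find the METS file, return it as JSON-like data
    fetch_file: Callable[[str], bytes],    # steps 4-5: retrieve from the repository or cache
    may_view: Callable[[str], bool],       # step 6: compare access level with the user's status
) -> Dict[str, object]:
    # Steps 1-2: the link carries the system ID as its last path segment.
    system_id = url.rstrip("/").rsplit("/", 1)[-1]

    # Step 3: metadata arrives as a dict with a 'files' list.
    manifest = load_manifest(system_id)

    display: List[bytes] = []
    withheld: List[str] = []
    for entry in manifest["files"]:
        content = fetch_file(entry["sdb_id"])  # steps 4-5
        if may_view(entry["access"]):          # step 6
            display.append(content)
        else:
            withheld.append(entry["sdb_id"])

    # Step 7: show what is permitted; the rest triggers a login box or message.
    return {"system_id": system_id, "display": display, "withheld": withheld}
```

Passing the collaborators in as callables keeps the flow testable with simple fakes and mirrors the modular design the article describes, in which each component can be replaced independently.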

Figure 7: Simplified view of the Digital Delivery System (DDS)

Conclusion

The information architecture implemented by the Wellcome Library is a cornerstone of functionality to come. Changing user requirements, new funding opportunities, innovation, and the size and variability of the resources digitised will continue to shape the online spaces the Library provides.

Transformation is a continual process, but establishing a technical framework based on sound principles is an essential first step. By the end of 2013, the Library expects to have all the key technical components in place to deliver its goals for greatly expanded access, engagement and reuse of the digital collection.

*Editor’s note: readers may be interested in earlier articles from colleagues at Wellcome:

Chris Hilton, Dave Thompson. "Collecting Born Digital Archives at the Wellcome Library". January 2007, Ariadne Issue 50 http://www.ariadne.ac.uk/issue50/hilton-thompson/

Chris Hilton, Dave Thompson. "Further Experiences in Collecting Born Digital Archives at the Wellcome Library". October 2007, Ariadne Issue 53 http://www.ariadne.ac.uk/issue53/hilton-thompson/

Dave Thompson. "A Pragmatic Approach to Preferred File Formats for Acquisition". April 2010, Ariadne Issue 63 http://www.ariadne.ac.uk/issue63/thompson/

References

  1. Transforming the Wellcome Library 2009 – 2014
    http://wellcomelibrary.org/about-us/library-strategy-and-policy/transforming-the-wellcome-library/
  2. Digitisation at the Wellcome Library http://wellcomelibrary.org/about-us/projects/digitisation/
  3. Codebreakers: Makers of Modern Genetics
    http://wellcomelibrary.org/using-the-library/subject-guides/genetics/makers-of-modern-genetics/
  4. Early European Books http://eeb.chadwyck.com/home.do
  5. London Medical Officer of Health project
    http://wellcomelibrary.org/about-us/projects/digitisation/london-medical-officer-of-health-project/
  6. Wellcome Collection http://www.wellcomecollection.org/
  7. Wellcome Images http://wellcomeimages.org/
  8. Henshaw, C, Savage-Jones, M and Thompson, D. “A Digital Library Feasibility Study” LIBER Quarterly, Vol. 20, no. 1 (2010) http://liber.library.uu.nl/index.php/lq/article/view/7975
  9. Safety Deposit Box http://www.digital-preservation.com/solution/safety-deposit-box/
  10. Encore discovery system http://encoreforlibraries.com/
  11. Sierra Services Platform http://sierra.iii.com/
  12. JPEG 2000 http://en.wikipedia.org/wiki/JPEG_2000
  13. METS http://www.loc.gov/standards/mets/
  14. Digirati Digital Business Solutions http://www.digirati.co.uk/
  15. Clearleft http://clearleft.com/
  16. Wellcome Library player: Click on the ‘More information’ prompt to the right of the main window http://wellcomelibrary.org/player/b1803469x
  17. Wellcome Library player: Crick papers, requiring registration and login
    http://wellcomelibrary.org/player/b18167214
  18. Wellcome Library player: Click on the ‘More information’ prompt to the right and click on ‘View conditions of use’ http://wellcomelibrary.org/player/b1803469x
  19. Wellcome Library player: Click on the Index prompt to the left http://wellcomelibrary.org/player/b1803469x
  20. Wellcome Library player: Enter a keyword search in the search box at the bottom
    http://wellcomelibrary.org/player/b1803469x
  21. Goobi Intranda version http://www.digiverso.com/en/products/goobi
  22. Digitised video The Five (1970) http://wellcomelibrary.org/player/b16672422
  23. Seadragon http://en.wikipedia.org/wiki/Seadragon_Software
  24. The Player: a new way of viewing digital collections
    http://blog.wellcomelibrary.org/2012/11/the-player-a-new-way-of-viewing-digital-collections/
  25. JPEG 2000 at the Wellcome Library blog, an account of the Library’s JPEG 2000 journey http://jpeg2000wellcomelibrary.blogspot.co.uk/
  26. IIPImage http://iipimage.sourceforge.net/
  27. Access to Archives policy http://wellcomelibrary.org/content/documents/access-to-archives.pdf
  28. Copyright clearance and takedown
    http://wellcomelibrary.org/about-this-site/copyright-clearance-and-takedown/

Author Details

Christy Henshaw
Digitisation Programme Manager
Wellcome Library

Email: c.henshaw@wellcome.ac.uk

Christy Henshaw has managed the Wellcome Library’s digitisation programme since 2007.

Robert Kiley
Head of Digital Services
Wellcome Library

Email: r.kiley@wellcome.ac.uk

Robert Kiley is Head of Digital Services at the Wellcome Library.  In this role he is responsible for developing and implementing a strategy to deliver electronic services to the Library’s users – both in person and remotely.