Automating Harvest and Ingest of the Medical Heritage Library
Christy Henshaw, Dave Thompson and João Baleia describe an automated process to harvest medical books and pamphlets from the Internet Archive into the Wellcome Library’s Digital Services environment.
Overview of the UK Medical Heritage Library Project
The aim of the UK Medical Heritage Library (UK-MHL) Project is to provide free access to a wealth of medical history and related books from UK research libraries. There are already over 50,000 books and journal issues in the Medical Heritage Library drawn from North American research libraries. The UK-MHL Project will expand this collection considerably by digitising a further 15 million pages for inclusion in the collection. The Wellcome Library is incorporating these books into its own digital library, making further strides towards becoming a global online resource for the history of medicine and health.
The UK-MHL Project is funded by Jisc, a registered charity providing digital solutions for UK education and research, and the Wellcome Trust. Content for digitisation is selected from the Royal College of Physicians of Edinburgh, Royal College of Physicians of London, Royal College of Surgeons of England and the Wellcome Library, and six university libraries: King’s College London, London School of Hygiene & Tropical Medicine, University College London (UCL), University of Bristol, University of Leeds and University of Glasgow.
All digitisation is carried out by the Internet Archive, one of the largest digital libraries in the world, and a widely used digitisation bureau. The digitisation centre is located at the Wellcome Library in London, hosting 12 digitisation units and 14 members of staff. The Internet Archive uploads both metadata and digitised images to its Web site for quality control and creation of a range of image and text-based dissemination formats, making all content freely available to users at no extra cost to the content providers.
Overview of the Wellcome Digital Library Systems
The Wellcome Library’s systems are described in detail in a previous Ariadne article. Table 1 shows a summary of the key systems involved.
System – Description and purpose
Preservica – Digital asset management system (formerly known as Safety Deposit Box) managing long-term preservation of digitised and born-digital content
Goobi (“Intranda version”) – a workflow system that manages content harvest from the Internet Archive site, image validation and conversion, metadata mapping, encoding access conditions, ingest, creation of METS (Metadata Encoding and Transmission Standard) files, and more
Sierra – Library catalogue/bibliographic database
Encore – Library discovery interface for all physical and digital content
Digital Delivery System (DDS) – Image server that delivers JPEG tiles created on-the-fly from JPEG 2000 image files
Wellcome Library “the player” – Media viewer based on OpenSeadragon and HTML5 that is the user interface to digital content on the Wellcome Library Web site
Table 1: Wellcome digital library systems
Placing Content on the Internet Archive Site
Books and pamphlets earmarked for digitisation as part of the UK-MHL Project are selected by all the contributing partners from their medical history and related collections within a date range of 1780 - 1914. The aim is to create a broad-based resource that reflects the interests and knowledge of those involved in medicine and healing in the 19th century, so the subject areas are varied, including both core medical subjects such as anatomy, surgery or neurology, and broader health or body-related topics including physical exercise, cookery, and phrenology.
Figure 1: Artistic Anatomy, by Mathias Duval, 1884, page 18
Once the selection has been made, all partners export their catalogue records and send them to the Wellcome Library, where they are compared to titles that are already available on the Internet Archive or already on the list to be digitised. Duplication rates generally fall between 20% and 40%, depending on the collection. An item is considered a duplicate if a copy of the same edition is available on the Internet Archive site and also has a MARC record available online.
In order to do this comparison, we downloaded all the available MARC records for the Internet Archive’s 19th-century monographs and created a “master list” of existing digital content, which grows each time more unique titles are discovered and earmarked for digitisation. We can also root out duplicates within collections, although in some cases there is good reason to digitise duplicates, for example where there are interesting annotations.
Before the comparison is possible, it is necessary to convert the descriptive metadata, delivered as separate MARC21 records, into a single CSV (comma-separated values) file. This is done using an XSLT (Extensible Stylesheet Language Transformations) document that selects only material types considered in scope and concatenates part of the author name, part of the main title and the date of publication into a single string. It also adds, in a separate column, the unique system or institutional identifier for that record (normally the MARC 001 field).
The resulting document is then compared to the “master list” using a VLOOKUP function, which matches on the concatenated strings because there is no reliable universal unique identifier for this material.
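The matching step can be sketched in Python rather than XSLT and a spreadsheet VLOOKUP. The truncation lengths, normalisation rule and sample identifiers below are illustrative assumptions, not the project’s actual matching rule; the point is that a concatenated, normalised string stands in for the missing universal identifier.

```python
def match_key(author: str, title: str, year: str) -> str:
    """Build a concatenated match string for de-duplication.

    A sketch of the rule described in the article: join part of the
    author name, part of the main title and the publication date.
    The truncation lengths here are illustrative, not the real ones.
    """
    def norm(s: str) -> str:
        # Normalise case and drop punctuation so trivial differences
        # between catalogues do not defeat the match.
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(author)[:10] + norm(title)[:25] + norm(year)

# The "master list" plays the role of the VLOOKUP target: a mapping
# from match key to the identifier of the copy already digitised.
# The Internet Archive identifier below is hypothetical.
master = {
    match_key("Duval, Mathias", "Artistic anatomy", "1884"): "artisticanatomy00duva",
}

def is_duplicate(author: str, title: str, year: str) -> bool:
    """True if an apparently identical edition is already on the list."""
    return match_key(author, title, year) in master
```

A dictionary lookup is an exact analogue of VLOOKUP’s exact-match mode, but scales better than a spreadsheet as the master list grows.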
For each record to be ingested into Preservica, a “stub MARC record” is then created, containing the unique system identifier, and loaded into our bibliographic database. These stub records are used as a match point, so our bibliographic database only loads the full records of works we want to digitise or harvest from the Internet Archive.
Identifying Previously Digitised Medical Books
This process also allows us to identify those we can immediately harvest from the Internet Archive into our own digital library. Although all the monographs and pamphlets currently part of the Medical Heritage Library collection will be harvested, we are now finding relevant titles that have never been included in that collection. Comparing medical history collections from all the partners allows us to identify these titles in the larger corpus of the Internet Archive’s 19th-century collections.
Once de-duplication is complete for a collection, the catalogue records are loaded into Sierra, where they are assigned unique identifiers that will be used throughout the digitisation process. When the books are delivered for digitisation, they are accompanied by electronic “scan lists” that provide the unique IDs of all books in a shipment, allowing the Internet Archive to harvest records from the Sierra database using Z39.50. The Internet Archive can then digitise the items and create all the dissemination formats. Once this is complete, anyone can access the content either by using the Internet Archive browser interface or by downloading content using an open API.
Creating a Medical Heritage Library Mirror at the Wellcome Library
All of the Medical Heritage Library content will be harvested and ingested into the Wellcome Library’s digital library system. Two new harvesting workflows were developed in Goobi to achieve this.
Items newly digitised via the UK-MHL Project are assigned a ‘collection identifier’ on the Internet Archive site, and our workflow tracking system, Goobi, has been developed to poll the Internet Archive site for records that display this identifier. Records with this identifier are automatically harvested once they are discovered, although we have enforced a 25-day automatic waiting period from date of upload. This is to ensure that the Internet Archive can complete its quality-control process and creation of dissemination files. The unique Sierra ID number (known as a .b number), allocated at the point of digitisation, allows all the content to be linked together in the Wellcome’s systems.
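The polling and 25-day waiting-period logic can be sketched as follows. The collection name, query parameters and date handling are assumptions for illustration; the real workflow lives inside Goobi, and the Internet Archive’s advanced-search endpoint is shown only as one plausible way to discover new records.

```python
from datetime import datetime, timedelta

# Hypothetical collection identifier; the real one is assigned on the
# Internet Archive side when UK-MHL items are uploaded.
COLLECTION = "ukmhl"
WAIT = timedelta(days=25)  # give IA time to finish QC and derivatives

def search_url(collection: str) -> str:
    """One way to poll the Internet Archive: an advanced-search query
    returning identifier and upload date for every item in a collection."""
    return ("https://archive.org/advancedsearch.php"
            f"?q=collection:{collection}"
            "&fl[]=identifier&fl[]=publicdate&output=json&rows=500")

def ready_to_harvest(records, now=None):
    """Filter search results down to items whose upload date is at least
    25 days old, implementing the automatic waiting period."""
    now = now or datetime.utcnow()
    harvestable = []
    for rec in records:
        # publicdate arrives as an ISO timestamp, sometimes Z-suffixed.
        uploaded = datetime.fromisoformat(rec["publicdate"].rstrip("Z"))
        if now - uploaded >= WAIT:
            harvestable.append(rec["identifier"])
    return harvestable
```

Anything passing the filter can then be fetched and matched to its Sierra record via the .b number embedded in the metadata.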
The second workflow harvests items that are already on the Internet Archive but were not digitised as part of the UK-MHL Project. Because these items were not digitised by the project, their Internet Archive records do not contain the unique Sierra .b number, so they cannot be harvested automatically via a collection identifier: there would be no way to link them to the metadata already held in Sierra. In this case, the records downloaded to create the ‘master list’ for de-duplication are loaded into Goobi first; each contains both a Sierra .b number and the Internet Archive’s pre-existing identifier, providing a match point. Goobi can then use the Internet Archive identifier to harvest the content, and match content to metadata using the associated Sierra .b number.
Goobi acts as a gateway to the internal network, where the content is ultimately stored. At this point, the two workflows converge and follow the same series of steps.
Figure 2: Medical Evidence in Railway Accidents by John Charles Hall, 1868 
Once the content has been harvested, Goobi imports the structural metadata from the Internet Archive “scandata.xml” file into its database. This includes pagination information and logical structures such as covers, table of contents, title pages, and similar divisions, as identified and marked up by the Internet Archive. The OCR (Optical Character Recognition) file, downloaded as a raw Abbyy.gz file, is used by Intranda to create Analyzed Layout and Text Object (ALTO) XML files, the basis of our full-text database in the player (for ‘search within’ functionality), and Encore (for full-text search across all items, including snippet display in the results list).
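Reading pagination and page types out of a scandata.xml file can be sketched with the standard library. The fragment below assumes the common un-namespaced layout; real files vary, and some carry an XML namespace that would need handling.

```python
import xml.etree.ElementTree as ET

# A cut-down scandata.xml fragment in the common un-namespaced layout.
# Element names follow the Internet Archive convention; the leaf numbers
# and page types here are invented for illustration.
SCANDATA = """\
<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="7"><pageType>Title</pageType></page>
    <page leafNum="9"><pageType>Normal</pageType><pageNumber>1</pageNumber></page>
  </pageData>
</book>"""

def structural_pages(xml_text):
    """Extract leaf number, page type and printed page number per page,
    the kind of structural metadata mapped into the workflow database."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.find("pageData"):
        pages.append({
            "leaf": int(page.get("leafNum")),
            "type": page.findtext("pageType"),
            "number": page.findtext("pageNumber"),  # None if unpaginated
        })
    return pages
```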
Goobi runs jpylyzer, a JPEG 2000 validation tool, to ensure all the image files are valid JPEG 2000 Part 1 (.jp2) files before triggering automated ingest into Preservica. Once Preservica has ingested and characterised the files, it returns administrative metadata to Goobi, including unique IDs for each image file, which is merged into the Goobi database.
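jpylyzer performs a full validation of the file’s boxes and codestream. As an illustration of what the very first, cheapest check involves, a conformant .jp2 file must begin with a fixed 12-byte signature box; the helper below is only a sanity check, not a substitute for jpylyzer.

```python
# The JPEG 2000 Part 1 (.jp2) signature box: box length 0x0000000C,
# box type 'jP  ', and the fixed content 0x0D0A870A.
JP2_SIGNATURE = bytes([0x00, 0x00, 0x00, 0x0C,
                       0x6A, 0x50, 0x20, 0x20,
                       0x0D, 0x0A, 0x87, 0x0A])

def looks_like_jp2(first_bytes: bytes) -> bool:
    """Quick pre-check on the opening bytes of a file.

    In practice you would call this with open(path, "rb").read(12);
    full validation (profile, codestream integrity) is jpylyzer's job.
    """
    return first_bytes[:12] == JP2_SIGNATURE
```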
Once all the steps have been completed to bring together the information required by the digital delivery system, Goobi exports a METS file that contains the administrative data (image sequence, pagination, logical structure, filenames, file IDs, references to ALTO files, etc.). The METS file is exported to the Digital Delivery System (DDS) and stored alongside the ALTO files. The METS files are not considered preservation objects in themselves and so are not stored in Preservica.
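The shape of such an export can be sketched as a bare-bones METS skeleton: a fileSec listing the images and a physical structMap giving their sequence. The grouping, attribute values and identifier patterns below are illustrative assumptions, not the Goobi export profile, which carries far more (logical structMap, ALTO references, Preservica file IDs).

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)

def minimal_mets(bnumber, image_files):
    """Build a minimal METS document for a sequence of page images."""
    mets = ET.Element(f"{{{METS}}}mets", {"OBJID": bnumber})
    file_grp = ET.SubElement(
        ET.SubElement(mets, f"{{{METS}}}fileSec"),
        f"{{{METS}}}fileGrp", {"USE": "OBJECTS"})
    struct = ET.SubElement(mets, f"{{{METS}}}structMap", {"TYPE": "PHYSICAL"})
    book = ET.SubElement(struct, f"{{{METS}}}div", {"TYPE": "Monograph"})
    for order, name in enumerate(image_files, start=1):
        fid = f"FILE_{order:04d}"  # hypothetical file ID pattern
        f = ET.SubElement(file_grp, f"{{{METS}}}file", {"ID": fid})
        ET.SubElement(f, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
        # One physical div per page, pointing back at its image file.
        page = ET.SubElement(book, f"{{{METS}}}div",
                             {"TYPE": "Page", "ORDER": str(order)})
        ET.SubElement(page, f"{{{METS}}}fptr", {"FILEID": fid})
    return ET.tostring(mets, encoding="unicode")
```

The fptr elements are what let a viewer such as the player walk the page sequence and resolve each page to its image.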
- Generate or import catalogue records
  UK-MHL (new digitisation) workflow: import externally sourced MARC records to Sierra
  Internet Archive (existing content) workflow: import all MHL records harvested from the Internet Archive
- Flag unique records as “to be digitised” or “to harvest”
- Import Internet Archive records into Goobi from Sierra
- Identify and request downloads from Internet Archive
  UK-MHL workflow: poll the Internet Archive for any new UK-MHL content and download it
  Internet Archive workflow: search the Internet Archive for specific IDs and download content
- Transfer content to Wellcome network: download relevant content to Goobi, transfer to the internal network
- Validate images with the jpylyzer validation script
- Add structural metadata: map structural metadata to the Goobi database
- Create ALTO files (Intranda, external service): create ALTO files from raw OCR, and export to the digital delivery system
- Ingest images to Preservica: ingest images and descriptive metadata to Preservica
- Add administrative metadata: export administrative metadata from Preservica, and map to the Goobi database
- Create and export METS: create and export METS files via Goobi to the digital delivery system
- Delete unnecessary files: delete the image and OCR files originally harvested from the Internet Archive Web site and no longer required on the Goobi servers
Table 2: Tasks and responsibilities
Aside from the aim of creating a UK Medical Heritage Library, the key principle behind this project has been to apply high levels of automation to the acquisition and processing of content, the purpose being to create a scalable and sustainable activity. The project was built by extending and developing existing tools and systems. Having created automated processes for this project, we are now in a position to apply the work to future projects.
The task of automation was made simpler by the metadata that the Internet Archive created. The structural and raw OCR files that are available from the Internet Archive Web site could be processed to create METS and ALTO files automatically.
Although automation of many of the basic tasks was achieved, human effort was still required for some of the processes, such as the metadata harvest and import, and setting up the de-duplication process.
References
- Jisc http://www.jisc.ac.uk/
- Wellcome Trust http://www.wellcome.ac.uk/
- Scanning Services: Digitizing Print Collections with the Internet Archive https://archive.org/scanning
- Christy Henshaw, Robert Kiley. “The Wellcome Library, Digital”. July 2013, Ariadne Issue 71
- Mathias Duval, 1884. Artistic Anatomy, page 18
- John Charles Hall, 1868. Medical Evidence in Railway Accidents http://wellcomelibrary.org/player/b20400640
Web site: http://wellcomelibrary.org/
Christy Henshaw has managed the Wellcome Library’s digitisation programme since 2007.
Dave Thompson manages the systems associated with digitisation, Goobi and Preservica, on a day-to-day basis.
João Baleia supports public-facing library systems and resolves metadata-related problems.