Ernst Mayr Library
Museum of Comparative Zoology
26 Oxford St.
Cambridge, MA 02138
Retooling Special Collections Digitisation in the Age of Mass Scanning
The Biodiversity Heritage Library (BHL) is a consortium of 12 natural history and botanical libraries that co-operate to digitise the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as part of a global 'biodiversity commons.' The participating libraries hold more than two million volumes of biodiversity literature collected over 200 years to support the work of scientists, researchers and students in their home institutions and throughout the world. BHL also serves as the foundational literature component of the Encyclopedia of Life (EOL).
Much has been achieved through conventional mass scanning technologies and practices, but a significant portion of early biodiversity literature is rare, valuable and sometimes fragile. A book may, for example, be too large for most scanning machines, or contain folded maps or illustrations that need special handling. In 2008 BHL partners successfully applied for a $40,000 planning grant from the Institute of Museum and Library Services (IMLS) to identify and develop a cost-effective and efficient large-scale digitisation workflow and to explore ways to enhance metadata for library materials that are designated as 'special collections.' A recent review of special collections digitisation notes that because of the heterogeneity of these collections, it may be difficult to scale up to rapid mass scanning unless workflow can be streamlined by assembling materials with relatively uniform characteristics.
The lead applicants were the Harvard University Botany Libraries and the Ernst Mayr Library of the Museum of Comparative Zoology (Cambridge, MA). The partner institution libraries included the American Museum of Natural History (New York, NY), the Missouri Botanical Garden (St. Louis, MO), The New York Botanical Garden (Bronx, NY), and the Academy of Natural Sciences (Philadelphia, PA), while the scanning partner was the Internet Archive (San Francisco, CA) — all participants in the BHL. The group actively consulted with other librarians and technology specialists across the country and invited several experts to its meetings to address the goals of the planning grant.
Infrastructure and Baseline Data
The group developed a timeline, a tentative schedule for meetings, and reviewed the budget to keep all tasks on track. A private wiki was established to serve as a repository for all of the documents, notes, links, and other information compiled and contributed by the partners. Two bibliographies were compiled and posted that included papers on special collections digitisation projects and also social tagging applications to facilitate the discussion.
Definition of Special Collections
The meaning of the term 'special collections' varies in different organisations, but it generally refers to rare books, photographs, moving images, fine arts, archives and manuscripts, memorabilia and realia in libraries. Rare books may be fragile, have unique characteristics such as autographs or handwritten notes made by well-known users, have been printed before 1850, be represented by few copies, be known to be of high monetary value, be smaller or larger than an average book, or be valuable because they are associated with an institution or individual.
Meetings and Tests
The partners met three times between November 2008 and October 2009, and each agenda centred on the specific goals of the grant for that period. The first meeting was hosted by Harvard's Museum of Comparative Zoology (MCZ). Invited guests included librarians from the Harvard College Library (HCL) Imaging Services, the Smithsonian Institution Libraries (SIL), the Marine Biological Laboratory, Woods Hole Oceanographic Institute (MBL/WHOI) and the Open Knowledge Commons (OKC), as well as representatives from Digital Transitions and Kirtas Technologies. Each participant was asked to present a summary of their digital projects, including details on staff, workflows, technologies and all other issues and costs associated with scanning oversized, fragile, and other categories of special collections. As the partners and invited specialists shared their data, it became apparent that terminology varied and that each institution calculated costs in very different ways. The group agreed to compile a glossary of terms and create a checklist of attributes used to determine what is 'rare' and thus requires special handling. The group also agreed to develop a spreadsheet to allow for more consistent cost comparisons (see Table 1).
Table 1: Parameters for comparing digitisation costs
The second meeting was held at the Academy of Natural Sciences (ANS) in Philadelphia in April 2009. The focus of the meeting was to determine parameters for some small-scale tests. It was agreed that the same or similar items would be digitised via various workflows (e.g. institutional facility, commercial facility, in-library facility) to produce reasonable comparisons of methods and associated costs. Vendors were identified to participate in the tests, with one vendor reporting that the company could not actually provide the service. Partners recorded all of the associated costs, recorded workflows, and monitored results for comparison.
It became apparent at the second meeting that the partners needed to calculate the number of volumes in the collections that needed special treatment. The American Museum of Natural History Library and the Smithsonian Institution Libraries provided collection profiles based on size derived from OPAC reports, and the Internet Archive colleagues provided details of rejection rates from some of their scanning sites, with size and condition included as factors. The group also discussed the need to train library staff and scanning technicians to handle special collections material properly. The Internet Archive contributed a 'handling training video' developed with the California Digital Library (CDL). The partners also agreed on a strategy to test the utility of social networking tools by populating a LibraryThing site for BHL and Wikipedia pages that could benefit from links to BHL to extend access to digital collections.
The third meeting was hosted by the Internet Archive at their San Francisco headquarters in October 2009. The partners reviewed all of the test results by vendor and associated costs and studied the percentages of items rejected in a mass scanning process. There was a discussion on how other workflows might improve success rates based on experiences at Harvard, Missouri Botanical Garden, and the Smithsonian Institution.
Book Scanning Decision Factors
Several elements are considered during the evaluation process to determine whether a volume is suitable for scanning at a high-volume centre or must be handled in a different workflow. These factors include the general condition as determined by the fragility, size, and value of a book. Mass scanning facilities usually have minimum and maximum size limits, rejecting books under 4" (10.16 cm) or over 18" (45.72 cm). Book bindings that are broken or have rot or water damage or other peculiarities (latches, for example) render the book unsuitable for mass scanning. Books with a tight binding, faint text, transparent paper, or other characteristics that may cause data loss are often rejected by mass scanning facilities. Other causes of rejection include uncut pages and foldouts that exceed scanning capabilities. The partners developed a detailed key to decision factors that are used as filters to identify books that can be sent for mass scanning.
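The decision factors listed above can be pictured as a simple filter. The following is a minimal sketch only, not the partners' actual key: the field names, the choice of the longest side as the measured dimension, and the treatment of rarity as an automatic rejection are all assumptions made for illustration. Only the 4" and 18" limits and the named conditions come from the text.

```python
from dataclasses import dataclass
from typing import List

# Size limits quoted in the text for typical mass scanning facilities (inches).
MIN_SIZE_IN = 4.0
MAX_SIZE_IN = 18.0


@dataclass
class Volume:
    """Hypothetical record of the attributes a selector might note for a book."""
    longest_side_in: float           # assumed measure; the text does not say which dimension
    binding_damaged: bool = False    # broken binding, rot, water damage, latches
    tight_binding: bool = False
    faint_text: bool = False
    transparent_paper: bool = False
    uncut_pages: bool = False
    oversize_foldouts: bool = False  # foldouts that exceed scanning capabilities
    rare_or_high_value: bool = False # assumed security filter for rare or unique items


def rejection_reasons(v: Volume) -> List[str]:
    """Return every reason the volume would be filtered out of mass scanning."""
    reasons = []
    if v.longest_side_in < MIN_SIZE_IN:
        reasons.append("below minimum size")
    if v.longest_side_in > MAX_SIZE_IN:
        reasons.append("exceeds maximum size")
    if v.binding_damaged:
        reasons.append("damaged or unusual binding")
    if v.tight_binding or v.faint_text or v.transparent_paper:
        reasons.append("risk of data loss")
    if v.uncut_pages:
        reasons.append("uncut pages")
    if v.oversize_foldouts:
        reasons.append("oversize foldouts")
    if v.rare_or_high_value:
        reasons.append("rare or high value")
    return reasons


def suitable_for_mass_scanning(v: Volume) -> bool:
    """A volume passes the filter only if no rejection reason applies."""
    return not rejection_reasons(v)
```

Returning the full list of reasons, rather than a bare yes or no, makes it possible to tally why items are rejected, which is the kind of breakdown discussed in the paragraphs that follow.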
Early in the discussions, the Internet Archive reported that at least 22% of the volumes pre-selected for scanning were rejected from the routine scanning process. However, the University of Toronto scanning centre's detailed statistics indicated that the rate was much higher. Their study revealed that nearly half of the materials in the scanning queue were rejected because the margins were either too tight or too narrow — generally the result of rebinding. Other common rejection factors included fragile bindings, brittle paper, and uncut pages. Overall in the natural history libraries, the items libraries initially selected for scanning might be rejected because of the condition of the binding or text block, size, the inability to deal with foldouts or loose pages, and/or security concerns for rare or unique items (see Figure 1).
Figure 1: Rejection rates for general collections at point of scanning (post-filtering)
The rejection rate for special collections materials, and particularly for natural history materials, was expected to be higher because these materials consist of a higher proportion of commercially valuable books as well as books with foldouts and with other characteristics of rejected items. The Smithsonian Libraries and the American Museum of Natural History Library assessed the presence of oversize volumes in general and special collections that had not yet been filtered for scanning suitability. The result was that 3.86% of the items in the Smithsonian Institution Libraries' scanning lists selected from the general collections were oversize whereas 11.49% of the items in the American Museum of Natural History Library's special collections were oversize. These results support the expectation that a higher proportion of special collections materials are likely to be rejected from a routine scanning process.
There are many equipment configurations that can be used to produce high-quality images. The challenge for the partners was to identify equipment, strategies, and workflows to facilitate the successful scanning of special collections materials at an affordable cost in a secure environment. Needless to say, it was difficult to find a commercial vendor who met all these criteria. The test results indicate that there is no inexpensive way to scan special collections material, but the hybrid solution achieves a high rate of success at a moderate cost.
Harvard librarians were fortunate to observe local tests of two scanning robots. The outcomes indicated that the technology was not appropriate for scanning most special collections materials. Human intervention was still critical to minimise damage and ensure accurate page-turning.
Several technologies were compared for scanning oversize or unusual items. The Internet Archive model accommodated foldouts up to 30"w x 20"h (76.2 cm x 50.8 cm), resulted in about a 70% success rate, and averaged $0.20 per page (with a $2.00 charge per foldout). Commercial scanning centres achieved good results at a much higher cost.
Institutional partners provided cost estimates for Financial Year (FY) 2008-2009 based on an average 300-page book, and reported costs ranged from a low of $30 to a high of $52 per book. Commercial vendors working with some oversize items showed a low cost of $51 per book and a high cost of $120 per book. Their costs for digitising oversize materials ranged from a low of $123 to a high of $7,000 per volume. The highest cost was for an item for which each page required individual handling using a digital camera, since a scanning bed had proved inadequate.
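As a rough worked example of the Internet Archive pricing quoted above, the sketch below estimates the cost of one volume. It assumes that the $2.00 foldout charge is added on top of the per-page average, which the text does not state explicitly; the function names are hypothetical. Amounts are held in whole cents to avoid rounding surprises.

```python
PER_PAGE_CENTS = 20      # $0.20 average per page, from the text
PER_FOLDOUT_CENTS = 200  # $2.00 per foldout, from the text


def estimate_ia_cost_cents(pages: int, foldouts: int) -> int:
    """Estimated cost in cents, assuming the foldout charge is a surcharge."""
    return pages * PER_PAGE_CENTS + foldouts * PER_FOLDOUT_CENTS


def format_dollars(cents: int) -> str:
    """Render a whole-cent amount as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


# A 300-page book with 5 foldouts: 300 * 0.20 + 5 * 2.00 = $70.00
print(format_dollars(estimate_ia_cost_cents(300, 5)))
```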
The same title was scanned by different vendors so partners could compare the costs and quality of the product. A volume of The Memoirs of the Museum of Comparative Zoology at Harvard College was scanned at an institutional boutique facility, an Internet Archive scanning facility, and a commercial vendor. A 'hybrid solution' was achieved by sending digital images of oversize foldouts scanned in an institutional facility to the Internet Archive to be 'stitched' into the text scanned by the Internet Archive.
The Internet Archive solution cost $0.11 per page, the hybrid solution was $0.60 per page, the institutional boutique solution $0.80 per page. The commercial vendor determined that their facility could not produce a cost-effective scan, but another partner reported that a similar volume was scanned by a commercial vendor for more than $9.00 per page. It should be noted that costs will vary based on the number of foldouts or other pages requiring special handling. For a volume with many foldouts that are too large to be scanned within the Internet Archive parameters, the hybrid solution may be significantly more cost-effective.
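To see what these per-page figures mean per volume, the sketch below scales them to a whole book. The rates are those reported for the single test volume; applying them to a 300-page book (the average size used for the institutional estimates) is an assumption for illustration, and the calculation ignores the number of foldouts, which the text notes will change the real cost.

```python
# Per-page costs reported for the test volume, in cents.
RATES_CENTS_PER_PAGE = {
    "Internet Archive": 11,
    "hybrid": 60,
    "institutional boutique": 80,
}


def cost_per_book_cents(pages: int) -> dict:
    """Scale each per-page rate to a book of the given length."""
    return {name: rate * pages for name, rate in RATES_CENTS_PER_PAGE.items()}


def ranked_by_cost(pages: int) -> list:
    """Return (workflow, cents) pairs ordered from cheapest to most expensive."""
    return sorted(cost_per_book_cents(pages).items(), key=lambda item: item[1])


for workflow, cents in ranked_by_cost(300):
    print(f"{workflow}: ${cents / 100:.2f}")
```

At 300 pages this works out to $33.00, $180.00 and $240.00 respectively. The comparison covers price alone; it says nothing about whether each workflow can capture every foldout in the volume.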
In-library and institutional scanning centres produce good results at intermediate (in-library) to high (institutional) cost, have a much lower rejection rate, and eliminate the cost of moving materials off-site. There are affordable in-library equipment configurations that achieve excellent results. In-library facilities offered four equipment configurations:
- copy stands with cameras;
- fixed, face-up, overhead book scanners;
- flatbed scanners;
- robot scanners.
Lighted copy stands were fitted with a variety of cameras, including a Cambo camera at Missouri Botanical Garden, and Konica, Minolta, and Jenoptik cameras at The New York Botanical Garden. The Natural History Museum, London, used a hybrid approach, using a Grazer Book Cradle and a large-format digital camera. The Smithsonian Institution Libraries used a Phase One DT camera and a P65 digital scanning back. Overhead book scanners deployed included Indus Planetary scanners, an ImageWare Bookeye Repro A2 and a KIC II scanning system. Flatbed scanners ranged from large-format flatbeds to smaller sheet feed scanners. Reports were provided on tests of the Treventus and Scanbot robot scanners, though they were not specifically tested as part of this grant.
The institutional boutique solution offered state-of-the-art technology and experienced technicians. It was very expensive, but achieved excellent results. In-library solutions also provided very good results at a lower overall cost. The least expensive solution turned out to be the hybrid solution. Foldouts were scanned with high-end equipment, either in-house or by a vendor, and then 'stitched in' to the text through the routine Internet Archive workflow. However, the hybrid solution required more handling of the volume, thus increasing the likelihood of damage. When pages are inserted into an existing file, the associated metadata may be distorted and must be corrected by hand. Thus, there were potential added costs for quality review and metadata correction. The partners made the general observation that while costs varied widely, they have come down in recent years.
Web Services and Social Networking
Web services offer intriguing opportunities to link detailed information to specific notable books. Crowd-sourced social tagging may be a cost-effective way to enhance metadata, particularly when librarian curators manage the tagging. The bibliography on the Biodiversity Heritage Library IMLS Wiki facilitated the group's discussion of how to exploit social media. Most agreed that there is value added when experts are involved, but reaction was mixed with regard to generalist tagging.
A profile for BHL was established at LibraryThing. One partner worked with a LibraryThing developer to import records for books without ISBNs. The solution was still tedious and time-consuming and required extensive human intervention to correct the metadata. The process required manual entry of all of the imported tags for each title into a .csv file, and the typing in of all of the URLs. Since BHL content generally predates the establishment of ISBNs and is represented predominantly by serial records, the partners agreed that batch loads of BHL records would be impractical and costly, thus reducing the value of a LibraryThing profile.
Partners identified biographical profiles of important scientists in Wikipedia and linked BHL digital content to the cited references for approximately 85 titles. Partners also used Wikipedia entries devoted to important books to determine if the public tagging of these titles would offer some added value by leading the reader to the complete work. An example is available from Wikipedia.
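The .csv file described above pairs each title with its tags and a URL. The sketch below shows one way such a file could be assembled with Python's standard csv module. The column names, the semicolon tag separator, the example tags and the example.org URL are all hypothetical; the layout LibraryThing actually required is not given in the text.

```python
import csv
import io


def build_import_csv(records):
    """Serialise title/tags/url records to CSV text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "tags", "url"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "title": record["title"],
            # Tags are joined into one cell; the separator is an assumption.
            "tags": "; ".join(record["tags"]),
            "url": record["url"],
        })
    return buffer.getvalue()


# Illustrative record only: the tags and URL are placeholders, not BHL data.
sample = [{
    "title": "Systema Naturae",
    "tags": ["taxonomy", "natural history"],
    "url": "https://example.org/item/1",
}]
print(build_import_csv(sample))
```

Even with the file generated rather than typed, the tags and URLs themselves would still have to be gathered for each title, which is where the manual effort reported by the partners lay.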
These links did not generate tagging activity during the study, but they did increase referrals from Wikipedia to the BHL site (see Figure 2).
Figure 2: Increasing referrals from Wikipedia to BHL after content links added
BHL content on Open Library was reviewed for analysis but there was no activity to report during the grant period. Wikipedia and LibraryThing seem to be the most promising places to gather user-generated feedback and tagging.
Portable Scanning Unit
BHL partners have been scanning the biodiversity literature for more than four years and report that the inventory of easily scanned volumes (known as the low-hanging fruit) is nearly exhausted. However, thousands of volumes remain in the library stacks that should be made available to the scientific community. The partners understand that a variety of digital processes can be employed to scan special collections materials successfully, and that in-house scanning achieves a much higher rate of success but at higher cost than conventional scanning workflows. Therefore, it is important to minimise duplication of materials and effort.
BHL partners envision a co-ordinated scanning project that will provide temporary on-site scanning of pre-selected special collections materials by deploying a portable scanning unit and trained technician to partners' libraries for a limited period. The on-site service allows for a potentially cost-effective solution for scanning fragile, unique or valuable items that do not lend themselves to over-handling and shipping.
Creating the BHL collection has demonstrated that good bibliographic records are essential in building a shared catalogue. All project partners must ensure that all records are reviewed and improved as necessary. Partners agree that enhancing metadata is a better investment than packing and shipping special collections materials to off-site facilities.
As part of the planning grant, the Internet Archive designed a reusable shipping container for a proposed Portable Scanning Unit (PSU) service. The Internet Archive offered detailed specifications for the costs associated with delivery of one Scribe and for the electrical, space and technical needs to support a single PSU. Internet Archive network requirements include an open IP address, not blocked by a firewall, for a gateway connection box to a Scribe; Z39.50 access to catalogue records (metadata can instead be created before or after scanning and added to the item, but this is less efficient); and 1.5 megabits of bandwidth available 24/7. This solution has yet to be tested, although Harvard will receive a temporary Scribe this summer.
One unique aspect of biodiversity research is that historical data are as valuable and essential as current literature. Improving the digitisation of the literature in libraries' special collections makes this relatively rare information readily accessible. The focus of the project has been to identify cost-effective ways to expand the types of biodiversity literature delivered virtually and to make them easier to 'discover' by scientists, curators, taxonomists, conservationists, and ecologists from all over the world. The results are readily applicable to other disciplines that rely on special collections materials.
The digital collections produced can be integrated with other major scientific and bibliographic databases and collections using contemporary informatics solutions. Policies and relationships must be established to ensure the content remains freely accessible. The scanned content created by BHL is used, and will continue to be used, in a variety of projects. BHL content is part of the Digging Into Data challenge. BHL looks forward to working with the planned Digital Public Library of America. Enhanced access and distribution of special collections content and the associated metadata are essential, not only to biodiversity research, but also to other disciplines in other branches of science and the humanities where they have broad applications.
The partners agreed that the planning grant provided valuable information that can be applied to future projects, and enhanced working relationships among the partners and consultants.
Funding for this study was received from the Institute of Museum and Library Services (Grant number LG-50-08-0058-08; Principal Investigator: James Hanken). We would also like to acknowledge the help of all of our partner colleagues and specialists from other organisations: Danianne Mizzy and Eileen Mathias of the Academy of Natural Sciences, Philadelphia, staff of the New York Botanical Garden, Doug Holland of the Missouri Botanical Garden, Robert Miller of the Internet Archive, staff of the American Museum of Natural History, Danielle Castronovo of the California Academy of Sciences, Aaron Chaletzky of the Library of Congress, Bill Comstock of Harvard University, colleagues at the Smithsonian Institution, Maura Marx of the Open Knowledge Commons, Diane Rielinger of the Woods Hole Oceanographic Institute Library and Jane Smith of the Natural History Museum, London. We would also like to thank the vendors who shared workflows and ideas with all of us: Linda Andelman of Parrot Digigraphic, Lofti Belkhir of Kirtas Technologies, Rudiger Klepsch of Image Ware, David Sempberger of Boston Photo Imaging, and Peter Siegel of Digital Transitions, Inc.
- Biodiversity Heritage Library http://www.biodiversitylibrary.org/
- Gwinn, Nancy E. and Constance A. Rinaldo. 2009. The Biodiversity Heritage Library: Sharing biodiversity with the world. IFLA Journal 35(1): 25-34.
- Encyclopedia of Life http://www.eol.org/
- Institute of Museum and Library Services http://www.imls.gov/
- Rare Books and Manuscripts Section, Association of College and Research Libraries Task Force on Digitization in Special Collections.
- Erway, Ricky. 2011. Rapid Capture: Faster Throughput in Digitization of Special Collections. Dublin, Ohio: OCLC Research.
- Norman, Jeremy. 2004. What is a Rare Book? The Six Criteria of Rarity in Antiquarian Books http://www.historyofscience.com/traditions/rare-book.php
- ABC for Book Collectors. http://www.ilab.org/services/abcforbookcollectors.php
- Glaister, G.A. Encyclopedia of the book. New Castle, Del.: Oak Knoll Press, etc., 1996.
- Tagging, Folksonomy and Art Museums: Early Experiments and Ongoing Research http://dlist.sir.arizona.edu/2594/
- Studying Social Tagging and Folksonomy: A Review and Framework http://dlist.sir.arizona.edu/2595/
- BioDivLibrary: LibraryThing http://www.librarything.com/profile/BioDivLibrary
- Systema Naturae - Wikipedia, the free encyclopedia, accessed 5 June 2011 http://en.wikipedia.org/wiki/Systema_Naturae
- Open Library http://openlibrary.org/
- Pilsk, S. C., M.A. Person, J.M. deVeer, J.F. Furfey and M. R. Kalfatovic. 2010. The Biodiversity Heritage Library: Advancing Metadata Practices in a Collaborative Digital Library. Journal of Library Metadata 10: 136-155.
- Digging into Data Challenge http://www.diggingintodata.org/
- Digital Public Library of America http://cyber.law.harvard.edu/research/dpla
Ernst Mayr Library
Museum of Comparative Zoology
26 Oxford St.
Cambridge, MA 02138
Constance Rinaldo has been the Librarian of the Ernst Mayr Library of the Museum of Comparative Zoology at Harvard University since 1999 and in addition to her MLS, has an MS in Zoology. Currently she also serves as the Executive Secretary of the Biodiversity Heritage Library. Prior to her work at Harvard, she was the Head of Collections at the Biomedical Libraries at Dartmouth College. Connie is passionate about natural history and making library collections open and accessible.
22 Divinity Avenue
Cambridge, MA 02138
Judith Warnement has served as the Librarian of the Harvard University Herbaria's Botany Libraries since 1989 and is interested in bibliographic description, conservation, and digitisation of botanical literature. She serves on the Institutional Council of the Biodiversity Heritage Library and is an active member of the Council on Botanical and Horticultural Libraries and an associate member of the European Botanical and Horticultural Libraries Group. Judy earned a Master's in Library and Information Science from Case Western Reserve University.
Department of Library Services
American Museum of Natural History
79th Street and Central Park West
New York, NY 10024-5192
Tom Baione is the Boeschenstein Director of Library Services at New York's American Museum of Natural History. Tom's work in the Museum's Library began in 1995, when he started in the Library's Special Collections Unit; most recently he was in charge of the Library's Research Services and was appointed director in 2010.
Smithsonian Institution Libraries
Martin R. Kalfatovic is the Assistant Director, Digital Services Division at Smithsonian Institution Libraries. The Digital Services Division oversees the Libraries' digitisation efforts, which include digital editions and collections, online exhibitions, and other Web site content. Current projects include work on metadata, standards and intellectual property issues. As the Smithsonian's co-ordinator and Deputy Director for the Biodiversity Heritage Library (BHL), he oversees the Smithsonian's contributions.
The LuEsther T. Mertz Library, The New York Botanical Garden
Bronx, New York 10458
Susan Fraser is the Director of The LuEsther T. Mertz Library at The New York Botanical Garden where she oversees an exceptional staff and an outstanding collection of print and non-print resources. She also plays a pivotal role in the exhibition programme in the William D. Rondina and Giovanni Foroni LoFaro Gallery. Susan received her MLS from Columbia University and is a member of the Academy of Certified Archivists. She is an active member of the Council on Botanical and Horticultural Libraries (CBHL), having served on the Board from 2005 to 2008. She is currently involved in several committees and has been the CBHL Archivist since 2000.