Peculiarities of Digitising Materials from the Collections of the National Academy of Sciences, Armenia

Alan Hopkinson; Tigran Zargaryan

Alan Hopkinson and Tigran Zargaryan give an overview of their experience of digitising paper-based materials in the Fundamental Scientific Library of the National Academy of Sciences, Armenia, including some of the obstacles encountered during image processing and optical character recognition.

Writing, which first appeared as cuneiform protocols and then emerged in manuscript form and as printed material, is now entering a new stage in its development in the form of electronic publications.

The Internet has drastically changed our understanding of access to library resources and of publication schemas, and has introduced brand new ways of delivering information. As a result, the amount of material published only in electronic form is continuously increasing, alongside wide-scale conversion of paper-based material to digital formats. This tendency will only intensify in the coming decades, covering more and more geographical areas, countries and language groups. More and more librarians, image-processing specialists, and metadata creators will be involved in this process. Information Science specialists will develop and propose new algorithms for e-resource description, information discovery and retrieval.

Background: The Fundamental Scientific Library

The Fundamental Scientific Library (FSL) of the National Academy of Sciences (NAS) is the main and largest repository of scientific publications in the Republic of Armenia in the fields of Humanities & Social Sciences, Exact & Natural Sciences, Technology, and Medicine. The rich and diverse collections of the Library include over 3 million publications in Armenian, European, Asian, and Slavic languages. Besides modern publications, the collections of incunabula in Armenian, Gothic and old Latin are in high demand among scholars from many countries.

The first Armenian book was printed in Venice in 1512 by Yakob Meghapart (Jacob the Sinful). Between 1512 and 1513 he printed five titles: Urbatagirk (Friday Book), Parzaytumar (A Simple Calendar), Pataragatetr (Missal), Altark (An Astrological Treatise) and Tagharan (Song Book). The first Armenian journal, Azdarar (The Monitor Monthly), was published in 1794 in Madras. The first Armenian map, Hamatarac asxarhacoyc (Large World Map: the two hemispheres), was published in 1695 in Amsterdam.

Following this embryonic period, Armenian printing spread rapidly during the 16th-18th centuries into cities where there were significant Armenian populations such as Venice, Constantinople, Rome, Lvov, Milan, Isfahan, Livorno, Marseilles, Amsterdam, Madras, Calcutta, Smyrna, Etchmiadsin, Trieste, Petersburg, Nor Nakhijevan and Astrakhan. Thanks to Armenian bibliographers Dr. Ninel Voskanyan, Dr. Hakob Anasyan, Dr. Arsen Ghazikian, Dr. Hayk Davtyan, Dr. Knarik Korkotyan, Revd. Dr. Vrej Nersessian and other scholars, the early printed book collection has been catalogued to the highest standard in four sequences: Armenian rare books (time period 1512-1800); and Armenian early printed books (time periods: 1801-1850, 1851-1900 and 1901-1920). All these collections are very fragile. The FSL has one of the largest collections of Armenian rare and early printed books and periodicals. In 2012 Armenia will celebrate the 500th anniversary of book printing, and with this celebration in mind, the Scientific Board of FSL decided in 2007:

to establish a digitisation centre in the Fundamental Scientific Library;
to implement a modern scanning and conservation centre for the care of vulnerable resources;
to create high-quality digital copies from the original materials for preservation purposes;
to make the metadata and images of digitised materials (in Portable Document Format (PDF)) freely available via the Web to researchers, students and educators all over the world.

Preservation of Armenian Rare and Early Printed Books

Digitisation projects require high-quality professional digital cameras, a special book-handling system with motorised column, book cradle and book shuttle, and more. Bearing this in mind, in 2008 FSL applied to the Endangered Archives Programme of the British Library for a grant with the application ‘Preservation Through Digitisation of Endangered Armenian Rare Books and Making Them Accessible on the Web’. The grant was approved, Revd. Dr. Vrej Nersessian was appointed as project expert, and Alan Hopkinson of Middlesex University, UK was asked to take charge of the project’s general management. Tigran Zargaryan was appointed as project director.

This project, from the outset, could be characterised as unique for Armenia in its scale, the novelty of the decisions involved and the hardware deployed. The lessons learnt during the project’s implementation and the solutions that were adopted as a result served as a sound basis for initiating other digitisation projects in FSL. During the lifetime of the project, many technical difficulties were resolved and many practical refinements emerged from our mistakes. We hope that this article will serve as a guide for practitioners who are initiating similar projects in their own organisations.

Three principal stages evolved during the project:

photographing the originals using a high-quality digital camera;
saving the images on high-quality DVD discs, producing two copies of each item (a Preservation Copy and an Access Copy); and
mounting the images with relevant metadata on the Web for public access.

Stage One: Photographing Originals

The Endangered Archives Programme [1] and NARA Guidelines [2] both advise that preservation copies be saved in Tagged Image File Format (TIFF). Bear this in mind when purchasing a digital camera, since not all digital cameras can produce this format. If the camera does not support saving images as TIFF files, an alternative is to save them as RAW [3] files and later convert them to TIFF. Preservation copies should not be saved as JPEG (Joint Photographic Experts Group) files and then converted to TIFF, since JPEG compression is lossy and the discarded detail cannot be recovered by conversion.

With this in mind, we ordered digitising equipment from Icam Archive Systems Ltd [4], including a PhaseOne digital camera with 7.2K x 5.4K pixel array producing a 112 MB 24-bit TIFF image. After analysing the existing literature, and drawing upon accumulated experience [1][2][5], as well as taking into account our own tests for digitisation of Armenian rare and early printed books and periodicals, the following strategy was adopted:

Preservation copies will be produced in TIFF format and burned on DVDs with gold surface.
Access copies will be produced in TIFF format and burned on DVDs with silver surface.
After preliminary image processing (cropping out unnecessary edges, reducing image size, assembling all images of one book under its title), the JPEG version of a book will be produced for mounting on the Web.
During photographing, a colour chart will also be used to monitor the colours being reproduced and to help ensure the quality and consistency of images.
Images cannot be passed through the optical character recognition (OCR) process, since no character recognition system has yet been developed for Armenian typography of the 17th to 19th centuries.
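
As a rough check on the file size quoted for the PhaseOne camera, the uncompressed size of a 24-bit image follows directly from its pixel dimensions. The sketch below assumes the ‘7.2K x 5.4K’ array is exactly 7200 x 5400 pixels (a rounding assumption on our part) and ignores TIFF header overhead; the function name is hypothetical.

```python
def uncompressed_size_bytes(width_px: int, height_px: int, bits_per_pixel: int = 24) -> int:
    """Raw pixel payload of an uncompressed image, ignoring file header overhead."""
    return width_px * height_px * bits_per_pixel // 8


# Assumed exact dimensions for the "7.2K x 5.4K" pixel array.
size = uncompressed_size_bytes(7200, 5400)
print(f"{size / 1_000_000:.1f} million bytes, {size / 2**20:.1f} MiB")
```

Under these assumptions the payload is about 116.6 million bytes, or roughly 111 MiB, which is consistent with the quoted figure of about 112 MB per image.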

Stage One: Lessons Learnt

As might be expected of a project so new to its participants, a series of obstacles was encountered and solutions devised.

Camera Shutter

The most vulnerable part of the camera is its shutter. After several tens of thousands of shots (usually 70,000 to 100,000), the camera shutter must be replaced. When preparing a budget for a digitisation project, it is important to consult the hardware supplier on likely repair and maintenance costs, and to include these expenses as a separate line.
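
The budgeting point can be illustrated with a small calculation. This is a minimal sketch only: the function name and the project size of 250,000 shots are hypothetical, and the shutter life figures are the range quoted above rather than a supplier's specification.

```python
import math


def shutter_replacements(total_shots: int, shutter_life: int) -> int:
    """Replacements needed beyond the shutter the camera is delivered with."""
    if total_shots <= 0:
        return 0
    return max(0, math.ceil(total_shots / shutter_life) - 1)


# Hypothetical project of 250,000 shots, at both ends of the quoted range.
for life in (70_000, 100_000):
    print(life, shutter_replacements(250_000, life))
```

Running the estimate at both ends of the range gives a pessimistic and an optimistic figure to put to the supplier when costing the maintenance line.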

Figure 1: Digitised image with colour checker chart

Colour Checker

The Colour Checker Chart is a colour calibration tool for photographers in both traditional and digital photography. When used along with colour management software, it allows the operator to calibrate the camera in line with the monitor and printer so as to get an even workflow with accurate colours throughout. When displayed with the image, the Colour Checker allows the viewer to verify that the colours are correct, since a ‘known standard’ colour chart is visible alongside the material. In general it is not necessary to include it with every image (often it appears only with the first page), but the Endangered Archives Programme required the colour chart to be included with each image in order to capture the appearance of the original material as accurately as possible. We are using the Gretag Macbeth ‘mini’ chart, as it is quite small and can be included at the side or bottom of each image. Figure 1 shows an image with an embedded colour checker. More details about understanding and modelling colour can be found on the ‘JISC Digital Media’ page [6].

Documenting Dimensions

To record the size and dimensions of the material being digitised we made use of horizontal and vertical rulers in the first image taken for each volume.

Figure 2: Digitised image with horizontal and vertical rulers

Stage Two: Preserving Image Data

The images produced during the digitisation process must be of high quality, must be preserved for the future, and must also be available to library readers. This means that two copies of the same image, one for preservation and one for access, should be produced. For the preservation copy it was agreed to use 4.7 GB gold DVD-R discs, and for the access copy to use 4.7 GB silver DVD-R discs. For safety reasons it was decided to burn 16x certified DVD-R discs at 8x speed, and according to our measurements the whole process of burning a 4.7 GB disc takes about 15 minutes.

Stage Two: Lessons Learnt

Professional digital cameras can produce large-size high-quality images. If the colour checker chart is embedded, then the size of images will grow drastically. For example, the size of the image in Figure 1, which was photographed using the PhaseOne digital camera with 7.2K x 5.4K pixel array, is 34 MB. If you are digitising hundreds of thousands of pages, you must be prepared for the following:

Depending on the number of pages to be photographed, you will need to obtain DVD-R discs in large quantities.
You will need several terabytes of networked external storage, which will allow you to download images from the camera attached to the workstation and retain them permanently. This frees the workstation hard disks, which is necessary since digital cameras work very fast and the hard disk will otherwise be filled within 3 or 4 days.
You will need several DVD-burning machines, since burning images requires a lot of time, and this is where the bottleneck in image production lies.
You will also need staff and extra computers dedicated to this work.
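
The scale of the disc requirement can be estimated in advance. The following is a minimal sketch under stated assumptions: it uses the 34 MB image size, the nominal 4.7 GB disc capacity and the 15-minute burn time given above, treats MB and GB as decimal units, ignores file system overhead, and takes a hypothetical project of 100,000 images. The function and its parameter names are ours for illustration.

```python
import math

DISC_CAPACITY_BYTES = 4_700_000_000  # nominal 4.7 GB DVD-R, decimal units assumed


def plan_discs(num_images: int, image_size_bytes: int, copies: int = 2,
               burn_minutes_per_disc: float = 15, burners: int = 1) -> dict:
    """Estimate disc count and total burning time for a digitisation run."""
    images_per_disc = DISC_CAPACITY_BYTES // image_size_bytes
    discs_per_copy = math.ceil(num_images / images_per_disc)
    total_discs = discs_per_copy * copies  # preservation copy plus access copy
    burn_hours = total_discs * burn_minutes_per_disc / 60 / burners
    return {
        "images_per_disc": images_per_disc,
        "discs_per_copy": discs_per_copy,
        "total_discs": total_discs,
        "burn_hours": burn_hours,
    }


# Hypothetical run: 100,000 images of 34 MB each, one burner.
plan = plan_discs(num_images=100_000, image_size_bytes=34_000_000)
print(plan)
```

Raising the `burners` argument shows how much elapsed time additional DVD-burning machines would save, which is the bottleneck identified in the list above.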

Stage Three: Mounting Images on the Web

The main goal of the project was the preservation of endangered books and periodicals [7]. However, since the World Wide Web provides the opportunity of unlimited access to e-resources, we decided, as a by-product, to mount all digitised resources on the Web for public use.

Early in the project, one task was to analyse the digital repository software market. FSL, as a member of the Electronic Information for Libraries Consortium (eIFL [8]), is actively involved in various projects designed to promote and advocate a Free/Open Source Software (FOSS) philosophy. For this reason, and because of the advantages of FOSS products over proprietary ones, it was decided first to explore the FOSS market for possible solutions. After analysing Greenstone, EPrints and DSpace, we decided on Greenstone. The Dublin Core metadata standard [9] was selected as the metadata schema. After developing the database hierarchy, another team of librarians started mounting PDF versions of the TIFF images in the repository. It is worth mentioning that the size of each PDF image (after conversion from TIFF and image size reduction) is about 200-250 KB.
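
To give a sense of what a simple Dublin Core description looks like, the sketch below builds a small XML record using elements from the Dublin Core element set. It is illustrative only: the `record` wrapper element and the function name are hypothetical, the field values are taken from the printing history mentioned earlier rather than from an actual FSL catalogue record, the language code is our assumption, and this is not the format in which Greenstone stores its metadata.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> ET.Element:
    """Wrap name/value pairs in dc: elements under a hypothetical <record> root."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only, not an actual catalogue entry.
record = dublin_core_record({
    "title": "Urbatagirk (Friday Book)",
    "creator": "Yakob Meghapart",
    "language": "hy",  # assumed ISO 639-1 code for Armenian
    "type": "Text",
    "format": "application/pdf",
})
print(ET.tostring(record, encoding="unicode"))
```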

Stage Three: Lessons Learnt

Our software is entirely FOSS: Greenstone as the digital repository system, Drupal as the Content Management System, EPrints for the repository of research output, and Evergreen and Koha as integrated library systems. This experience confirmed our initial estimation that FOSS solutions would prove good alternatives to commercial products. They are stable, well supported, easy to localise, and flexible to change, with well-established communities all over the world. The library will continue to advocate FOSS implementations in Armenian libraries and academic institutions.

Placing Armenian Scholarly Content Online

The experience, knowledge and skills obtained during the digitisation of Armenian rare and early printed books enabled FSL staff to initiate another digitisation project with the aim of placing the scholarly content of the National Academy of Sciences online. The Armenian National Academy of Sciences (NAS) was established in 1943 [10]. NAS, with its 29 research institutions, today publishes 16 peer-reviewed academic journals [11].

As a first step, we built our own digitising equipment based on the Canon EOS 450D camera (see Figure 3). Since this camera supports RAW and JPEG formats, we selected RAW as the output format, later converting the RAW images to TIFF files, which does create an additional step in the imaging workflow. Since the volume of scholarly periodicals from 1940 until the present day is estimated at around 2 million pages, it was decided to allocate three additional scanners, two Xerox and one HP, in order to speed up the digitisation process. Although we had already developed and tested digitisation techniques for the digital cameras, we had no experience with scanners. Another problem to be addressed was gaining an understanding of the basics and techniques of optical character recognition when working with Armenian and Cyrillic scripts. As a starting point, we found papers from the Bibliographical Center for Research, Denver [5], Yongli Zhou [12] and UNESCO [13] useful.

Figure 3: Digitising equipment constructed by FSL

Optical Character Recognition

Optical character recognition, usually referred to as OCR, is the conversion of scanned or photographed images of text to machine-readable text. OCR is widely used to convert handwritten, typewritten or printed text into electronic files, and to make scanned or photographed documents searchable and editable. TIFF images created from scanned (or photographed) documents need to pass through one additional step, the OCR process, to become searchable and editable.

Based on the findings of Yongli Zhou [12], and following our own scanning experiments, it was concluded that:

documents processed at 300 or 400 dpi (dots per inch) are sufficient for creating PDF files of good quality;
documents processed with resolutions lower than 300 dpi will not possess sufficient definition to create a viable preservation copy; the incidence of OCR errors upon conversion of TIFF objects to searchable PDF files will increase;
documents processed with resolutions higher than 400 dpi will produce very large files with relatively little improvement in quality in terms of OCR procedures.
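
The effect of resolution on image size can be worked out from the page dimensions. The sketch below assumes an A4 page (210 x 297 mm), which is our assumption and not a stated size of the NAS periodicals; the function name is hypothetical.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm: float, height_mm: float, dpi: int) -> tuple:
    """Pixel width and height of a page scanned at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


# Assumed A4 page; compare the two resolutions discussed above.
for dpi in (300, 400):
    w, h = pixel_dimensions(210, 297, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels, {w * h / 1_000_000:.1f} megapixels")
```

Because the pixel count grows with the square of the resolution, moving from 300 to 400 dpi increases the amount of image data by a factor of about 1.8, which is why higher resolutions produce much larger files.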

It was decided:

to use ABBYY FineReader 10 Professional Edition as OCR software, since articles in NAS periodicals are in Armenian, English and Russian, and FineReader supports OCR for all three scripts;
to scan all documents, as a general rule, in TIFF format at 300 dpi resolution (and to scan at 400 dpi only where the quality of the target print is low);
to create a PDF version of the text of those files after cropping out any extraneous material, e.g., white space;
to run the ABBYY FineReader OCR process on the PDF versions of articles from the Humanities & Social Sciences subject areas;
to save articles from science, technology, and medical (STM) subject areas as PDF images, without passing them through the OCR process.
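
The decisions above amount to a simple rule that could be recorded in a workflow script. The following sketch is one hypothetical way of encoding it; the class, constant and function names are ours and do not correspond to any tool used in the project.

```python
from dataclasses import dataclass

HSS = "humanities_social_sciences"
STM = "science_technology_medicine"


@dataclass(frozen=True)
class ScanPlan:
    dpi: int        # scanning resolution for the TIFF master
    run_ocr: bool   # whether the PDF is passed through OCR


def plan_for(subject_area: str, low_print_quality: bool = False) -> ScanPlan:
    """Apply the rule: 300 dpi by default, 400 dpi for poor print, OCR for HSS only."""
    if subject_area not in (HSS, STM):
        raise ValueError(f"unknown subject area: {subject_area}")
    dpi = 400 if low_print_quality else 300
    return ScanPlan(dpi=dpi, run_ocr=(subject_area == HSS))


print(plan_for(HSS))
print(plan_for(STM, low_print_quality=True))
```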

Placing Content Online: Lessons Learnt

Making Digital Content Web-friendly

ABBYY FineReader supports various options in respect of OCR, such as:

Convert to Searchable PDF Image: which permits the conversion of image files into an Adobe PDF document in the ‘Text under the page image’ mode;
Convert to Editable PDF Document: which permits the conversion of image files into an Adobe PDF document in the ‘Text over the page image’ mode; and
Convert PDF/Image to Microsoft Word

For the project Making Armenian Scholarly Content Online we selected ‘Searchable PDF Image with the Text under the page image’ mode, which we found to be best for Web presentation of our materials.

Optimising PDFs for Web Delivery

Adobe Acrobat Professional (we are currently using version 6.0) includes a number of built-in tools for optimising PDF files to make them more compact for mounting on the Web. This is very important, since Web documents must be of good quality and also as small as possible. During OCR, ABBYY FineReader offers the option ‘Compress into black & white PDF document’. After conducting a series of trials, we concluded that this option has little effect upon file size, and we decided not to use it. Adobe Acrobat Professional version 6.0 has two built-in tools: ‘Reduce File Size’ (available from the File menu) and ‘PDF Optimizer’ (available from the Advanced menu). The ‘Reduce File Size’ command had no effect on our PDF files, due to the small size of those files (none was greater than 300 KB).

In contrast, all our files pass through the ‘PDF Optimizer’ after OCR processing. We use the system default settings, which we found appropriate for our images. Sometimes we also use the ‘Audit Space Usage’ feature, accessible from the ‘PDF Optimizer’ menu. This provides a report of the total number of bytes used for document elements, including fonts, images, bookmarks, forms, and much more. The space audit results can give an idea of where you might be able to reduce file size. Detailed help on using the ‘PDF Optimizer’ can be found on Adobe’s Acrobat Users Community site [14].

Conclusion

The techniques described in this article have been tested in the Fundamental Scientific Library of the National Academy of Sciences of Armenia during the digitisation of Armenian rare and early printed books and National Academy of Sciences periodicals. All digitised materials are accessible from the Library Web page [15][16]. While the quality of English and Russian scholarly texts after optical character recognition is very high, regrettably the same is not true of the Armenian texts. Issues of quality in respect of the latter fall outside the scope of this article, but specific aspects and existing problems relating to optical character recognition of Armenian texts are described in detail in an article by Tigran Zargaryan [17]. The authors hope that the approaches described in this article will help other practitioners to create optimised-quality documents on the Web.

References

British Library: Endangered Archives Programme: Guidelines for copying archival material http://eap.bl.uk/
U.S. National Archives and Records Administration (NARA) Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files – Raster Images For the Following Record Types – Textual, Graphic Illustrations/Artwork/Originals, Maps, Plans, Oversized, Photographs, Aerial Photographs, and Objects/Artifacts June 2004 http://old.diglib.org/pubs/dlf103/dlf103.htm (accessed 8 May 2011). Upgraded site holds other formats: http://old.diglib.org/pubs/dlf103/
A RAW file is an image file that contains unprocessed data. Digital Single Lens Reflex (DSLR) cameras and some high-end scanners allow users to capture images in a RAW or native file format that is unique to each manufacturer. After processing or editing and before use, RAW files must be converted to an open standard format such as JPEG or TIFF.
Icam Archive Systems Ltd http://www.icamarchive.co.uk/
BCR’s CDP Digital Imaging Best Practices Working Group, “BCR’s CDP Digital Imaging Best Practices, Version 2.0”, June 2008 http://mwdl.org/public/mwdl/digital-imaging-bp_2.0.pdf
JISC Digital Media (formerly known as TASI): Still Images: Colour Theory: Understanding and Modelling Colour http://www.jiscdigitalmedia.ac.uk/stillimages/advice/colour-theory-understanding-and-modelling-colour/
The collections are available from the library page http://www.flib.sci.am/eng/node/3 by following the links: ‘The Armenian Book in 1512-1800’, ‘The Armenian Book in 1801-1850’. Two repositories, ‘The Armenian Book in 1851-1900’ and ‘The Armenian Book in 1901-1920’, are in the process of compilation.
EIFL: Enabling access to knowledge in developing and transition countries http://www.eifl.net/ (Editor’s note: Readers may be interested to note that staff of EIFL have contributed to this issue.)
DCMI Home: Dublin Core Metadata Initiative (DCMI) http://dublincore.org/
The first academic journal published in NAS (at that time the Academy was named as the Armenian Branch of the Academy of Sciences of the USSR) was Bulletin of the Armenian branch of the Academy of Sciences of the USSR. Date of establishment 1940. The list of NAS academic journals, published since 1940, can be found at: http://www.flib.sci.am/eng/About%20NAS%20Journals/About%20NAS%20Journals.html
Astrophysics (established 1965), Reports of the National Academy of Sciences, (1944), Proceedings of the National Academy of Sciences – Earth Sciences series (1948), Proceedings of the National Academy of Sciences – Mathematics series (1966), Proceedings of the National Academy of Sciences – Mechanics series (1966), Reports of the National Academy of Sciences and the State Engineering University of Armenia – Technical Sciences series (1948), Proceedings of the National Academy of Sciences – Physics series (1966), the Herald of Social Sciences (1966), Medical Science of Armenia (1961), Biological Journal of Armenia (1948), Chemical Journal of Armenia (1957), Historical and Philological Journal (1958), Neurochemistry (1982). From 2003 NAS has published New Electronic Journal of Natural Sciences, and from 2008 two electronic Open Access journals: Armenian Journal of Mathematics and Armenian Journal of Physics. See also Tigran Zargaryan, Alan Hopkinson: Scientific publishing in Armenia. European Science Editing, Vol. 35 (2) May 2009.
Yongli Zhou. Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible. Information Technology and Libraries, Vol. 29, Number 3, September 2010, pp. 151-159 https://www.ala.org/ala/mgrps/divs/lita/publications/ital/29/3/index.cfm
Preserving our Documentary Heritage. UNESCO, Paris 2005 http://portal.unesco.org/ci/en//ev.php-URL_ID=19440&URL_DO=DO_TOPIC&URL_SECTION=201.html
Duff Johnson. “Understanding Acrobat’s Optimizer”. 31 July 2009, AcrobatUsers.com http://acrobatusers.com/tutorials/understanding-acrobats-optimizer
Fundamental Scientific Library of the National Academy of Sciences (NAS) http://www.flib.sci.am/eng/node/1
Fundamental Scientific Library of the NAS: Knowledge@FSL http://www.flib.sci.am/eng/node/3
Tigran Zargaryan. Specific aspects of digitization of paper materials based on the experience of the works carried out in the Fundamental Scientific Library. In the World of Science, #1, 2011, pp. 21-27, (in Armenian). ISSN 1829-0345.

Author Details

Alan Hopkinson
Technical Manager (Library Services)
Middlesex University
London

Email: a.hopkinson@mdx.ac.uk
Web site: http://www.mdx.ac.uk/

Alan Hopkinson works primarily in a consultancy role leading externally funded projects at Middlesex University Learning Resources. He is currently managing a project to modernise library and information science teaching in Armenia, Georgia and Uzbekistan and leading Middlesex input in improving information literacy in the Balkans as well as modernising library IT infrastructure in Serbia, Montenegro, and Bosnia and Herzegovina. He is a member of IFLA’s Committee on Standards established in 2012.

Tigran Zargaryan
Director
National Library of Armenia

Email: tigran@flib.sci.am
Web site: http://www.flib.sci.am/eng/

Tigran Zargaryan received a Masters in Computer Science from Yerevan State Engineering University (Armenia) and a Ph.D. in Information Science from Moscow Historical Archives Institute. He is currently Director of the National Library of Armenia, Scientific Adviser of the Fundamental Scientific Library of NAS, and Dean of the Library School of the International Scientific Educational Centre of NAS.