IMPACT Conference: Optical Character Recognition in Mass Digitisation

Michael Day; Lieke Ploeger; Yola Park; Jeanna Nikolov-Ramirez Gaviria; Clemens Neudecker; Fedor Bochow

Lieke Ploeger, Yola Park, Jeanna Nikolov-Ramirez Gaviria, Clemens Neudecker, Fedor Bochow and Michael Day report from the first IMPACT Conference, held in The Hague, Netherlands on 6-7 April, 2009.

The first conference of the IMPACT (Improving Access to Text) Project was held at the Koninklijke Bibliotheek, National Library of the Netherlands (KB) in The Hague on 6 and 7 April 2009. A total of 136 participants from over 30 countries attended. The main focus of the event was on Optical Character Recognition (OCR) technologies and their use in supporting the large-scale digitisation of historical text resources. It was also an opportunity to introduce the IMPACT Project to a wider audience and to describe some of its initial results.

IMPACT is a European project that aims to speed up the process of and enhance the quality of mass text digitisation in Europe. The IMPACT research programme aims to improve significantly digital access to historical printed text through the development and use of innovative Optical Character Recognition software and linguistic technologies. IMPACT will also help to build capacity in mass digitisation across Europe. This first conference, focused on current and future challenges for OCR technology in mass digitisation projects, was intended as a means of exchanging views with other researchers and commercial providers in the OCR field, as well as presenting some preliminary results from the first year of the IMPACT Project.

A more detailed report from this conference together with links to presentation slides will be made available from the IMPACT Web site [1].

Monday, 6 April 2009

The conference was formally opened by Hans Jansen, Director of e-Strategy at the KB and chair of the IMPACT General Assembly. After welcoming all participants he expressed his appreciation of the KB being able to host a valuable opportunity for experts from all over the world - representing libraries, research institutes, software suppliers and service providers - to meet and discuss the challenges of mass digitisation and their possible solutions.

IMPACT Contexts

Patricia Manson, head of the European Commission’s Cultural Heritage and Technology Enhanced Learning Unit (part of the Information Society and Media Directorate General), gave the opening presentation entitled Digitisation of Cultural Resources: European Actions and the Context of IMPACT. In this, Manson located the IMPACT Project within the general context of European policies and strategies in the digitisation domain. She mentioned that the digitisation research that IMPACT was undertaking represented just one part of an integrated set of activities focused on implementing the European Commission’s i2010 Digital Libraries Initiative [2], which plans to transform Europe’s printed heritage into digitally available resources. The challenges involved in this initiative, however, are numerous. Apart from the key issues of copyright and intellectual property rights, there remains the need to improve the cost-effectiveness of digitisation through the development of improved technologies and tools, and to expand competence in digitisation across Europe’s cultural institutions. Manson argued that there was a call for (virtual) centres of competence that would aim to exploit the results of research and to leverage national and other initiatives. It was envisaged that IMPACT would become a centre of competence for the digitisation of printed textual material.

Hildelies Balk, head of the European Projects section within the KB’s Department of Research and Development and IMPACT project manager, then provided a short introduction to the IMPACT Project. The presentation began by describing the project’s background: the technical and strategic challenges that are still holding back the mass digitisation of historical printed text. She commented that the current state-of-the-art in Optical Character Recognition does not produce satisfactory results for historical documents, while there is also a lack of institutional knowledge and expertise, causing inefficiency and duplication of effort. The libraries, universities, research centres and industry partners that make up the IMPACT consortium hope to contribute to solving this by: innovating OCR software and language technology; sharing expertise and building capacity across Europe; and ensuring that tools and services will be sustained after the end of the project. All IMPACT tools and services are focused on reducing effort and enhancing the speed and output of mass digitisation programmes, and are firmly grounded in the needs of the libraries. After a short overview of these IMPACT tools, Dr Balk concluded with the vision that, from 2012 onwards, the project would form a sustainable centre of competence for the mass digitisation of historical text in Europe. She commented that this centre would exist for as long as it is necessary to fulfil the ultimate aim: “All of Europe’s historical text digitised in a form that is accessible, on a par with born digital documents.” She noted that a parallel session on the second day of the conference would focus in more detail on the nature of the centre of competence concept.

Library Challenges for Mass Digitisation

Astrid Verheusen, head of the Digitisation Department at the Koninklijke Bibliotheek, gave an overview of mass digitisation challenges from a library perspective. After a short description of past digitisation efforts at the KB, which were primarily small-scale projects focused on the production of visually attractive images, Verheusen argued that the mass digitisation of textual content requires a different approach. The large-scale nature of mass digitisation means that libraries have to deal with significant challenges, e.g. costs, heavy demands on technical and organisational infrastructure, and problems with the insufficient quality of OCR results. Library initiatives are part of a wider framework of digitisation programmes (including Google Book Search or the Internet Archive), but it was argued that libraries have an important role to play with regard to issues like quality, completeness, long-term preservation, free availability and copyright. She maintained, therefore, that the main challenges libraries will need to overcome include making the digitisation process more efficient (e.g., through automated processes) and improving the quality of mass digitisation - in which the IMPACT Project promises to play an important role. Finally, Verheusen described some possible solutions currently being implemented at the KB in its development of its digital library. These include: a focus on digitising complete collections rather than selected resources; the use of the JPEG2000 format to save storage space [3]; the development of automated quality assurance tools; and increased cooperation between the different digitisation projects at the KB to prevent the duplication of effort. 
Verheusen commented that the KB hopes that IMPACT will provide tools to improve the quality of mass digitisation of historical text and that the project will also help libraries and other cultural heritage institutions to share knowledge through such means as guidelines and training, a helpdesk and Web site.

IMPACT Tools (Part 1)

This afternoon session started with three short presentations on the improvement of OCR, based on results from the first year of the IMPACT Project. First, Asaf Tzadok of IBM Haifa Research Laboratory introduced the concept of Adaptive OCR. He noted that the current generation of commercial OCR engines focuses primarily on the recognition of diverse materials using modern fonts, while collections of historical texts usually involve relatively large bodies of homogeneous material using older font types. The principle underlying Adaptive OCR is that it may be possible to improve OCR capabilities by creating an OCR engine that would ‘tune’ itself to each work being processed. The IMPACT Adaptive OCR engine works on large bodies of homogeneous material, and adapts itself to the specific font, dictionaries and words in a given body of text. In addition, a ‘Collaborative Correction’ module groups together ‘suspicious’ characters and words that users are able to correct through a web-based application. The Adaptive OCR engine will then receive the results of this manual correction and use them in order to improve the recognition rates even further.
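The grouping idea behind collaborative correction can be illustrated with a minimal Python sketch. It is not the IMPACT module: the per-word confidence scores, the 0.8 threshold and the function names group_suspicious and apply_corrections are all assumptions made for illustration. The sketch groups low-confidence words by their recognised form, so that one user correction can be applied to every occurrence in the group.

```python
from collections import defaultdict

def group_suspicious(tokens, threshold=0.8):
    """Group positions of low-confidence words by their recognised text.

    tokens is a list of (text, confidence) pairs; the confidence scale
    and the threshold are illustrative assumptions.
    """
    groups = defaultdict(list)
    for index, (text, confidence) in enumerate(tokens):
        if confidence < threshold:
            groups[text].append(index)
    return dict(groups)

def apply_corrections(tokens, groups, corrections):
    """Apply one user correction per group to every occurrence in it."""
    corrected = [text for text, _ in tokens]
    for text, replacement in corrections.items():
        for index in groups.get(text, []):
            corrected[index] = replacement
    return corrected

# Hypothetical OCR output in which the long s has been read as "f".
tokens = [("the", 0.99), ("fame", 0.55), ("houfe", 0.40),
          ("fame", 0.60), ("house", 0.95)]
groups = group_suspicious(tokens)
corrected = apply_corrections(tokens, groups,
                              {"fame": "same", "houfe": "house"})
```

In this sketch two user decisions repair three words. In a real system the corrections would presumably also be fed back to the recognition engine, which is beyond the scope of the example.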

Basilis Gatos, a researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, then gave a brief overview of the IMPACT enhancement and segmentation platform. This helps the user to evaluate current state-of-the-art techniques for enhancement and segmentation, as well as integrating several new IMPACT toolkits - first versions of the toolkits designed for dewarping, border removal and character segmentation. The user can select a methodology for each enhancement / segmentation stage and produce not only the result of either enhancement or segmentation but also results from the intermediate stages. In the future this platform will also be used to test the portability of the new IMPACT toolkits. Dr Gatos concluded with a short demonstration of the platform.

Klaus Schulz, technical director of the Centrum für Informations- und Sprachverarbeitung (CIS) of the University of Munich, gave the final short presentation on Language Technology for Improving OCR on Historical Texts. Professor Schulz discussed how language technology can help to improve or correct OCR results on historical texts. He commented that traditional techniques for OCR correction were not suitable for historical text, because they use dictionaries that do not contain the large number of spelling variants present in such texts. Professor Schulz then demonstrated how the use of special historical language dictionaries in OCR engines affects OCR quality. For example, combining a modern and a virtual dictionary could reduce the word error rate for 18th century texts by 42%. Another option for improving OCR results would be to use knowledge of document and OCR behaviour. When “profiling” the OCR output in a fully automated way, the profiles are intended to capture the base language, special vocabulary and typical spelling variants of the underlying text, as well as typical OCR errors found in the output. Profiles like this could then be used for post correction or for improving OCR output on a second run. For example, knowledge of typical historical spelling variants in a document could lead to a better selection of dictionaries. Although this method is quite ambitious, Professor Schulz argued that profiling OCR output in terms of vocabulary, language variants and OCR errors seems a good basis for further improving OCR results.
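To make the dictionary idea concrete, the following Python sketch shows one way a "virtual" dictionary might be derived from a modern one and how it changes lexical coverage. It is an illustration under stated assumptions, not the CIS method: the tiny lexicon, the three rewrite rules and the function names are hypothetical, and each variant is produced by applying a single rule at a single position.

```python
# Hypothetical modern lexicon and modern-to-historical rewrite rules.
MODERN_LEXICON = {"time", "unto", "love", "is"}
RULES = [("i", "y"), ("u", "v"), ("v", "u")]

def virtual_variants(word, rules):
    """Return variants made by applying one rule at one position."""
    variants = set()
    for modern, historical in rules:
        pos = word.find(modern)
        while pos != -1:
            variants.add(word[:pos] + historical + word[pos + len(modern):])
            pos = word.find(modern, pos + 1)
    return variants

def build_combined_lexicon(lexicon, rules):
    """Union of the modern lexicon and its generated virtual variants."""
    combined = set(lexicon)
    for word in lexicon:
        combined |= virtual_variants(word, rules)
    return combined

def coverage(tokens, lexicon):
    """Fraction of tokens that are found in the lexicon."""
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)

# Hypothetical tokens from a historical text; "xqz" stands for OCR noise.
tokens = ["tyme", "vnto", "loue", "is", "xqz"]
combined = build_combined_lexicon(MODERN_LEXICON, RULES)
modern_coverage = coverage(tokens, MODERN_LEXICON)
combined_coverage = coverage(tokens, combined)
```

With these toy inputs the modern lexicon recognises one token in five and the combined lexicon four in five. The figures are artefacts of the example and say nothing about the 42% reduction reported in the talk.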

OCR Accuracy in a Newspaper Archive

Simon Tanner, Director of King’s Digital Consultancy Services (KDCS) in the Centre for Computing in the Humanities (CCH) at King’s College London, gave a presentation on Measuring the OCR Accuracy across The British Library Two Million Page Newspaper Archive. This presentation focused on the methodology developed by KDCS and Digital Divide Data (DDD) for evaluating OCR accuracy, based on the actual XML output in relation to the original images. Since character accuracy rates alone say little with any certainty about word accuracy, this method also takes word accuracy and significant word accuracy into consideration. In this way it also becomes possible to assess the functionality that the OCR output will support, such as search accuracy and the ability to structure searches and results. Tanner demonstrated this method with examples from the OCR accuracy study of the British Library’s 19th Century Newspaper Project. The study selected pages from the newspaper archive, and two sections of each page were double re-keyed. This text was then compared to the XML text generated by OCR, and the results analysed with the metrics created by KDCS. Two-thirds of the newspaper content turned out to have a character accuracy of over 80%. However, only half of the titles had such a figure for word accuracy, and only a quarter of the titles had over 80% significant word accuracy. In response to a question about the acceptable word accuracy rate, Tanner replied that, as a rule of thumb, a word accuracy rate above 80% would be considered acceptable. At that level, most ‘fuzzy’ search engines would be able to fill in the gaps sufficiently (or find related words), meaning that a high search accuracy (>95-98%) would still be possible. By comparing the number of repeated significant words and then measuring the accuracy against the OCR results, it would thus be practicable to assess the search accuracy effectively. 
Tanner argued that this method could be utilised for assessing OCR performance and thus as a means of making better OCR decisions.
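The relationship between the three accuracy levels can be sketched in Python. This is a simplified illustration, not the KDCS metrics: it assumes that accuracy is one minus the edit distance divided by the length of the re-keyed reference, and that "significant" words are simply those outside a small, hypothetical stopword list.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def accuracy(reference, ocr):
    """1 - (edit distance / reference length), floored at zero."""
    if not reference:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - levenshtein(reference, ocr) / len(reference))

# Hypothetical stopword list; a real study would define significance itself.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}

def ocr_accuracy_report(rekeyed, ocr_text):
    """Character, word and significant word accuracy for one text sample."""
    ref_words = rekeyed.lower().split()
    ocr_words = ocr_text.lower().split()
    sig_ref = [w for w in ref_words if w not in STOPWORDS]
    sig_ocr = [w for w in ocr_words if w not in STOPWORDS]
    return {
        "character": accuracy(rekeyed, ocr_text),
        "word": accuracy(ref_words, ocr_words),
        "significant_word": accuracy(sig_ref, sig_ocr),
    }

# Hypothetical re-keyed text and OCR output for the same passage.
report = ocr_accuracy_report("the mayor of london opened the bridge",
                             "the mayor of lonclon opened the bridqe")
```

In this toy sample a handful of character errors falls on the two words most likely to be searched for, so character accuracy stays above 90% while significant word accuracy drops to 50%. This mirrors the pattern reported for the newspaper archive, where fewer titles passed the 80% mark at each successive level.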

Panel Discussion on OCR Challenges

The final session of the day was a panel discussion on OCR Challenges, moderated by Günter Mühlberger, Head of the Department for Digitisation and Digital Preservation at Innsbruck University Library. Panellists included all of the speakers from the preceding afternoon sessions (Tzadok, Gatos, Schulz, Tanner), who were joined by Astrid Verheusen, Claus Gravenhorst of CCS Content Conversion Specialists GmbH, and Jupp Stoepetie of ABBYY. The discussion was very lively and wide-ranging and included topics such as: the question of outsourcing or digitising in-house; the problem of ‘legacy’ content; benchmarks for OCR accuracy; lessons from the digitisation of non-European character sets; the role of greyscale scanning; and the need for digitisation tools like those being developed by IMPACT to be interoperable. A fuller transcription of the discussion will be available in the conference proceedings.

Tuesday, 7 April 2009

The second day of the conference started with a short summary of the conference so far by Aly Conteh of the British Library. The morning sessions were based on two talks from external experts.

Collaborative OCR Text Enhancement at the National Library of Australia

The opening presentation was given by Rose Holley, manager of the Australian Newspaper Digitisation Program at the National Library of Australia (NLA). Her presentation, entitled Many Hands Make Light Work: Collaborative OCR Text Correction in Australian Historic Newspapers, focused on the collaborative Australian Newspaper Digitisation Program. This intends to create a service that will provide free online access to Australian newspapers, from the first newspaper published in Australia in 1803 through to the end of 1954, including full text searchability [4]. In July 2008, 360,000 pages were made available to the public in a beta version of an interface that permits collaborative OCR correction. After a short demonstration, Holley described the initial outcomes of the online correction system. She noted that, with almost no publicity, a large user-base had already formed. It appeared, therefore, that end-users are potentially very interested in correcting text, whether for helping to improve the record of Australian history or to support genealogical research. While there had been no moderation (so far) by programme staff, no vandalism of text had yet been observed. Holley observed that giving users a high level of trust has resulted in commitment and loyalty. For example, accidental mistakes are often quickly corrected by other users. A study of user feedback showed that the high level of trust invested in users was one of the main motivating factors, along with the user’s own short and long term goals (for example family history), and the focus on the outcome of improving the record of Australian history. Holley noted that the collaborative OCR text correction system has shifted some of the power and control over data - which is traditionally held by libraries or other cultural heritage organisations - back to the community, and that this had turned out surprisingly well. 
The challenge the National Library of Australia now faces is to sustain this virtual community in the future.

Technical OCR Challenges

The second presentation was an assessment of future OCR challenges given by Claus Gravenhorst, Director Strategic Initiatives at CCS Content Conversion Specialists GmbH. The presentation started with a short description of the history of OCR technology from the Kurzweil age onwards. He commented that despite the amount of research already undertaken and the fact that 21st century OCR has reached quite a high level of character recognition, there is still a considerable lag behind human performance; for example, in the familiar and problematic areas of image quality (e.g., because of unsatisfactory scanning methods or poor print quality), difficult layouts, the use of historic fonts, etc. He argued that an interesting approach to solving these problems might be to look at complete documents, instead of single pages, and to analyse structural information for improving OCR results. Such structural information could also help to recognise logical entities automatically such as headings, captions, etc. Since some of these entities tend to be more important for retrieval than running text, OCR correction could then focus principal attention on these elements. In the era of mass digitisation, a ‘next-level OCR’ such as this is urgently needed to increase digitisation speeds and help to lower hardware costs.

IMPACT Tools (Part 2)

The final afternoon of the conference began with three short talks on selected achievements of IMPACT from the first year of the project. The first presentation, by Apostolos Antonacopoulos of the Pattern Recognition and Image Analysis (PRImA) research group at the University of Salford, introduced IMPACT work undertaken on Digital Restoration and Layout Analysis. The presentation outlined methods explored by IMPACT for solving problems like geometric correction, border removal and binarisation. He showed practical examples of these problems from scans of historical text and suggested possibilities for improvement. In addition, one of the aims of IMPACT is to create and maintain a common baseline for evaluating different approaches to mass digitisation. For this, IMPACT will create a common dataset, with ground truth at various levels, that is representative both of the library collections and of mass digitisation challenges. It will also define evaluation metrics and scenarios with the appropriate tools to implement them.

Katrien Depuydt, head of the language database department at the Institute for Dutch Lexicology (INL) in Leiden, gave the next presentation on Historical Lexicon Building and How it Improves Access to Text. She outlined IMPACT work on overcoming historical language barriers by building historical lexica. These lexica are intended to supplement the basic OCR word lookup for specific use with historical texts. IMPACT will deliver a set of tools for the efficient production of such lexica with guidelines. After some discussion of linguistic issues in building computational lexica of historical language, Depuydt provided some insight into the lexicon building process by showing the IMPACT attestation tool, which is used to verify attestations in historical dictionaries. Finally, she outlined how computational lexica of historical language can overcome the historical language barrier in retrieval.

Finally, Neil Fitzgerald, IMPACT Delivery Manager at The British Library, gave a short overview of the first iteration of the IMPACT decision support tools, a collection of documents providing digitisation workflow support and guidance. They are based on the real-world experience of project partners. The decision support tools are intended to help direct project learning through a variety of means to the wider community. Fitzgerald argued that it was IMPACT’s vision to focus on support for practical implementation, both for the in-house and contracted-out elements of digitisation. This is why the decision support tools contain case studies from the various project partners which deal with common issues such as complex layouts and gothic fonts, as well as links to appropriate external resources. In later iterations, feedback from the wider cultural heritage communities will also be included.

Parallel Sessions

The afternoon concluded with three parallel sessions. The first was an open discussion of IMPACT as a centre of competence. Hildelies Balk gave a general presentation and Aly Conteh moderated a wide-ranging discussion that demonstrated that there was a great deal of interest in such a concept. The second parallel session focused on technical matters, with Apostolos Antonacopoulos and Stefan Pletschacher of the University of Salford moderating a discussion on Challenges and Opportunities in Mass Digitisation: How Technology Can Meet Libraries’ Needs. The final option was a guided tour of digitisation activities at the KB led by Edwin Klijn. More detailed reports from the two discussion sessions will also be made available from the IMPACT Project Web site.

Conclusions

Aly Conteh and Hildelies Balk concluded the day with a final summary, giving thanks to certain key people involved in its organisation. The conference was a great opportunity for the IMPACT Project to engage with stakeholders from cultural heritage organisations, commercial digitisation providers, and the research community. The mix of presentations and discussion was a useful way for project participants to share the current state of the art with interested parties, and for those parties to discover what IMPACT was doing. It was also a useful opportunity to remind delegates of the digitisation challenges that remain to be solved.

References

[1] IMPACT (Improving Access to Text)
http://www.impact-project.eu
[2] i2010 Digital Libraries Initiative
http://ec.europa.eu/information_society/activities/digital_libraries/index_en.htm
[3] Robèrt Gillesse, Judith Rog, Astrid Verheusen, Alternative File Formats for Storing Master Images of Digitisation Projects, v. 2.0. Den Haag: Koninklijke Bibliotheek, March 2008.
http://www.kb.nl/hrd/dd/dd_links_en_publicaties/links_en_publicaties_intro.html
[4] Rose Holley, “How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs.” D-Lib Magazine, Vol. 15, no. 3/4, March/April 2009.
http://www.dlib.org/dlib/march09/holley/03holley.html

Author Details

Lieke Ploeger
Research and Development Department
Koninklijke Bibliotheek (National Library of the Netherlands)
The Hague
Netherlands

Email: Lieke.Ploeger@KB.nl
Web site: http://www.kb.nl/

Yola Park
Research and Development Department
Koninklijke Bibliotheek (National Library of the Netherlands)
The Hague
Netherlands

Email: Yola.Park@KB.nl
Web site: http://www.kb.nl/

Jeanna Nikolov-Ramírez Gaviria
Abteilung für Forschung und Entwicklung
Österreichische Nationalbibliothek (Austrian National Library)
Vienna
Austria

Email: jeanna.nikolov@onb.ac.at
Web site: http://www.onb.ac.at/

Clemens Neudecker
Münchener Digitalisierungszentrum (MDZ)
Bayerischen Staatsbibliothek (Bavarian State Library)
Munich
Germany

Email: neudecker@bsb-muenchen.de
Web site: http://www.bsb-muenchen.de/

Fedor Bochow
Münchener Digitalisierungszentrum (MDZ)
Bayerischen Staatsbibliothek (Bavarian State Library)
Munich
Germany

Email: Fedor.Bochow@bsb-muenchen.de
Web site: http://www.bsb-muenchen.de/

Michael Day
UKOLN, University of Bath
United Kingdom

Email: m.day@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/
