Collection of webcasts on OCR for historical texts

Improvements to ABBYY's technology for IMPACT

  • The recorded presentation explains the improvements that ABBYY has made to its OCR technology within the IMPACT project.

IMPACT Workshop: 03. ABBYY and IMPACT OCR Introduction and Improvements by Michael Fuchs
from LITIS Laboratory on Vimeo.

OCR in libraries – some practical remarks

  • Event: IMPACT, Bratislava, May 2010
  • Speaker: Günter Mühlberger, Universität Innsbruck

OCR in libraries – some practical remarks - Günter Mühlberger (UIBK) from IMPACT Project on Vimeo.

Source: http://impactocr.wordpress.com/page/4/ (the page may change over time – in that case, simply browse back to the older content)

Next, Günter Mühlberger of Universitäts- und Landesbibliothek Tirol (University and Regional Library Tyrol) talks about the situation with regard to full-text digitisation in Europe and elsewhere. He notes that in Europe, OCR has not always been a standard part of the workflow, so there is much legacy digital material that has never been OCR’d – and, more importantly, was not created with OCR in mind. He notes, however, that Americans have been more proactive in this field, citing JSTOR and Google Books as examples.

There are a few reasons for this, among them the relative unsophistication of even fairly recent OCR systems, but also that a project that involves OCR breeds complications. Put simply: once you’ve made your OCR, what do you do with it? How do you assure its quality? How do you expose it to the public, if at all? Günter gives an example from Austrian Literature Online.

He then outlines the pros and cons of the three major sources of text-based digital material: bound volumes, microfilms and loose folios. All can be done well, but digitisation managers need to focus on producing not just a “good” image, but an image that’s good for OCR.

The characteristics of a good image for OCR are overall sharpness, distinct fonts, a clear background, and a complete shot with a white frame around the sides for OCR orientation. All lines should be parallel to each other and to the page margins, and there should be no additional noise (such as marginalia) from users. Not all of these things are easy to control, but the presence of problematic features can be used to guide the selection of material for digitisation.

Some general recommendations: if it’s a modern, clean document, it can be captured at 300dpi as a bitonal JPEG; if it’s an older document with problematic features, then greyscale 400ppi TIFF is preferable. Günter concludes by saying that OCR is a must for a digital library text collection.
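Günter's rule of thumb could be sketched as a small decision function. This is a hypothetical helper for illustration only – the function name and return structure are assumptions, not anything presented in the talk:

```python
def scan_settings(is_modern: bool, is_clean: bool) -> dict:
    """Pick capture settings following the rule of thumb above.

    A modern, clean document can be captured bitonally at 300 dpi;
    anything older, or with problematic features, is better captured
    as greyscale 400 ppi TIFF.
    """
    if is_modern and is_clean:
        return {"resolution_dpi": 300, "depth": "bitonal", "format": "JPEG"}
    return {"resolution_dpi": 400, "depth": "greyscale", "format": "TIFF"}
```

In a real digitisation workflow the decision would of course also weigh factors such as paper condition, ink bleed-through and intended preservation use, not just age and cleanliness.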
Niall Anderson, The British Library + Mark-Oliver Fischer, Bavarian State Library

Optical Character Recognition – introduction and overview

  • Event: IMPACT, Bratislava, May 2010
  • Speaker: Michael Fuchs, ABBYY Europe

Optical Character Recognition (OCR) – introduction & overview - Michael Fuchs (Abbyy) from IMPACT Project on Vimeo.

Michael Fuchs of ABBYY starts by explaining the work of his company within IMPACT: ABBYY provides the other partners with access to its FineReader SDK, and uses the experiments and digital text material within IMPACT to hone and test its own products and technology. He identifies Fraktur/Gothic script as being a particular difficulty for state-of-the-art OCR engines.

Michael explains the difficulties with recognition technology with reference to Captcha sites: the text strings in Captcha windows look remarkably like some historical documents – exhibiting such characteristics as warp, curl, different fonts and Gothic script. The irony is that Captchas are designed to be machine-unreadable, and yet these are exactly the sorts of characters for which ABBYY and IMPACT would like to improve OCR recognition.

He goes on to explain how some of these characteristics come to be exhibited in historic text documents: bad scanning, preprocessing of an image, generic binarisation, colour artefacts, etc. He then explains how ABBYY technology attempts to get around these problems: through adaptive (image-sensitive) binarisation, structural analysis of a document image, and character classification of different types – including the ability to “train” the OCR engine for particular types of language and font.
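The core idea behind adaptive binarisation – choosing a threshold per pixel from its surroundings rather than one fixed global threshold – can be illustrated with a toy local-mean filter. ABBYY's actual algorithm is proprietary and far more sophisticated; this is only a minimal sketch of the principle:

```python
def adaptive_binarise(pixels, window=3, bias=0):
    """Binarise a greyscale image (a list of rows of 0-255 values) by
    comparing each pixel to the mean of its local window.

    Because the threshold follows the local background, dark text is
    still separated correctly even where the page is unevenly lit –
    the key advantage over a single global threshold.
    """
    h, w = len(pixels), len(pixels[0])
    r = window // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gather the local neighbourhood, clamped at the image borders.
            vals = [pixels[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            # 1 = background (light), 0 = foreground (dark text).
            row.append(1 if pixels[y][x] > mean - bias else 0)
        out.append(row)
    return out
```

On a 3×3 patch with a single dark pixel on a light background, the dark pixel falls below its local mean and is classed as foreground while the surrounding light pixels stay background.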

To conclude, Michael identifies five key areas in which OCR needs to improve:

  • better context-sensitive binarisation and preprocessing;
  • more comprehensive document analysis, perhaps focussing on optimised character patterns;
  • “adaptive” OCR and the creation of special dictionaries;
  • better validation and correction systems, including the mass verification of OCR results;
  • better document synthesis and export, relying on a standard like XML as a language in which mistakes and misrecognitions can be analysed.
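The last point – XML export in which misrecognitions remain analysable – hinges on keeping per-word recognition confidence alongside the text. A minimal sketch of that idea (the element and attribute names here are simplified assumptions, not any particular schema such as ALTO):

```python
import xml.etree.ElementTree as ET


def ocr_result_to_xml(words):
    """Serialise OCR word results as XML, preserving each word's
    recognition confidence so that downstream tools can flag likely
    misrecognitions for validation and correction.

    `words` is an iterable of (text, confidence) pairs; the element
    names are illustrative only.
    """
    page = ET.Element("page")
    for text, confidence in words:
        word = ET.SubElement(page, "word", confidence=f"{confidence:.2f}")
        word.text = text
    return ET.tostring(page, encoding="unicode")
```

A correction interface could then sort or highlight words whose `confidence` falls below a chosen cut-off, which is exactly the kind of mass verification the talk calls for.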

Niall Anderson, The British Library + Mark-Oliver Fischer, Bavarian State Library
