OCR Challenges on Historic Documents

Below is a short list of reasons why OCR on historic documents is a real challenge.

Image Quality

  • Old documents are hard to scan, but good scan quality is important for good OCR results. Problems that you may encounter:
    • Curled paper
    • Pages are stuck together
    • Weird layouts
    • Curved lines of text when a book has to be handled carefully and cannot be pressed flat

Layout detection

  • Historic books/documents often have a different layout structure than modern ones.
    Accordingly, algorithms that were designed for “modern” layouts might not deliver proper results on these layouts.
  • Old newspapers can also be very tricky :-(
    • Small fonts
    • Complex layouts
    • Reading order
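
One simple way to approximate reading order on a multi-column page is to group text blocks into columns by their horizontal position and then read each column from top to bottom. The sketch below assumes that layout detection has already produced block positions. The `Block` type, the `reading_order` function and the `column_gap` threshold are hypothetical names chosen for illustration; real newspaper pages with headlines spanning several columns usually need something more robust.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def reading_order(blocks, column_gap=100):
    """Order blocks column by column, top to bottom within each column.

    Assumption: a block belongs to the current column if its left edge is
    less than `column_gap` pixels to the right of that column's first block.
    """
    columns = []  # one list of blocks per detected column
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and block.x - columns[-1][0].x < column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered


# Four blocks from an imagined two-column page, given in arbitrary order.
blocks = [
    Block("right column, top", x=620, y=50),
    Block("left column, bottom", x=45, y=400),
    Block("left column, top", x=40, y=50),
    Block("right column, bottom", x=615, y=420),
]
for block in reading_order(blocks):
    print(block.text)
```

The fixed pixel threshold is the weak point of this heuristic: it has to be tuned per scan resolution and fails on irregular layouts, which is one reason reading order remains hard for old newspapers.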

Typefaces Used

  • Old typefaces are used; character recognisers built for modern fonts cannot reliably read Gothic/Fraktur type
  • The quality of the characters to be recognised is often very poor
    • Broken characters
    • Characters mixed with noise, dirt, or handwriting
  • There are characters in old documents that are not available in modern computer fonts
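
Some of these historic characters do have Unicode code points even where common fonts lack a glyph for them. As a minimal sketch, the snippet below uses Python's standard `unicodedata` module to fold the long s (ſ, U+017F) and the "fi" ligature (U+FB01) to their modern equivalents, for example to make OCR output searchable. The helper name `modernize` is an assumption for illustration, and whether such folding is appropriate depends on the transcription guidelines of the project.

```python
import unicodedata


def modernize(text):
    """Fold compatibility characters (long s, ligatures) to modern letters."""
    return unicodedata.normalize("NFKC", text)


sample = "Wa\u017f\u017fer"  # "Waſſer", as it might be transcribed from Fraktur
print(unicodedata.name("\u017f"))
print(sample, "->", modernize(sample))
```

Note that this folding is lossy: once the text is normalised, the original historic characters can no longer be recovered, so it is safer to keep the unmodified transcription alongside the folded form.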

Language Issues

  • Historically, spelling was not standardised, so many spelling variants of the same word exist
  • There are no historic dictionaries available
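
Where no period dictionary is at hand, one possible workaround is to map historic spellings onto a modern word list by string similarity. The sketch below uses `difflib.get_close_matches` from Python's standard library. The tiny `MODERN_WORDS` list, the `closest_modern_form` helper and the 0.7 cutoff are illustrative assumptions, not a tested configuration; the German examples ("Theil" for "Teil", "Thür" for "Tür") are older spellings of the kind found in Fraktur prints.

```python
import difflib

# Hypothetical stand-in for a modern dictionary.
MODERN_WORDS = ["teil", "tür", "jahr", "stadt"]


def closest_modern_form(word, cutoff=0.7):
    """Return the most similar modern word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        word.lower(), MODERN_WORDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


for historic in ["Theil", "Thür", "Xyz"]:
    print(historic, "->", closest_modern_form(historic))
```

Pure string similarity will also match unrelated words that happen to look alike, so in practice the cutoff needs tuning and the suggestions should be treated as candidates rather than corrections.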

Further Information:

Sources: Images were taken from the linked presentations
