OCR Challenges on Historic Documents

Below is a short list of reasons why OCR on historic documents is a real challenge.

Image Quality

  • Old documents are hard to scan, yet good scan quality is essential for good OCR results. Problems you may encounter:
    • Curled paper
    • Pages are stuck together
    • Weird layouts
    • Curved lines of text when a fragile book has to be handled carefully and cannot be pressed flat for scanning

Layout Detection

  • Historic books/documents often have a different layout structure than modern ones.
    Accordingly, algorithms that were designed for “modern” layouts might not deliver proper results on these layouts
  • Old newspapers can also be very tricky :-(
    • Small fonts
    • Complex layouts
    • Reading order
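
To illustrate why reading order is hard, here is a minimal sketch of a naive column-based ordering. It is not taken from any particular OCR system: the function name `reading_order` and the `(left, top, width, height)` box format are assumptions made for this example. Blocks whose left edge falls inside the current column are grouped together, and columns are then read left to right, top to bottom.

```python
def reading_order(blocks):
    """Order (left, top, width, height) boxes column by column, top to bottom.

    Hypothetical helper for illustration; real layouts need far more care.
    """
    columns = []  # each entry is [right_edge, list_of_blocks]
    for block in sorted(blocks, key=lambda b: b[0]):
        left, top, width, height = block
        if columns and left < columns[-1][0]:
            # The block starts inside the current column, so it belongs to it.
            columns[-1][1].append(block)
            columns[-1][0] = max(columns[-1][0], left + width)
        else:
            # The block starts to the right of everything seen so far.
            columns.append([left + width, [block]])

    ordered = []
    for _, members in columns:
        # Within one column, read from top to bottom.
        ordered.extend(sorted(members, key=lambda b: b[1]))
    return ordered


blocks = [(110, 0, 100, 50), (0, 60, 100, 50), (0, 0, 100, 50), (110, 60, 100, 50)]
ordered = reading_order(blocks)
```

A single headline spanning both columns would merge everything into one column under this rule, which is exactly the kind of case that makes old newspapers tricky.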

Typefaces Used

  • Old typefaces are used; standard character recognisers cannot read Gothic/Fraktur fonts
  • The quality of the characters to be recognised is often very poor
    • Broken characters
    • Characters mixed with noise, dirt, or handwriting
  • There are characters in old documents that are not available in modern computer fonts
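
The noise and dirt problem mentioned above can be sketched with a simple despeckling step on a binarised page. This is only an illustration under stated assumptions: the page is a list of rows of 0/1 values (1 = ink), and the function name `remove_specks` and the `min_size` threshold are made up for this example. It removes 4-connected groups of ink pixels that are smaller than the threshold.

```python
from collections import deque


def remove_specks(image, min_size=3):
    """Return a copy of `image` with small connected ink groups cleared.

    `image` is a list of rows of 0/1 values (assumed format, 1 = ink).
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [row[:] for row in image]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 4-connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the size threshold are treated as dirt.
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned


page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_specks(page, min_size=3)
```

The trade-off is visible even in this toy version: a threshold that removes dirt will also remove small legitimate marks such as dots, punctuation, or fragments of broken characters, which is part of why degraded historic print is hard to clean.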

Language Issues

  • Historically, spelling was not standardised, so many spelling variants of the same word exist
  • There are no historic dictionaries available
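
One possible workaround for spelling variants, sketched here purely as an illustration, is to map each recognised word to its closest entry in a modern word list using fuzzy string matching. The tiny `MODERN_LEXICON`, the function name `normalise_variant`, and the `cutoff` value are assumptions for this example; only `difflib.get_close_matches` from the Python standard library is a real API.

```python
import difflib

# Hypothetical modern lexicon; a real system would load a much larger word list.
MODERN_LEXICON = ["they", "city", "virtue", "journey", "music", "public"]


def normalise_variant(word, lexicon=MODERN_LEXICON, cutoff=0.7):
    """Return the closest modern spelling, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


examples = [normalise_variant(w) for w in ["musick", "publick", "vertue", "xyzzy"]]
```

Such matching only helps when the historic form is close to a modern one. Words that have fallen out of use, or variants that differ too much from today's spelling, are left unchanged, so the missing historic dictionaries remain a real gap.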

Further Information:

Sources: Images were taken from the linked presentations