Efficient Solutions for OCR Text Remote Correction in Content Conversion Systems

Costin-Anton Boiangiu, Alexandru Topliceanu, Ion Bucur

Abstract


This paper describes a collection of algorithms for detecting text areas in document images using morphological operators, clustering text using geometrical text measurements, and coding images efficiently for fast remote correction in automatic content conversion systems. Text characteristics are automatically discovered and used to filter out all non-text areas in the image. Document page clustering uses a measure of normalized text font resemblance. The approach relies solely on the geometrical characteristics of characters and ignores context and character recognition information. All the algorithms were implemented and tested on a representative set of test images obtained by scanning newspapers, books and magazines.
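
As a rough illustration of the general idea of morphological text-area detection followed by geometric filtering, the sketch below dilates a binary image horizontally so that neighbouring glyphs merge into line-shaped blobs, then keeps only the blobs whose bounding boxes look like text lines. It is not the authors' algorithm: the function names, the structuring element, and the thresholds `min_aspect` and `max_height` are hypothetical choices made for this example only.

```python
from collections import deque


def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - radius), min(w, x + radius + 1)
            if any(img[y][lo:hi]):
                out[y][x] = 1
    return out


def bounding_boxes(img):
    """Return inclusive (x0, y0, x1, y1) boxes of 4-connected foreground components."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not img[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:  # breadth-first flood fill of one component
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def text_like(box, min_aspect=3.0, max_height=4):
    """Hypothetical geometric filter: short, wide boxes are treated as text lines."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0 + 1, y1 - y0 + 1
    return height <= max_height and width / height >= min_aspect


def find_text_areas(img, radius=1):
    """Merge nearby glyphs by dilation, then keep only text-like bounding boxes."""
    return [b for b in bounding_boxes(dilate_horizontal(img, radius)) if text_like(b)]


def make_demo_page():
    """Toy 8 x 16 page: three 2 x 2 'glyphs' on one line plus a 5 x 5 'picture' block."""
    page = [[0] * 16 for _ in range(8)]
    for y in (0, 1):
        for x in (0, 1, 3, 4, 6, 7):
            page[y][x] = 1
    for y in range(3, 8):
        for x in range(10, 15):
            page[y][x] = 1
    return page


print(find_text_areas(make_demo_page()))
```

On the toy page, the three glyphs merge into a single wide box that passes the filter, while the square block is rejected as non-text. A real system would derive the structuring element size and the filter thresholds from the measured text characteristics rather than fixing them by hand.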
