Vol. 4, No. 6 June 2013 ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2013 CIS Journal. All rights reserved.
http://www.cisjournal.org
Optical Character Recognition Techniques: A Survey
Sukhpreet Singh
M.tech Student, Dept. of Computer Engineering, YCOE Talwandi Sabo BP. India.
ABSTRACT
This paper presents a literature review of English OCR techniques. An English OCR system is needed to convert
the many published English books into editable computer text files. Recent research in this area has produced
new methodologies to overcome the complexity of English writing styles. Still, these algorithms have not been tested
on the complete set of characters of the English alphabet. Hence, a system is required that can handle all classes of English text and
identify the characters within these classes.
Keywords: OCR, templates, English-alphabets, alpha-numerics
1. INTRODUCTION
Optical Character Recognition (OCR) [1] – [5] is a
process that converts text present in a digital image into
editable text. It allows a machine to recognize characters
through optical mechanisms. The output of an OCR system
should ideally match the input in formatting. The
process involves some pre-processing of the image file
followed by the acquisition of important knowledge about
the written text.
That knowledge or data can then be used to recognize
characters. OCR [1] is becoming an important part of
modern research-based computer applications. Its
importance has increased especially with the advent of
Unicode and the support of complex scripts on personal
computers.
The current study focuses on exploring
possible techniques for developing an OCR [2] system for
the English language when noise is present in the signal. A
detailed analysis of the English writing system has been carried
out in order to understand the core challenges. Existing OCR
systems are also studied to survey the latest research
in this field. The emphasis was on finding a workable
segmentation technique and diacritic handling for English
strings, and on building a recognition module for these ligatures.
A complete methodology is proposed for developing an
OCR system for English, and a testing application has also
been built. Test results are reported and compared with
previous work done in this area.
2. MOTIVATION
In recent years, a trend has emerged to digitize (historic)
paper-based documents such as books and newspapers.
The aim is to preserve these documents and
make them fully accessible, searchable and processable
in digital form. Knowledge contained in paper-based
documents is more valuable in today’s digital world
when it is available in digital form.
The first step towards transforming a paper-
based archive into a digital archive is to scan the
documents. The next step is to apply an OCR (Optical
Character Recognition) process, meaning that the scanned
image of each document is translated into machine-
processable text. Due to the print quality of the
documents and the error-prone pattern matching
techniques of the OCR process, OCR errors occur.
Modern OCR processors have character recognition rates
of up to 99% on high quality documents. Assuming an
average word length of 5 characters, a 1% character error
rate still means that about one out of 20 words is defective.
Thus, roughly 5% of all processed words will contain OCR
errors even in this best case. On historic
documents this error rate will be even higher because the
print quality is likely to be lower.
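The word-level estimate above can be checked with a short calculation. The sketch below assumes that character errors are independent of one another, which the text does not state and which real OCR errors only approximate. The function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(char_accuracy: float, word_length: int) -> float:
    """Probability that a word contains at least one misrecognized character.

    Assumes character errors are independent (an assumption of this sketch).
    """
    return 1.0 - char_accuracy ** word_length


# 99% character accuracy and an average word length of 5 characters
rate = word_error_rate(0.99, 5)
print(f"{rate:.4f}")  # prints 0.0490, i.e. about one word in 20
```

Under this assumption the result is slightly below 5%, since a single word can contain more than one character error.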
After the OCR process finishes, several post-
processing steps are necessary depending on the
application, e.g. tagging the documents with metadata
(author, year, etc.) or proofreading the documents to
correct OCR errors and spelling mistakes. Data that
contains spelling mistakes or OCR errors is difficult to
process. For example, a standard full-text search will not
retrieve misspelled versions of a query string. To meet
an application’s demanding requirements, which approach
zero errors, a step that corrects these errors is a very
important part of the post-processing chain.
A post-processing error correction system can be
manual, semi-automatic or fully automatic. A semi-
automatic post-correction system detects errors
automatically and proposes corrections to human
correctors, who then have to choose the correct proposal.
A fully automatic post-correction system performs the
detection and correction of errors on its own. Because
semi-automatic or manual correction requires a lot of
human effort and time, fully automatic systems become
necessary to perform a full correction.
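The difference between the semi-automatic and fully automatic modes can be illustrated with a minimal sketch. It assumes a simple lexicon lookup for error detection and uses Python's standard `difflib.get_close_matches` for proposals. The survey does not prescribe either technique, and the names `propose_corrections` and `auto_correct`, as well as the tiny lexicon, are hypothetical.

```python
import difflib


def propose_corrections(tokens, lexicon, max_proposals=3, cutoff=0.6):
    """Semi-automatic mode: flag tokens missing from the lexicon and
    propose similar lexicon entries for a human corrector to choose from."""
    vocabulary = sorted(lexicon)
    proposals = {}
    for token in tokens:
        word = token.lower()
        if word in lexicon:
            continue  # token is known, so it is not flagged as an error
        proposals[token] = difflib.get_close_matches(
            word, vocabulary, n=max_proposals, cutoff=cutoff
        )
    return proposals


def auto_correct(tokens, lexicon):
    """Fully automatic mode: accept the top proposal without human review.
    Tokens with no proposal are left unchanged."""
    proposals = propose_corrections(tokens, lexicon, max_proposals=1)
    return [proposals[t][0] if proposals.get(t) else t for t in tokens]


# Hypothetical lexicon and OCR output, for illustration only
lexicon = {"the", "character", "recognition", "process"}
ocr_tokens = ["the", "charactcr", "rec0gnition", "process"]

print(propose_corrections(ocr_tokens, lexicon))
print(auto_correct(ocr_tokens, lexicon))
```

In the semi-automatic case the returned proposals would be shown to a human corrector. The automatic variant trades that review step for speed and can introduce wrong corrections when the top proposal is not the intended word.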
3. DESIGN OF OCR
Various approaches used for the design of OCR
systems are discussed below:
Matrix Matching [6]: Matrix Matching
converts each character into a pattern within a matrix, and
then compares the pattern with an index of known
characters. Recognition is strongest on monotype and
uniform single-column pages.
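As a rough illustration of matrix matching, the sketch below compares a binary glyph matrix cell by cell against a small index of templates and returns the best-scoring label. The 5x3 templates, the agreement-fraction score, and the function names are assumptions made for this example. Systems described in [6] may use different matrix sizes and similarity measures.

```python
# Hypothetical 5x3 binary templates; a real index would cover far more characters.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def to_matrix(rows):
    """Convert rows of '0'/'1' characters into a matrix of integers."""
    return [[int(ch) for ch in row] for row in rows]


def match_score(glyph, template):
    """Fraction of cells in which the glyph and the template agree."""
    cells = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in cells) / len(cells)


def recognize(glyph_rows):
    """Return the best-matching template label and its score."""
    glyph = to_matrix(glyph_rows)
    scores = {
        label: match_score(glyph, to_matrix(rows))
        for label, rows in TEMPLATES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# A "T" with one noisy pixel in the fourth row
noisy_t = ["111", "010", "010", "110", "010"]
label, score = recognize(noisy_t)
print(label, round(score, 3))  # prints: T 0.933
```

Because the comparison is cell by cell, the glyph must already be segmented, scaled and aligned to the template grid, which is consistent with the method working best on uniform type.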
Fuzzy Logic [6]: Fuzzy logic is a multi-valued
logic that allows intermediate values to be defined