Vol. 4, No. 6 June 2013 ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2013 CIS Journal. All rights reserved.
http://www.cisjournal.org

Optical Character Recognition Techniques: A Survey
Sukhpreet Singh
M.tech Student, Dept. of Computer Engineering, YCOE Talwandi Sabo BP. India.

ABSTRACT
This paper presents a literature review of English OCR techniques. An English OCR system is needed to convert the many published English books into editable computer text files. Recent research in this area has produced new methodologies to overcome the complexity of English writing styles. Still, these algorithms have not been tested on the complete set of characters of the English alphabet. Hence, a system is required that can handle all classes of English text and identify characters among these classes.

Keywords: OCR, templates, English-alphabets, alpha-numerics

1. INTRODUCTION
Optical Character Recognition (OCR) [1] [5] is a process that converts text present in a digital image into editable text. It allows a machine to recognize characters through optical mechanisms. Ideally, the output of an OCR system should preserve the formatting of the input. The process involves pre-processing the image file and then extracting important knowledge about the written text. That knowledge is then used to recognize characters. OCR [1] is becoming an important part of modern research-based computer applications. Its importance has increased especially with the advent of Unicode and support for complex scripts on personal computers. The current study explores possible techniques for developing an OCR [2] system for the English language when noise is present in the signal. A detailed analysis of the English writing system has been carried out in order to understand the core challenges. Existing OCR systems are also studied to survey the latest research in this field.
The emphasis was on finding a workable segmentation technique and diacritic handling for English strings, and on building a recognition module for these ligatures. A complete methodology for developing an OCR system for English is proposed, and a testing application has also been built. Test results are reported and compared with previous work in this area.

2. MOTIVATION
In recent years, a trend has emerged to digitize (historic) paper-based documents such as books and newspapers. The aim is to preserve these documents and make them fully accessible, searchable and processable in digital form. Knowledge contained in paper-based documents is more valuable for today's digital world when it is available in digital form. The first step towards transforming a paper-based archive into a digital archive is to scan the documents. The next step is to apply an OCR (Optical Character Recognition) process, meaning that the scanned image of each document is translated into machine-processable text. Because of the print quality of the documents and the error-prone pattern matching techniques of the OCR process, OCR errors occur. Modern OCR processors achieve character recognition rates of up to 99% on high-quality documents. Assuming an average word length of 5 characters and independent character errors, a word is recognized without error with probability 0.99^5 ≈ 0.951, so roughly one word in 20 is defective. Thus, about 5% of all processed words will contain OCR errors even in this best case. On historic documents the error rate will be even higher because the print quality is likely to be lower. After the OCR process finishes, several post-processing steps are necessary depending on the application, e.g. tagging the documents with meta-data (author, year, etc.) or proof-reading the documents to correct OCR errors and spelling mistakes. Data that contains spelling mistakes or OCR errors is difficult to process. For example, a standard full-text search will not retrieve misspelled versions of a query string.
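The word-level estimate above can be checked with a short calculation. The following is a minimal sketch, not part of the surveyed systems: the function name `word_error_rate` is hypothetical, and it assumes that character errors occur independently of one another, which real OCR output only approximates.

```python
def word_error_rate(char_accuracy: float, word_length: int) -> float:
    """Estimate the fraction of words that contain at least one OCR error.

    Assumption (not from the paper): each character is recognized
    independently with probability `char_accuracy`.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must lie in [0, 1]")
    if word_length < 0:
        raise ValueError("word_length must be non-negative")
    # A word is correct only if every one of its characters is correct.
    return 1.0 - char_accuracy ** word_length


if __name__ == "__main__":
    # Character accuracy and word length taken from the text above.
    rate = word_error_rate(0.99, 5)
    print(f"Estimated word error rate: {rate:.2%}")
```

With a character recognition rate of 99% and 5-character words, the estimate comes out at about 4.9%, which matches the figure of roughly one defective word in 20. Lower character accuracies, as expected on historic documents, push the word error rate up quickly.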
To meet applications' demanding requirements for near-zero errors, a step that corrects these errors is a very important part of the post-processing chain. A post-processing error correction system can be manual, semi-automatic or fully automatic. A semi-automatic post-correction system detects errors automatically and proposes corrections to human correctors, who then have to choose the correct proposal. A fully automatic post-correction system detects and corrects errors on its own. Because semi-automatic or manual correction requires a lot of human effort and time, fully automatic systems become necessary to perform a full correction.

3. DESIGN OF OCR
Various approaches used for the design of OCR systems are discussed below:

Matrix Matching [6]: Matrix Matching converts each character into a pattern within a matrix, and then compares the pattern with an index of known characters. Its recognition is strongest on monotype and uniform single-column pages.

Fuzzy Logic [6]: Fuzzy logic is a multi-valued logic that allows intermediate values to be defined