www.ijcait.com
International Journal of Computer Applications & Information Technology
Vol. I, Issue III, November 2012 (ISSN: 2278-7720)

A-Survey of Feature Extraction and Classification Techniques in OCR Systems

Rohit Verma
School of Information Technology, Apeejay Institute of Management Technical Campus, Jalandhar, India

Dr. Jahid Ali
Sri Sai Iqbal College of Management & Information Technology, Badhani, Pathankot, India

Abstract: This paper describes a set of feature extraction and classification techniques, which play a very important role in the recognition of characters. Feature extraction provides methods with which characters can be identified uniquely and with a high degree of accuracy, and it helps to find the shape contained in the pattern. Although a number of techniques are available for feature extraction and classification, the choice of technique decides the degree of recognition accuracy. A lot of research has been done in this field, and new extraction and classification techniques have been developed. The objective of this paper is to review these techniques so that they can be appreciated as a set.

Keywords: OCR, Feature Extraction.

1. Introduction

Optical Character Recognition (OCR) is one of the most important gifts of computer science to mankind. It has made a lot of tedious work easy and fast. OCR is the recognition of machine-printed or handwritten text by a computer, followed by its conversion to an editable form as required. The phases of an OCR system are:

a) Digitization
b) Pre-processing
c) Segmentation
d) Feature Extraction & Classification
e) Post-processing

In the Digitization phase, the printed or handwritten data is converted into digital form, either by scanning the document or by writing with a digitizer connected to an LCD. The second phase is the Pre-processing phase. In this phase the basic operations performed are binarization, contour smoothing, noise reduction, skew detection and finally skeletonization of the digital image. In the Segmentation phase, the image of a sequence of characters is decomposed into lines, then words, and then characters. The Feature Extraction phase analyses a given text segment and selects a set of features that can be used to identify it uniquely. In the Classification part of this phase, the extracted features are used to identify the text on the basis of pre-decided rules. The output of an OCR system generally contains errors; correcting them is the responsibility of the last phase, Post-processing. All phases of the OCR system are therefore important, but the phase most directly responsible for character recognition is Feature Extraction and Classification.

2. Feature Extraction

Different characters have different features, and characters are recognized on the basis of these features. Feature extraction can thus be defined as the process of extracting differentiating features from the matrices of digitized characters. A number of features on which OCR systems rely to recognize characters have been reported in the literature. According to C. Y. Suen (1986), the features of a character can be classified into two classes: global or statistical features, and structural or topological features.

Global or Statistical Features

Global features are obtained from the arrangement of the points constituting the character matrix. They are easier to detect than topological features and are less affected by noise or distortion.
A number of techniques are used for feature extraction; some of these are moments, zoning, projection histograms, n-tuples, crossings and distances.

i) Moments

In this case the moments of the points that make up a character are used as features. Heutte et al. (1998) state that these are among the most commonly used methods in character recognition. M. K. Hu (1962) introduced the concept of classical moment invariants. Hu's moments are statistical measures of the pixel distribution about the centre of gravity of the character. S. S. Reddi (1981) proposed radial and angular moments, whereas Zernike moments were proposed by Teh and Chin (1988). Similarly, Suen et al. (1980) suggested that characters can also be identified by considering the distances of the pixels in the character matrix from the centroid of the character. These are called central moments.
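
As a minimal illustrative sketch (not taken from any of the cited papers), the centroid and central moments of a binarized character matrix can be computed as below. It assumes the character is a 2-D NumPy array with foreground pixels set to 1. The function names (raw_moment, centroid, central_moment, hu_first_invariant) and the toy glyph are hypothetical and chosen only for this example. The last function computes the first of Hu's invariants, phi_1 = eta_20 + eta_02, using the standard normalization eta_pq = mu_pq / mu_00^(1 + (p+q)/2).

```python
import numpy as np

def raw_moment(img, p, q):
    """Raw moment m_pq: sum of x^p * y^q * I(x, y) over all pixels."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum((xs ** p) * (ys ** q) * img))

def centroid(img):
    """Centre of gravity (x, y) of the foreground pixels."""
    m00 = raw_moment(img, 0, 0)
    return raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00

def central_moment(img, p, q):
    """Central moment mu_pq: the same sum, measured from the centroid."""
    cx, cy = centroid(img)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum(((xs - cx) ** p) * ((ys - cy) ** q) * img))

def hu_first_invariant(img):
    """First Hu invariant phi_1 = eta_20 + eta_02 (for p + q = 2, eta = mu / mu_00^2)."""
    mu00 = central_moment(img, 0, 0)
    return (central_moment(img, 2, 0) + central_moment(img, 0, 2)) / mu00 ** 2

# Toy glyph (assumption for illustration): a 5-pixel vertical bar in a 7x7 matrix.
glyph = np.zeros((7, 7), dtype=float)
glyph[1:6, 3] = 1.0

# The same bar moved two columns to the right.
shifted = np.roll(glyph, 2, axis=1)
```

Because central moments are measured from the centroid, the two glyphs above yield the same central moments even though the bar sits in different columns. This translation invariance is the reason such moments are attractive as global features.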