International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 4, April 2012)

Word Level Handwritten and Printed Text Separation Based on Shape Features

Upasana Patil 1, Masarath Begum 2
1,2 Department of Computer Science, GND Engineering College, Bidar, India
1 upasana.patil@gmail.com 2 masarath456@gmail.com

Abstract— In this paper, we present a method for discriminating handwritten and printed text in document images based on shape features. Separating handwritten and printed text in a document image is essential to optimize OCR accuracy and to activate an appropriate OCR engine. It reduces the search space of the OCR and also facilitates the retrieval of handwritten and printed text from document images. We used the IAM dataset 3.0: 74 pages were segmented with morphological transformations, yielding 10768 words, of which 2000 were used for experimentation. The method achieved an average accuracy of 98.57% with only seven features. The proposed method is simple, has promising discrimination accuracy and has lower time complexity than [10].

Keywords—Document Image Analysis, OCR, Shape Features, Handwritten and Printed Text.

I. INTRODUCTION

The integration of new technologies and inventions is leading us towards a paperless office and a paperless society. Document image analysis is one of the important steps in automating offices. Every office activity involves papers in the form of petitions, application forms, reports, letters and accounts. In most situations we come across numerous documents that contain a mixture of handwritten and printed text, for example railway reservation forms, bank cheques and memorandums. Handwritten and printed text is often interlaced at the word level, line level and paragraph level. The recognition of such documents is a challenging task for OCR designers.
To optimize OCR accuracy, handwritten and printed text must be separated in such documents before the OCR engine is activated. This separation reduces the search space of the OCR and also facilitates the retrieval of handwritten and printed text documents. Thus, the problem of automatic discrimination between handwritten and printed text in document images may be addressed in three different cases: paragraph level separation, line level separation and word level separation. In this paper, word level handwritten and printed text separation is carried out.

In Section 2, we present a survey of the literature. The details of the dataset used for experimentation are presented in Section 3. In Section 4, the feature extraction methodology is given, and Sections 5 and 6 describe the classification of text words and the results, respectively. The conclusion is drawn in Section 7.

II. LITERATURE SURVEY

There exist some research publications on the discrimination of machine printed and handwritten text. Imade et al. [1] extracted gradient and luminance histograms and then applied a neural network to segment a gray level document image into machine printed character, handwritten character, photograph and painted regions. Kuhnke et al. [2] developed a method for distinguishing between machine printed and handwritten character images using directional and symmetric features as the input of a neural network. Violante et al. [3] described a method for discriminating between handwritten and printed text that extracts low level features and classifies them using a feed forward multilayer perceptron neural network. U. Pal and Chaudhuri [4-5] reported a scheme for the automatic separation of machine printed and handwritten text lines for two Indic scripts. They used structural and statistical features as the nodes of a tree classifier.
Guo et al. [6] used Hidden Markov Models (HMM) to extract handwritten text words from printed text documents. Zheng et al. [7] achieved an accuracy of 96% using an SVM classifier with features such as Gabor filter and run length histogram features. They further improved their result to 98.10% by implementing a Markov Random Field based post processing step. Ergina K et al. [8] presented a trainable approach to discriminate between machine printed and handwritten text using simple structural characteristics and discriminant analysis for classification.