Automatic discrimination between printed and handwritten text in documents

Lincoln Faria da Silva, Aura Conci
Instituto de Computação, Universidade Federal Fluminense - UFF, Niterói, Brasil
{lsilva, aconci}@ic.uff.br

Angel Sanchez
Departamento de Ciencias de la Computación, Universidad Rey Juan Carlos, Madrid, Spain
angel.sanchez@urjc.es

Abstract—Recognition techniques for printed and handwritten text in scanned documents are significantly different. In this paper we address the problem of identifying each type. The process comprises at least four steps: digitalization, preprocessing, feature extraction, and decision or classification. A novel aspect of our approach is the use of data mining techniques in the decision step. A new set of features extracted from each word is proposed as well. Classification rules are mined and used to discern printed text from handwritten text. The proposed system was tested on two public image databases. All computed efficiency measures were above 80% in every case.

Keywords-Data Mining; document analysis; text identification; optical character recognition; machine vision

I. INTRODUCTION

A great number of applications use documents containing both printed text and handwriting. Old documents, petitions, requests, applications for college admission, letters, requirements, memorandums, envelopes and bank checks are some examples. A considerable obstacle for optical character recognition (OCR) systems is the mixture of printed and handwritten text in the same image. Each text type should be processed using different methods in order to optimize recognition accuracy. Previous works have addressed the problem of identifying each type with various classification techniques: neural networks [1-7], linear polynomial discrimination functions [8], Fisher classifiers [9-12], tree classifiers [13-14], Hidden Markov Models (HMM) [15], and minimal-distance classifiers [16-17].
In this paper we propose mining classification rules with the WEKA tool [23]. This enables us to select the best rules from a group of possible classifiers, using features extracted from each word of the document. The main advantages over other classifiers are accuracy, efficiency, simplicity and low computational complexity. Because classification is performed per word rather than per line, the system can analyze more complex pages in which both types of characters are mixed in the same line. However, all documents to be classified are assumed to be aligned with the scanner: the implemented system handles documents with adequate orientation at the acquisition step, not skewed ones. The document image is first preprocessed by various techniques. The text is then segmented at the word level, with each word surrounded by a bounding box (BB). Features are extracted from these BBs, and the classification rules decide whether a BB contains printed or handwritten text. Two public image databases are used to verify the implemented system. Both yield very satisfactory results, permitting evaluation of its robustness. This paper is organized as follows. Section 2 presents each step of the overall system. Section 3 covers the training, tests and results. Finally, Section 4 summarizes the conclusions and future improvements.

II. THE PROPOSED APPROACH

This section presents the proposed system. It describes the type of document processed, the applied image-processing techniques, the segmentation of the text into words, the features extracted from these words, and the classification process executed by the system. Figure 1 shows an overview of the system. It has four main steps: preprocessing, text segmentation into words, feature extraction and classification.
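The word-level segmentation step can be sketched with ink projections: rows containing ink delimit text lines, and within a line, character blobs separated by small horizontal gaps are merged into one word BB. This is only a minimal illustration under our own assumptions, not the authors' implementation; the function names and the `max_char_gap` parameter are invented here.

```python
import numpy as np

def _runs(mask):
    # (start, end) index pairs of contiguous True runs in a 1-D boolean mask
    idx = np.flatnonzero(np.diff(np.r_[False, mask, False].astype(int)))
    return list(zip(idx[::2], idx[1::2]))

def word_bounding_boxes(binary, max_char_gap=2):
    """binary: 2-D array, 1 = ink, 0 = background.
    Returns (top, bottom, left, right) boxes, one per detected word.
    Illustrative sketch only; parameters would need tuning per database."""
    boxes = []
    for top, bottom in _runs(binary.any(axis=1)):        # text lines
        line = binary[top:bottom]
        col_runs = _runs(line.any(axis=0))               # character blobs
        # Merge blobs whose horizontal gap is small: they form one word
        merged = [list(col_runs[0])] if col_runs else []
        for left, right in col_runs[1:]:
            if left - merged[-1][1] <= max_char_gap:
                merged[-1][1] = right
            else:
                merged.append([left, right])
        boxes += [(int(top), int(bottom), int(l), int(r)) for l, r in merged]
    return boxes
```

On a synthetic page, two blobs one pixel apart are merged into a single word box, while a distant blob becomes a separate box.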
A. Document types

The developed system considers application forms for various purposes, such as subscription forms, research questionnaires or preprinted memorandums. Blank regions, lines, and printed and handwritten words can be found all over these documents. However, they do not contain logos, figures, tables, graphs or other types of elements. Figure 2 shows an example of the images that can be processed. Note that systems performing classification at the line level cannot handle the combination of handwritten and printed text in the same line of a document.

Fig. 1. Overview of the system
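As described in the overview, the decision step applies mined classification rules to the features of each word's BB. A mined rule is essentially an if-then test over those features; the sketch below shows its general form. The feature names and thresholds here are hypothetical, chosen only for illustration; the actual rules in the paper come from WEKA's rule learners applied to the authors' own feature set.

```python
def classify_word(features):
    """Return 'printed' or 'handwritten' for one word's feature dict.
    Illustrative rule only: 'height_std' and 'baseline_dev' are invented
    feature names, and the thresholds are not from the paper."""
    # Printed characters tend to share a uniform height and sit on a
    # regular baseline; handwriting varies far more in both respects.
    if features["height_std"] <= 1.5 and features["baseline_dev"] <= 2.0:
        return "printed"
    return "handwritten"
```

In practice, a rule learner would output several such rules ordered by confidence, and each word's BB would be labeled by the first rule that fires.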