Date Field Extraction in Handwritten Documents

Ranju Mandal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India
ranjumandal@gmail.com

Partha Pratim Roy
Laboratoire d’Informatique, Université François Rabelais, Tours, France
partha.roy@univ-tours.fr

Umapada Pal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India
umapada@isical.ac.in

Abstract

Automatic extraction of date patterns from handwritten documents is challenging because of the varied writing styles of different individuals, touching characters, and confusion between alphabetic characters and digits. In this paper, we propose a framework for retrieving date patterns from handwritten documents. The method first classifies the word components of each text line into month and non-month classes using word-level features. Next, non-month words are segmented into individual components, and each component is classified as an alphabetic character, a digit, or a punctuation mark. Using this word-level and character-level information, date patterns are first searched for using a voting approach, and candidate lines for numeric and semi-numeric dates are then detected using regular expressions. Gradient-based features and a Support Vector Machine (SVM) are used for classification. The experiment is performed on a handwritten dataset, and we have obtained encouraging results.

I. INTRODUCTION

The date is a useful and important piece of information that can serve as a key for searching and indexing handwritten documents such as administrative documents, historical archives, and postal mail. Some available OCR engines [1] do not work well on handwritten documents. The output of such OCR engines cannot be used by date-extracting compilers because of the poor recognition results. Hence, a date extraction process for such documents would be very useful for searching and interpretation.
To the best of our knowledge, no existing work searches for date patterns in printed or handwritten documents. Date pattern detection and interpretation in handwritten documents is a challenging task due to the unconstrained handwriting styles of different individuals. The alphanumeric characters that represent a date are sometimes touching, and recognition confusion between numerals and alphabetic characters makes the task more challenging. Two examples of handwritten documents containing date information are shown in Fig. 1. Note that date patterns appear in different formats in documents. A single date may be written as 12/03/2012, 12th March, 2012, March 12, 2012, 12-03-2012, 12.03.2012, or 12.03.12, among other formats. Automatic searching for such varied date patterns in documents is difficult.

A few research works have been published on automatic form field extraction from handwritten documents [2, 3, 4]. Recently, field-based information retrieval has become more popular than recognition of full handwritten documents. Koch et al. [3] proposed a method using HMMs for numerical field extraction. To localize the desired numerical fields, a syntactic analyzer is applied to the handwritten text lines. Thomas et al. [2] proposed an HMM-based classification model for alphanumeric sequence recognition. Chatelain et al. [4] proposed an approach to locate numerical sequences using segmentation-driven recognition. To extract the desired numerical sequence, a syntactic analysis is performed on each line of text. Most of the papers mentioned above deal with alphanumeric string extraction. This paper moves a step further in document interpretation and uses the recognition labels of alphanumeric characters to locate the date fields in documents. To our knowledge, this is the first work on date extraction.

Figure 1. Sample handwritten documents containing date fields. Numeric and semi-numeric date fields are marked with blue and red rectangles, respectively.
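The idea of locating dates from recognition labels, rather than from a full transcription, can be illustrated with a minimal sketch. It assumes each recognized component is reduced to a one-letter class label (M for a month word, A for an alphabetic character, D for a digit, P for punctuation) and that regular expressions are then matched against the resulting label string. The label alphabet, the two patterns, and the helper names `to_labels` and `classify_line` are assumptions made for illustration; they are not the expressions used by the authors, and `to_labels` only simulates the classifier output from typed text.

```python
import re

# Hypothetical one-letter class labels (not taken from the paper):
# M = month word, A = alphabetic character, D = digit, P = punctuation.
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec",
}

# Illustrative patterns over label strings (assumed, not the authors' own).
NUMERIC = re.compile(r"D{1,2}PD{1,2}P(?:D{4}|D{2})")  # e.g. 12/03/2012, 12.03.12
SEMI_NUMERIC = re.compile(
    r"D{1,2}(?:AA)?MP?(?:D{4}|D{2})"    # e.g. 12th March, 2012
    r"|MD{1,2}(?:AA)?P?(?:D{4}|D{2})"   # e.g. March 12, 2012
)


def to_labels(text):
    """Stand-in for the word/character classifiers: typed text -> label string."""
    labels = []
    # Tokens: runs of letters (words), single digits, single other symbols.
    for token in re.findall(r"[A-Za-z]+|\d|[^\sA-Za-z\d]", text):
        if token.isalpha():
            # A month word becomes one M; any other word is one A per letter.
            labels.append("M" if token.lower() in MONTHS else "A" * len(token))
        elif token.isdigit():
            labels.append("D")
        else:
            labels.append("P")
    return "".join(labels)


def classify_line(text):
    """Return 'numeric', 'semi-numeric', or None for a line of text."""
    labels = to_labels(text)
    if NUMERIC.search(labels):
        return "numeric"
    if SEMI_NUMERIC.search(labels):
        return "semi-numeric"
    return None


if __name__ == "__main__":
    for line in ["12/03/2012", "12th March, 2012", "March 12, 2012", "Dear Sir"]:
        print(line, "->", to_labels(line), "->", classify_line(line))
```

Matching on class labels means the exact identity of each digit or letter does not need to be recognized correctly for a line to be flagged as a date candidate; only the class of each component matters.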
A block diagram of our proposed system is shown in Fig. 2. A three-stage approach is proposed here for date field extraction. In the first stage, month and non-month handwritten word blocks are separated. For this purpose, word blocks are extracted using a morphological operation and the segmented word blocks

21st International Conference on Pattern Recognition (ICPR 2012), November 11-15, 2012, Tsukuba, Japan. 978-4-9906441-0-9 ©2012 ICPR