Date Field Extraction in Handwritten Documents
Ranju Mandal
Computer Vision and Pattern
Recognition Unit, Indian Statistical
Institute, Kolkata-108, India
ranjumandal@gmail.com
Partha Pratim Roy
Laboratoire d’Informatique
Université François Rabelais
Tours, France
partha.roy@univ-tours.fr
Umapada Pal
Computer Vision and Pattern
Recognition Unit, Indian Statistical
Institute, Kolkata-108, India
umapada@isical.ac.in
Abstract
Automatic extraction of date patterns from handwritten
documents is challenging because of the varied writing
styles of different individuals, touching characters, and
confusion between alphabetic characters and digits.
In this paper, we propose a framework for retrieval of
date patterns from handwritten documents. The method
first classifies the word components of each text line into
month and non-month classes using word-level features.
Next, non-month words are segmented into individual
components, each of which is classified as alphabet, digit,
or punctuation. Using this word-level and character-level
information, the date patterns are first searched using a
voting approach, and then the candidate lines for numeric
and semi-numeric dates are detected using regular
expressions. Gradient-based features and a
Support Vector Machine (SVM) are used in our work
for classification. Experiments were performed on a
handwritten dataset, and we obtained encouraging
results.
I. INTRODUCTION
A date is useful and important information that can
be used as a key for searching and indexing
handwritten documents such as administrative documents,
historical archives, and postal mail. Some available
OCR engines [1] do not work well on
handwritten documents, and the output of such OCRs
cannot be passed to date extraction tools because of
poor recognition results. Hence, a date extraction process
for such documents will be very useful in searching
and interpretation. To the best of our knowledge, there
is no work that can search date patterns in
printed/handwritten documents.
Date pattern detection and interpretation in
handwritten documents is a challenging task due to the
unconstrained handwriting styles of different
individuals. Alpha-numeric characters that represent a
date are sometimes touching, and recognition confusion
between numerals and alphabetic characters makes the
task more challenging. Two examples of
handwritten documents containing date information are
shown in Fig. 1. Note that date patterns appear in
different formats in documents. Some of the formats
of a single date are 12/03/2012 or 12th March, 2012 or
March 12, 2012 or 12-03-2012 or 12.03.2012 or
12.03.12, etc. Automatic searching of such varied
date patterns in documents is difficult.
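The variety of formats listed above can be illustrated with a minimal sketch in Python. It assumes the text line has already been transcribed into a character string, which is a simplification: the framework in this paper works on classified components of handwritten images, not on clean text. The names NUMERIC_DATE, SEMI_NUMERIC_DATE, and find_dates are hypothetical and are not part of the proposed system.

```python
import re

# Assumption: month names are written in full, in English.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Numeric dates such as 12/03/2012, 12-03-2012, 12.03.2012 or 12.03.12.
# The backreference \1 requires the same separator in both positions.
NUMERIC_DATE = re.compile(r"\b\d{1,2}([/.-])\d{1,2}\1(?:\d{4}|\d{2})\b")

# Semi-numeric dates such as "12th March, 2012" or "March 12, 2012".
SEMI_NUMERIC_DATE = re.compile(
    rf"\b(?:\d{{1,2}}(?:st|nd|rd|th)?\s+(?:{MONTHS})"
    rf"|(?:{MONTHS})\s+\d{{1,2}}),?\s+\d{{4}}\b",
    re.IGNORECASE,
)


def find_dates(line):
    """Return (kind, text) pairs for each date-like span in a text line."""
    found = [("numeric", m.group(0)) for m in NUMERIC_DATE.finditer(line)]
    found += [
        ("semi-numeric", m.group(0)) for m in SEMI_NUMERIC_DATE.finditer(line)
    ]
    return found
```

For example, find_dates("Dated 12th March, 2012") yields one semi-numeric match. The patterns only check the shape of a date and do not validate day or month ranges.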
A few research works have been published on
automatic form field extraction from handwritten
documents [2, 3, 4]. Recently, field-based information
retrieval has become more popular than recognition of
full handwritten documents. Koch et al. [3] proposed a
method using HMM for numerical field extraction. To
localize the desired numerical fields, a syntactic analyzer
was applied over the handwritten text lines.
Thomas et al. [2] proposed an HMM-based classification
model for alpha-numerical sequence recognition.
Chatelain et al. [4] proposed an approach to locate
numerical sequences using segmentation-driven
recognition. To extract the desired numerical sequence,
a syntactical analysis was performed on each line of
text. Most of the papers mentioned above deal with
alpha-numeric string extraction. This paper moves a step
further in document interpretation and uses the
recognition labels of alpha-numeric characters to locate
the date fields in documents. This is the first work on
date extraction.
Figure 1. Sample handwritten documents containing date
fields. Numeric and semi-numeric date fields are marked
with blue and red rectangles, respectively.
A block diagram of our proposed system is shown in
Fig. 2. A three-stage approach is proposed here
for date field extraction. In the first stage, month and
non-month handwritten word blocks are separated. For
this purpose, word blocks are extracted using a
morphological operation and the segmented word blocks
21st International Conference on Pattern Recognition (ICPR 2012)
November 11-15, 2012. Tsukuba, Japan
978-4-9906441-0-9 ©2012 ICPR 533