Segmentation-free Word Spotting in Historical Printed Documents
B. Gatos and I. Pratikakis
Computational Intelligence Laboratory,
Institute of Informatics and Telecommunications,
National Research Center "Demokritos",
153 10 Athens, Greece
{bgat, ipratika}@iit.demokritos.gr
Abstract
In this paper, a new efficient word spotting
methodology is presented that can be applied to
historical printed documents without requiring any
previous block or word segmentation step. We aim for a
segmentation-free methodology because, for many
historical documents, the segmentation process does not
produce meaningful results due to unconstrained layout,
various degradations or typesetting imperfections. The
proposed method is based on block-based document
image descriptors that are used in a template matching
process that is invariant to translation, rotation and
scaling. Computation time is reduced by applying the
matching process only to salient regions of the image.
Experimental results on a database of representative
historical printed documents demonstrate the efficiency
of the proposed approach.
1. Introduction
Effective historical document indexing and retrieval
poses a great challenge due to the vast amount of
information that is available in libraries all over the
world in the form of printed or handwritten
manuscripts. The challenge is amplified by the
variability of documents, due to multilinguality and
the wide range of historical periods that available
collections span, as well as by the poor quality of
existing historical documents.
Word spotting is a content-based retrieval
procedure which results in a ranked list of word
images that are similar to a query word image. The
query is either an actual example from the
collection of interest or is artificially generated from
an ASCII keyword. A crucial aspect of the retrieval
procedure is the word image representation, which
relies upon robust features. Word spotting is mostly
used in an unsupervised manner, and the lack of
dependencies such as training and the ease of using
several different feature variations make it a very
appealing alternative to Optical Character Recognition
(OCR), which is a difficult problem to solve,
especially for historical documents.
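As an illustration of the retrieval step described above, the following minimal sketch ranks candidate word images by their distance to a query. It assumes that each word image has already been reduced to a fixed-length feature vector; the function name, the candidate identifiers and the choice of Euclidean distance are hypothetical and are not taken from this paper.

```python
import math


def rank_by_similarity(query_features, candidates):
    """Return candidate ids ordered from most to least similar to the query.

    `candidates` maps a (hypothetical) word-image id to its feature vector.
    Euclidean distance is an assumption made for illustration only.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Score every candidate, then sort by increasing distance.
    scored = [(distance(query_features, feats), cid)
              for cid, feats in candidates.items()]
    scored.sort()
    return [cid for _, cid in scored]
```

No training is involved in such a scheme: changing the feature extraction or the distance function changes the ranking without any other modification.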
In the literature, word spotting appears under two
distinct trends: the segmentation-based approach and
the segmentation-free approach. In the former
approach, considerable effort is devoted to solving
the word segmentation problem [1-4].
In the latter approach, the query word image is
fitted to the corresponding word images in the
document without any segmentation involved, with the
underlying problem mostly viewed as template matching.
Representative work is reported in [5], which uses
differential features that are compared using a cohesive
elastic matching method based on zones of interest, in
order to match only the informative parts of the words.
In the same spirit as the aforementioned
approach, this paper presents a segmentation-free
word spotting methodology that permits fast and
effective retrieval based on block-based document
image descriptors, which are used in a template
matching process that is invariant to translation,
rotation and scaling.
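The general idea of matching block-based descriptors only at selected positions can be sketched as follows. This is a simplified illustration under stated assumptions, not the descriptor or matching scheme of Section 2: the descriptor is assumed to be the foreground pixel density of each block in a fixed grid, the distance is assumed to be L1, and the list of candidate positions (standing in for salient regions) is assumed to be supplied by the caller. The sketch evaluates the query at different positions only; it does not handle rotation or scaling. All function and parameter names are hypothetical.

```python
def block_descriptor(image, top, left, height, width, rows=2, cols=4):
    """Foreground density of each block in a rows x cols grid over a window.

    `image` is a binary image given as a list of rows (1 = ink, 0 = background).
    """
    desc = []
    for r in range(rows):
        for c in range(cols):
            y0 = top + r * height // rows
            y1 = top + (r + 1) * height // rows
            x0 = left + c * width // cols
            x1 = left + (c + 1) * width // cols
            area = max(1, (y1 - y0) * (x1 - x0))
            ink = sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))
            desc.append(ink / area)
    return desc


def spot(page, query, candidates, rows=2, cols=4):
    """Rank candidate top-left positions by L1 distance to the query descriptor.

    Only the positions in `candidates` are evaluated, instead of every
    pixel position on the page.
    """
    qh, qw = len(query), len(query[0])
    qdesc = block_descriptor(query, 0, 0, qh, qw, rows, cols)
    results = []
    for top, left in candidates:
        # Skip windows that would fall outside the page.
        if top < 0 or left < 0 or top + qh > len(page) or left + qw > len(page[0]):
            continue
        d = block_descriptor(page, top, left, qh, qw, rows, cols)
        dist = sum(abs(a - b) for a, b in zip(qdesc, d))
        results.append((dist, (top, left)))
    results.sort()
    return results
```

Restricting the loop to a short list of candidate positions is what reduces the cost relative to an exhaustive sliding window, at the price of depending on how well those positions are chosen.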
The remainder of the paper is structured as
follows. The proposed methodology is detailed in
Section 2. In Section 3, the evaluation results on
representative historical documents are presented, and
in Section 4, conclusions are drawn.
2009 10th International Conference on Document Analysis and Recognition
978-0-7695-3725-2/09 $25.00 © 2009 IEEE
DOI 10.1109/ICDAR.2009.236