Pattern Recognition 43 (2010) 369--377
Handwritten document image segmentation into text lines and words
Vassilis Papavassiliou a,b,∗, Themos Stafylakis a,b, Vassilis Katsouros a, George Carayannis a,b
a Institute for Language and Speech Processing of R.C. “Athena” Artemidos 6 & Epidavrou, GR-151 25 Maroussi, Greece
b National Technical University of Athens, School of Electrical and Computer Engineers, 9, Iroon Polytechniou str, GR 157 80 Athens, Greece
ARTICLE INFO
Article history:
Received 22 July 2008
Received in revised form 23 February 2009
Accepted 14 May 2009
Keywords:
Handwritten text line segmentation
Handwritten word segmentation
Document image processing
Viterbi estimation
Support vector machines
ABSTRACT
Two novel approaches to extract text lines and words from handwritten documents are presented. The
line segmentation algorithm is based on locating the optimal succession of text and gap areas within
vertical zones by applying the Viterbi algorithm. Then, a text-line separator drawing technique is applied and
finally the connected components are assigned to text lines. Word segmentation is based on a gap metric
that exploits the objective function of a soft-margin linear SVM that separates successive connected
components. The algorithms were tested on the benchmarking datasets of the ICDAR07 handwriting segmentation
contest and outperformed the participating algorithms.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Document image segmentation into text lines and words is a critical
stage towards unconstrained handwritten document recognition.
Variation of the skew angle between text lines or along the
same text line, overlapping or touching lines, variable
character size and non-Manhattan layout are the main challenges of text
line extraction. Due to the high variability of writing styles, scripts, etc.,
methods that do not use any prior knowledge and adapt to the
properties of the document image, such as the ones proposed here, would be
more robust. Line extraction techniques may be categorized as
projection-based, grouping, smearing and Hough-based [1].
Global projection-based approaches are very effective for
machine-printed documents but cannot handle text lines with different
skew angles. However, they can be applied for skew correction
in documents with a constant skew angle [2]. Hough-based methods
handle documents with variation in the skew angle between text
lines, but are not very effective when the skew of a text line varies
along its width [3]. Thus, we adopt piece-wise projections, which can
deal with both types of skew angle variation [4,5].
∗ Corresponding author at: Institute for Language and Speech Processing of R.C.
“Athena” Artemidos 6 & Epidavrou, GR-151 25 Maroussi, Greece.
Tel.: +30 210 6875332; fax: +30 210 6854270.
E-mail addresses: vpapa@ilsp.gr (V. Papavassiliou), themosst@ilsp.gr
(T. Stafylakis), vsk@ilsp.gr (V. Katsouros), gcara@ilsp.gr (G. Carayannis).
0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2009.05.007
On the other hand, piece-wise projections are sensitive to
variation in character size within text lines and to significant gaps
between successive words. These occurrences also influence the
effectiveness of smearing methods [6]. In such cases, the results
of two adjacent zones may be ambiguous, affecting the drawing of
text-line separators along the document width. To deal with these
problems we introduce a smooth version of the projection profiles
to oversegment each zone into candidate text and gap regions. Then,
we reclassify these regions by applying an HMM formulation that
uses statistics from the whole document page. Starting from the
left and moving to the right, we combine separators of consecutive
zones considering their proximity and the local foreground density.
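The reclassification of candidate regions can be pictured as Viterbi decoding over a two-state chain (text, gap). The sketch below is only a generic illustration: the emission model (the foreground density of a region taken as its probability of being text), the transition probabilities and the function names are all assumptions for this example, not the HMM formulation of the paper.

```python
import math

STATES = ("text", "gap")


def viterbi(log_emit, log_trans, log_init):
    """Most likely state sequence for per-step log emission scores."""
    k = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(k)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(k):
            # Best predecessor state for landing in state s at step t.
            prev = max(range(k), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s] + log_emit[t][s])
            ptr.append(prev)
        score = new_score
        back.append(ptr)
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # follow the back-pointers to the start
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def reclassify_regions(densities, stay=0.6):
    """Label regions of one zone; `densities` and `stay` are hypothetical inputs."""
    eps = 1e-9  # keeps log() finite for densities of exactly 0 or 1
    log_emit = [
        (math.log(min(max(d, eps), 1 - eps)),      # assumed P(obs | text)
         math.log(min(max(1 - d, eps), 1 - eps)))  # assumed P(obs | gap)
        for d in densities
    ]
    log_trans = [
        [math.log(stay), math.log(1 - stay)],
        [math.log(1 - stay), math.log(stay)],
    ]
    log_init = [math.log(0.5), math.log(0.5)]
    return [STATES[s] for s in viterbi(log_emit, log_trans, log_init)]


labels = reclassify_regions([0.9, 0.1, 0.8, 0.45, 0.85])
```

Taken on its own, the fourth region (density 0.45) would be labelled a gap under this toy emission model; decoding the whole sequence relabels it as text because both of its neighbours are confidently text. This is the kind of context-driven correction the reclassification step is meant to provide.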
Grouping approaches can handle complex layouts, but they fail
to distinguish touching text lines [7]. In our approach, we deal with
such a case by splitting the respective connected component (CC)
and assigning the individual parts to the corresponding text lines.
In word segmentation, most of the proposed techniques consider
a spatial measure of the gap between successive CCs and define a
threshold to classify “within” and “between” word gaps [8]. These
measures are sensitive to CCs' shape, e.g. a simple extension of the
horizontal part of character “t”. We introduce a novel gap measure
which is more tolerant to such cases. The proposed measure results
from the optimal value of the objective function of a soft-margin
linear SVM that separates consecutive CCs.
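To make the idea of an SVM-based gap measure concrete, the sketch below approximates the optimal value of the soft-margin objective 0.5·||w||² + C·Σξ for two point sets standing in for consecutive CCs. It is a minimal sketch under stated assumptions: a plain subgradient-descent loop is swapped in for a proper QP solver, and the function name, the parameters (`c`, `step`, `iters`) and the toy coordinates are hypothetical. How the paper turns the objective value into its final gap metric is not specified in this section.

```python
def svm_gap_objective(cc_a, cc_b, c=1.0, step=0.001, iters=20000):
    """Approximate min over (w, b) of 0.5*||w||^2 + c * sum of hinge losses.

    cc_a and cc_b are lists of (x, y) points of two connected components.
    The solver is plain subgradient descent (an assumption of this sketch),
    so the returned value is slightly above the true optimum.
    """
    points = [(x, y, -1.0) for x, y in cc_a] + [(x, y, 1.0) for x, y in cc_b]

    def objective(wx, wy, b):
        hinge = sum(max(0.0, 1.0 - lab * (wx * x + wy * y + b))
                    for x, y, lab in points)
        return 0.5 * (wx * wx + wy * wy) + c * hinge

    wx = wy = b = 0.0
    best = objective(wx, wy, b)
    for _ in range(iters):
        gx, gy, gb = wx, wy, 0.0  # gradient of the regularisation term
        for x, y, lab in points:
            if lab * (wx * x + wy * y + b) < 1.0:  # margin violated
                gx -= c * lab * x
                gy -= c * lab * y
                gb -= c * lab
        wx -= step * gx
        wy -= step * gy
        b -= step * gb
        best = min(best, objective(wx, wy, b))  # keep the best value seen
    return best


# Toy CCs: unit-square corners, separated by a wide and by a narrow gap.
left_cc = [(0, 0), (1, 0), (1, 1), (0, 1)]
wide_right_cc = [(3, 0), (4, 0), (4, 1), (3, 1)]
narrow_right_cc = [(1.5, 0), (2.5, 0), (2.5, 1), (1.5, 1)]

wide_value = svm_gap_objective(left_cc, wide_right_cc)
narrow_value = svm_gap_objective(left_cc, narrow_right_cc)
```

For these toy squares with C = 1 the exact optima are 0.5 for the wide gap and 3.5 for the narrow one, so a smaller objective value corresponds to a wider, cleaner separation. Because the value depends on all points near the margin rather than on a single closest pair, a thin protrusion such as a long crossbar of a “t” affects it less than it would affect a purely minimum-distance measure.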
Preliminary versions of the text-line and word segmentation
algorithms were submitted to the Handwriting Segmentation Contest
in ICDAR07, under the name ILSP-LWSeg, and achieved the best
results [9]. A short description of the participating algorithms was
published in our conference paper [10]. The major steps of the
proposed algorithms are illustrated in Fig. 1.
The organization of the rest of the paper is as follows: In Section 2,
we refer to recent related work. In Section 3, we describe in detail
the algorithm for text-line extraction from handwritten document