Pattern Recognition 43 (2010) 369-377

Handwritten document image segmentation into text lines and words

Vassilis Papavassiliou a,b, Themos Stafylakis a,b, Vassilis Katsouros a, George Carayannis a,b

a Institute for Language and Speech Processing of R.C. “Athena”, Artemidos 6 & Epidavrou, GR-151 25 Maroussi, Greece
b National Technical University of Athens, School of Electrical and Computer Engineers, 9, Iroon Polytechniou str, GR 157 80 Athens, Greece

ARTICLE INFO

Article history: Received 22 July 2008; received in revised form 23 February 2009; accepted 14 May 2009

Keywords: Handwritten text line segmentation; Handwritten word segmentation; Document image processing; Viterbi estimation; Support vector machines

ABSTRACT

Two novel approaches to extract text lines and words from handwritten documents are presented. The line segmentation algorithm is based on locating the optimal succession of text and gap areas within vertical zones by applying the Viterbi algorithm. Then, a text-line separator drawing technique is applied and finally the connected components are assigned to text lines. Word segmentation is based on a gap metric that exploits the objective function of a soft-margin linear SVM that separates successive connected components. The algorithms were tested on the benchmarking datasets of the ICDAR07 handwriting segmentation contest and outperformed the participating algorithms.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Document image segmentation into text lines and words is a critical stage towards unconstrained handwritten document recognition. Variation of the skew angle between text lines or along the same text line, the existence of overlapping or touching lines, variable character size and non-Manhattan layout are the main challenges of text line extraction.
Due to the high variability of writing styles, scripts, etc., methods that do not use any prior knowledge and adapt to the properties of the document image, such as the one proposed here, are expected to be more robust. Line extraction techniques may be categorized as projection-based, grouping, smearing and Hough-based [1].

Approaches based on global projections are very effective for machine-printed documents but cannot handle text lines with different skew angles. However, they can be applied for skew correction in documents with a constant skew angle [2]. Hough-based methods handle documents with variation in the skew angle between text lines, but are not very effective when the skew of a text line varies along its width [3]. Thus, we adopt piece-wise projections, which can deal with both types of skew angle variation [4,5]. On the other hand, piece-wise projections are sensitive to variation in character size within text lines and to significant gaps between successive words. These occurrences influence the effectiveness of smearing methods too [6]. In such cases, the results of two adjacent zones may be ambiguous, affecting the drawing of text-line separators along the document width.

Corresponding author at: Institute for Language and Speech Processing of R.C. “Athena”, Artemidos 6 & Epidavrou, GR-151 25 Maroussi, Greece. Tel.: +30 210 6875332; fax: +30 210 6854270. E-mail addresses: vpapa@ilsp.gr (V. Papavassiliou), themosst@ilsp.gr (T. Stafylakis), vsk@ilsp.gr (V. Katsouros), gcara@ilsp.gr (G. Carayannis). doi:10.1016/j.patcog.2009.05.007

To deal with these problems, we introduce a smooth version of the projection profiles to oversegment each zone into candidate text and gap regions. Then, we reclassify these regions by applying an HMM formulation that draws on statistics from the whole document page.
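
The idea of smoothing a zone's projection profile and then decoding a succession of text and gap areas with the Viterbi algorithm can be illustrated with a minimal sketch. It is not the formulation used in this paper: it labels individual pixel rows rather than candidate regions, and the function names (`smoothed_profile`, `viterbi_text_gap`), the Gaussian emission model, the moving-average window and the fixed transition probability `stay_prob` are all assumptions made only for illustration.

```python
import numpy as np

def smoothed_profile(zone, window=5):
    """Row-wise projection profile of a binary zone (1 = ink), moving-average smoothed."""
    profile = zone.sum(axis=1).astype(float)
    kernel = np.ones(window) / window
    return np.convolve(profile, kernel, mode="same")

def viterbi_text_gap(profile, stay_prob=0.9):
    """Label each row as gap (0) or text (1) with a two-state HMM decoded by Viterbi."""
    # Assumed emission model: Gaussians centred on the lowest (gap) and
    # highest (text) profile values, sharing one standard deviation.
    lo, hi = profile.min(), profile.max()
    means = np.array([lo, hi])
    sigma = max((hi - lo) / 2.0, 1e-6)
    log_emit = -0.5 * ((profile[:, None] - means[None, :]) / sigma) ** 2

    # Assumed "sticky" transitions: staying in a state is more likely than switching.
    log_trans = np.log(np.array([[stay_prob, 1.0 - stay_prob],
                                 [1.0 - stay_prob, stay_prob]]))

    n_rows = len(profile)
    score = np.zeros((n_rows, 2))
    back = np.zeros((n_rows, 2), dtype=int)
    score[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n_rows):
        # cand[i, j] is the best score of reaching state j at row t from state i.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]

    # Backtrack from the best final state to recover the optimal succession.
    states = np.zeros(n_rows, dtype=int)
    states[-1] = score[-1].argmax()
    for t in range(n_rows - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states
```

Calling `viterbi_text_gap(smoothed_profile(zone))` on a binary array holding one vertical zone returns one label per pixel row. The transition penalty discourages short spurious switches between text and gap, which is the property that makes a sequence model preferable to thresholding each row independently.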
Starting from the left and moving to the right, we combine separators of consecutive zones considering their proximity and the local foreground density.

Grouping approaches can handle complex layouts, but they fail to distinguish touching text lines [7]. In our approach, we deal with such cases by splitting the respective connected component (CC) and assigning the individual parts to the corresponding text lines.

In word segmentation, most of the proposed techniques consider a spatial measure of the gap between successive CCs and define a threshold to classify “within” and “between” word gaps [8]. These measures are sensitive to the shape of the CCs, e.g. a simple extension of the horizontal part of the character “t”. We introduce a novel gap measure which is more tolerant to such cases. The proposed measure results from the optimal value of the objective function of a soft-margin linear SVM that separates consecutive CCs.

Preliminary versions of the text-line and word segmentation algorithms were submitted to the Handwriting Segmentation Contest in ICDAR07, under the name ILSP-LWSeg, and achieved the best results [9]. A short description of the participating algorithms was published in our conference paper [10]. The major steps of the proposed algorithms are illustrated in Fig. 1.

The organization of the rest of the paper is as follows: In Section 2, we refer to recent related work. In Section 3, we describe in detail the algorithm for text-line extraction from handwritten document