TEXTLINE INFORMATION EXTRACTION FROM GRAYSCALE CAMERA-CAPTURED DOCUMENT IMAGES

Syed Saqib Bukhari and Thomas M. Breuel
Technical University of Kaiserslautern, Germany
bukhari@informatik.uni-kl.de, tmb@informatik.uni-kl.de

Faisal Shafait
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
faisal@iupr.dfki.de

ABSTRACT

Cameras offer flexible document imaging, but the captured images suffer from uneven shading and non-planar page shape. Camera-captured documents therefore need to be dewarped before being processed by traditional text recognition methods. Curled textline detection is an important step of dewarping. Previous approaches to curled textline detection use binarization as a pre-processing step, which can negatively affect the detection results under uneven shading. Furthermore, these approaches are sensitive to high degrees of curl and estimate x-line¹ and baseline pairs using regression, which may result in inaccurate estimation. We introduce a novel curled textline detection approach for grayscale document images. First, the textline structure is enhanced using matched filter bank smoothing, and the central lines of textlines are then detected using ridges. Then, x-line and baseline pairs are estimated by adapting active contours (snakes) over ridges. Unlike other approaches, our approach does not use binarization and is applied directly to grayscale images. We achieved 91% detection accuracy with good estimation of x-line and baseline pairs on the dataset of the CBDAR 2007 document image dewarping contest.

Index Terms: Curled Textline Detection, Grayscale Camera-Captured Document Image Segmentation.

1. INTRODUCTION

Compared to scanners, digital cameras are low-priced, portable, long-range, non-contact imaging devices. These features make cameras suitable for versatile OCR-related applications such as mobile OCR, digitizing thick books, digitizing fragile historical documents, etc.
But fundamental obstacles like uneven light shading, low resolution, motion blur, under- or over-exposure, non-planar page shape and perspective distortions pose new problems for traditional OCR systems. Therefore some pre-processing steps like "binarization" and "dewarping" are performed before text recognition. Curled textline detection and the estimation of x-line and baseline pairs are important steps for dewarping. Previous approaches to curled textline detection [1, 2, 3, 4, 5, 6, 7, 8] work on binarized images. These approaches can be divided into two categories: (a) heuristic search [1, 2, 3, 4, 5, 6] and (b) active contours (snakes) [7, 8].

¹ The x-line is the line following the top of the x-height of characters.

Heuristic search based approaches start from a single component and search for other components in a growing neighborhood region. Most of these approaches use rule-based criteria for textline searching. Our active contour (snake) based baby-snakes model [7] introduced the use of many small open-curve snakes, and the snakelets model [8] introduced many small, growing, coupled snake pairs for curled textline detection from binarized images.

Both heuristic search and active contour (snake) based approaches depend upon binarization [9] of the camera-captured document image before textline detection. In the presence of challenging conditions like uneven shading, low resolution, motion blur and under- or over-exposure, binarization may give poor results. Therefore, binarization can negatively affect the textline detection results.

Moreover, most of these approaches estimate x-line and baseline pairs using regression over the top and bottom points of the binarized connected components of detected textlines, which may result in inaccurate estimation.

We have introduced a method for textline detection directly from grayscale document images in [10].
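As a rough illustration of the binarization problem described above, the following sketch shows how a single global threshold mislabels background as text once shading becomes uneven. It is not part of the method in [10] or of any cited approach: the synthetic page, the helper names `make_shaded_page` and `global_threshold`, the mean-based threshold and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def make_shaded_page(height=60, width=200, shading_min=0.3):
    """Build a synthetic grayscale page (hypothetical test data).

    Two dark horizontal stripes stand in for textlines on a bright
    background; the illumination falls off linearly from 1.0 at the
    left edge to `shading_min` at the right edge.
    """
    page = np.full((height, width), 220.0)
    text_mask = np.zeros((height, width), dtype=bool)
    text_mask[10:16, 10:190] = True
    text_mask[30:36, 10:190] = True
    page[text_mask] = 60.0
    shading = np.linspace(1.0, shading_min, width)
    return page * shading[None, :], text_mask

def global_threshold(image):
    """Label every pixel darker than the image mean as text.

    This is a deliberately simple stand-in for global binarization.
    """
    return image < image.mean()

# Evenly lit page: the global threshold separates text and background.
flat_page, flat_mask = make_shaded_page(shading_min=1.0)
flat_errors = np.count_nonzero(global_threshold(flat_page) != flat_mask)

# Unevenly lit page: dark background on the right is labelled as text.
shaded_page, shaded_mask = make_shaded_page(shading_min=0.3)
shaded_errors = np.count_nonzero(global_threshold(shaded_page) != shaded_mask)

print("mislabelled pixels, even lighting:  ", flat_errors)
print("mislabelled pixels, uneven lighting:", shaded_errors)
```

Adaptive thresholding reduces this effect but introduces parameters of its own, which is one motivation for working on the grayscale intensities directly.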
The work in this paper extends our previous work [10] with the additional estimation of x-line and baseline pairs. Here, our method starts by enhancing the grayscale curled textline structure using multi-oriented, multi-scale anisotropic Gaussian smoothing based on the matched filter bank approach [11]. Then ridges [12, 13] are detected from the smoothed image, where the ridges define the unbroken central line structure that passes through the centers of the textlines. Then we model active contours (snakes) [14] over the ridges to estimate the pairs of x-lines and baselines.

We make the following contributions in this paper. The method presented here works directly on grayscale intensities of camera-captured document images and therefore inde-