A learning framework for the optimization and automation of document binarization methods ☆

Mohamed Cheriet, Reza Farrahi Moghaddam ⇑, Rachid Hedjam

Synchromedia Laboratory for Multimedia Communication in Telepresence, École de technologie supérieure, Montreal, QC, Canada H3C 1K3

Article info

Article history: Received 24 October 2011; Accepted 19 November 2012; Available online 29 November 2012

Keywords: Document image processing; Binarization; Parametric methods; Learning machines; Multi-level maps

Abstract

Almost all binarization methods have a few parameters that require setting. However, they do not usually achieve their upper-bound performance unless the parameters are individually set and optimized for each input document image. In this work, a learning framework for the optimization of binarization methods is introduced, which is designed to determine the optimal parameter values for a document image. The framework, which works with any binarization method, has a standard structure and performs three main steps: (i) extracts features, (ii) estimates optimal parameters, and (iii) learns the relationship between features and optimal parameters. First, an approach is proposed to generate numerical feature vectors from 2D data. The statistics of various maps are extracted and then combined into a final feature vector in a nonlinear way. The optimal behavior is learned using support vector regression (SVR). Although the framework works with any binarization method, two methods are considered as typical examples in this work: the grid-based Sauvola method, and Lu’s method, which placed first in the DIBCO’09 contest. The experiments are performed on the DIBCO’09 and H-DIBCO’10 datasets, and on combinations of these datasets, with promising results.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

The binarization and segmentation of document images is an important step in digitization workflows [1,2].
A huge number of document images are being produced by the growing movement toward the digitization of old and historically important manuscripts around the world [3]. Although various models and methods have been developed to achieve this goal, binarization is still an open problem for the Document Image Analysis and Retrieval (DIAR) community [4–11]. Old manuscripts and documents usually suffer from severe physical degradation of various types, and each type requires complex modeling in order to restore the documents and make them understandable [12]. Degradation ranges from bleed-through, faded ink, and deteriorated paper to stains, and occurs mostly because of physical phenomena that affect the manuscripts over time [12–14] (see Fig. 1). Although there is a need for a generic and unified model that can address all these forms of degradation, it is very difficult, if not impossible, to develop a method that is capable of eliminating all types of degradation.

There are two options that could be considered in order to remove this modeling barrier. One is to fit a generic and parametric model to the available data and learn the behavior of the optimal values of its parameters. The other is to attempt to categorize document images and then assign an optimal model, or an optimized instance of a model, to each category [16]. The latter approach can be built on top of a classification and regression tree (CART) [17]; this classifier is usually preferred over other decision tree classifiers because it can handle both categorical and continuous features [18]. Both parameter optimization and categorization have the ability to reduce the subjective aspect of modeling and, at the same time, avoid the extra complexity associated with designing more comprehensive models. Because the learning approach requires only the ability to optimize a model, it is the focus here.
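As a rough illustration of the first option, the sketch below finds the best value of one parameter of a parametric binarization method for a single image by exhaustive search against a ground-truth map. It makes several assumptions that are not taken from this paper: it uses the classic Sauvola rule, T = m(1 + k(s/R - 1)), rather than the grid-based variant [6]; it scores each candidate with the F-measure; and the function names, the tiny test image, and the candidate list are hypothetical. The per-image optimum found this way is the kind of target value that a regressor such as SVR could then be trained to predict from image features; that learning step is not shown.

```python
import math


def sauvola_binarize(img, k, window=3, R=128.0):
    """Classic Sauvola rule (assumed here): a pixel is text (1) if
    value <= m * (1 + k * (s / R - 1)), with m and s the local mean and
    standard deviation. R = 128 is the conventional choice for 8-bit images."""
    h, w = len(img), len(img[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Local neighbourhood, clipped at the image border.
            patch = [img[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(patch) / len(patch)
            s = math.sqrt(sum((p - m) ** 2 for p in patch) / len(patch))
            threshold = m * (1.0 + k * (s / R - 1.0))
            out[y][x] = 1 if img[y][x] <= threshold else 0
    return out


def f_measure(pred, truth):
    """F-measure of the text class (1 = text, 0 = background)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_k(img, truth, candidates):
    """Exhaustive search: return the candidate k with the highest F-measure."""
    scores = {k: f_measure(sauvola_binarize(img, k), truth) for k in candidates}
    k_opt = max(scores, key=scores.get)
    return k_opt, scores[k_opt]


# Hypothetical 5x5 image: one dark stroke and one faded stroke on bright paper.
img = [
    [200, 200, 200, 200, 200],
    [200,  60,  60,  60, 200],
    [200, 200, 200, 200, 200],
    [200, 150, 150, 150, 200],
    [200, 200, 200, 200, 200],
]
truth = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

k_opt, score = best_k(img, truth, [0.05, 0.1, 0.2, 0.3, 0.5])
print("best k:", k_opt, "F-measure:", score)
```

Exhaustive search is used only because it is easy to read; any optimizer that can evaluate the method against ground truth would serve the same role.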
☆ This paper has been recommended for acceptance by Daniel Lopresti.
⇑ Corresponding author. Fax: +1 (514) 396 8595.
E-mail addresses: mohamed.cheriet@etsmtl.ca (M. Cheriet), imriss@ieee.org (R. Farrahi Moghaddam).
Computer Vision and Image Understanding 117 (2013) 269–280. Journal homepage: www.elsevier.com/locate/cviu
1077-3142/$ - see front matter. http://dx.doi.org/10.1016/j.cviu.2012.11.003

The aim of this paper is to introduce a generic framework for learning the optimal parameter values of any binarization method. This framework could have many applications. One of them is the conversion of a parametric method into an automatic one, a capability that could revive interest in many simple, high-performance methods which practitioners dislike or have ignored. Sauvola’s method is one of these, and we have shown that one of its variations, the grid-based Sauvola method [6], can achieve a performance that competes well against other state-of-the-art methods, without the need for any additional preprocessing or postprocessing steps. It is important to note, however, that learning optimal parameter values is not the sole purpose of this framework. Even a state-of-the-art method can be tuned using the