Text Binarization in Color Documents

Efthimios Badekas, Nikos Nikolaou, Nikos Papamarkos

Department of Electrical and Computer Engineering, Image Processing and Multimedia Laboratory, Democritus University of Thrace, 67100 Xanthi, Greece

Received 18 July 2006; revised 14 February 2007; accepted 14 March 2007

ABSTRACT: This article presents a new method for the binarization of color document images. Initially, the colors of the document image are reduced to a small number using a new color reduction technique. Specifically, this technique estimates the dominant colors and then assigns the original image colors to them so that the background and text components become uniform. Each dominant color defines a color plane in which the connected components (CCs) are extracted. Next, in each color plane a CC filtering procedure is applied, followed by a grouping procedure. At the end of this stage, blocks of CCs are constructed, which are then redefined by obtaining the direction of connection (DOC) property for each CC. Using the DOC property, the blocks of CCs are classified as text or nontext. The identified text blocks are then binarized using suitable binarization techniques, and the rest of the pixels are treated as background. The final result is a binary image that always contains black characters on a white background, independently of the original colors of each text block. The proposed document binarization approach can also be used for the binarization of noisy color (or grayscale) document images. Several experiments that confirm the effectiveness of the proposed technique are presented. © 2007 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 16, 262–274, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ima.20092

Key words: color quantization; text localization; binarization; segmentation; document processing

I.
INTRODUCTION

Interest in exploiting text information in images and video has grown notably in recent years. Text provides a powerful description of image content, is comparatively easy to distinguish from other image features, and carries extremely important information, which explains the research interest it attracts. Content-based image retrieval, OCR, page segmentation, license plate location, address block location, and compression are some examples of applications based on text information extraction from various types of images.

Text identification methods fall mainly into two categories: texture-based techniques (Jain and Bhattacharjee, 1992; Jain and Zhong, 1996) and connected component (CC) based techniques (Fletcher and Kasturi, 1988; O’Gorman, 1993; Chen and Chen, 1998; Sobottka et al., 2000; Hase et al., 2001; Strouthopoulos et al., 2002). Some hybrid approaches have also been reported in the literature (Zhong et al., 1995; Jung and Han, 2004). Texture-based techniques are time consuming and impose character size restrictions; their main advantage is their ability to detect text in low-resolution images. CC-based techniques, on the other hand, are fast and exploit the fact that characters are segmented.

Most approaches to text identification address gray or binary document images. Only recently have some techniques been proposed for text identification and extraction in color documents. Strouthopoulos et al. (2002) proposed a method for text extraction in complex color documents. It is based on a combination of an adaptive color reduction technique and a page layout analysis approach, which uses a Kohonen SOM neural network to identify text blocks. Zhong et al. (1995) presented a hybrid system for text localization in complex color images. In this system, a color segmentation stage is performed by identifying local maxima in the color histogram.
Heuristic filters are applied to the CCs of the same color plane, and noncharacter components are removed. A second approach based on local spatial variance, which locates text lines, is also proposed. Chen and Chen (1998) proposed a method for text block localization in cover images of color technical journals. Initially, the colors of the image are reduced using an algorithm based on the YIQ color model. Strong edges are then isolated with the Sobel operator and a binarization process. Primary blocks are detected with the Run Length Smearing Algorithm and finally classified using nine features and fuzzy rules. Sobottka et al. (2000) proposed an approach to extract text from color documents and journal covers. The image is quantized with an unsupervised clustering method, and the text regions are then identified by combining a top-down and a bottom-up technique. An algorithm for character string extraction from color documents is presented by Hase et al. (2001). First, the number of representative colors of a document is determined. Potential character strings are then extracted from each color plane using multistage relaxation. When all extracted elements are superimposed, a strategy that uses the likelihood of a character string and a conflict resolution step is followed to produce the final result. A detailed review of text information extraction techniques is presented by Jung et al. (2004).

Correspondence to: Prof. Nikos Papamarkos; e-mail: papamark@ee.duth.gr
Contract grant sponsor: TEI Serron
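
Several of the methods surveyed above, as well as the approach summarized in the abstract, share one step: after color reduction, each remaining color defines a color plane, and CCs are extracted separately in each plane. The following is a minimal illustrative sketch of that step only, not the implementation of any cited work. It assumes that color reduction has already produced a map of integer color labels, it assumes 8-connectivity, and the names `color_planes` and `connected_components` are hypothetical.

```python
from collections import deque


def color_planes(labels):
    """Split a 2D map of color labels into one binary plane per label.

    `labels` is assumed to be a rectangular list of lists holding the
    color index assigned to each pixel by a prior color reduction step.
    """
    height = len(labels)
    width = len(labels[0]) if height else 0
    planes = {}
    for y in range(height):
        for x in range(width):
            color = labels[y][x]
            if color not in planes:
                planes[color] = [[False] * width for _ in range(height)]
            planes[color][y][x] = True
    return planes


def connected_components(plane):
    """Return the 8-connected components of a binary plane.

    Each component is a list of (row, col) pixel coordinates, found
    with a breadth-first flood fill. 8-connectivity is an assumption.
    """
    height = len(plane)
    width = len(plane[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not plane[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and plane[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


# Toy label map: 0 = background color, 1 = two separate strokes, 2 = a speck.
labels = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 2],
]
planes = color_planes(labels)
cc_counts = {c: len(connected_components(p)) for c, p in planes.items()}
```

In this toy example the plane of color 1 yields two CCs of two pixels each, which a later filtering and grouping stage could examine as character candidates, while the single-pixel CC in the plane of color 2 is the kind of component a size-based filter might discard.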