800 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 3, MARCH 2011
A Hybrid Approach to Detect and Localize Texts in
Natural Scene Images
Yi-Feng Pan, Xinwen Hou, and Cheng-Lin Liu, Senior Member, IEEE
Abstract—Text detection and localization in natural scene images is important for content-based image analysis. This problem is challenging due to the complex background, non-uniform illumination, and variations of text font, size and line orientation. In this paper, we present a hybrid approach to robustly detect and localize texts in natural scene images. A text region detector is designed to estimate the text existing confidence and scale information in an image pyramid, which help segment candidate text components by local binarization. To efficiently filter out the non-text components, a conditional random field (CRF) model considering unary component properties and binary contextual component relationships with supervised parameter learning is proposed. Finally, text components are grouped into text lines/words with a learning-based energy minimization method. Since all three stages are learning-based, there are very few parameters requiring manual tuning. Experimental results evaluated on the ICDAR 2005 competition dataset show that our approach yields higher precision and recall performance compared with state-of-the-art methods. We also evaluated our approach on a multilingual image dataset with promising results.
Index Terms—Conditional random field (CRF), connected component analysis (CCA), text detection, text localization.
I. INTRODUCTION
WITH the increasing use of digital image capturing devices, such as digital cameras, mobile phones and PDAs, content-based image analysis techniques have received intensive attention in recent years. Among all the contents in images, text information has inspired great interest, since it can be easily understood by both humans and computers, and finds wide applications such as license plate reading, sign detection and translation, mobile text recognition, content-based web image search, and so on [19]. Jung et al. [13] define an integrated image text information extraction (TIE) system (shown in Fig. 1) with four stages: text detection, text localization, text extraction and enhancement, and recognition. Among these stages, text detection and localization, bounded by the dashed line in Fig. 1, are critical to the overall system performance. In the
last decade, many methods (as surveyed in [13], [19], [47]) have been proposed to address image and video text detection and localization problems, and some of them have achieved impressive results for specific applications. However, fast and accurate text detection and localization in natural scene images is still a challenge due to the variations of text font, size, color and alignment orientation, and is often affected by complex background, illumination changes, image distortion and degradation.

Fig. 1. Architecture of a TIE system [13].

Manuscript received July 06, 2009; revised March 31, 2010; accepted August 07, 2010. Date of publication September 02, 2010; date of current version February 18, 2011. This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 60775004 and Grant 60825301. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Margaret Cheney.

The authors are with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China (e-mail: yfpan@nlpr.ia.ac.cn; xwhou@nlpr.ia.ac.cn; liucl@nlpr.ia.ac.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2010.2070803
The existing methods of text detection and localization can
be roughly categorized into two groups: region-based and
connected component (CC)-based. Region-based methods
attempt to detect and localize text regions by texture analysis.
Generally, a feature vector extracted from each local region is fed into a classifier to estimate the likelihood of text. Neighboring text regions are then merged to generate text blocks. Because text regions have textural properties distinct from those of non-text regions, these methods can detect and localize texts accurately even in noisy images. On the other hand, CC-based methods directly segment candidate text components by edge detection or color clustering. The non-text components are then pruned with heuristic rules or classifiers. Since the number of segmented candidate components is relatively small, CC-based methods have lower computational cost, and the located text components can be directly used for recognition.
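The CC-based pipeline described above (segment candidate components, then prune non-text ones with heuristic rules) can be sketched as follows. This is an illustrative toy, not the authors' method: the global threshold, 8-connectivity, and the area/aspect-ratio limits are assumptions chosen for clarity, and a practical system would use a more robust segmentation and learned classifiers instead of hand-set rules.

```python
# Toy CC-based text candidate extraction: binarize, find connected
# components by BFS flood fill, prune components with implausible
# area or aspect ratio. Pure Python for self-containment.
from collections import deque

def binarize(gray, threshold=128):
    """Dark-on-light text: mark pixels below the threshold as foreground."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def connected_components(binary):
    """8-connected components via BFS; returns a list of pixel lists."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(comp)
    return components

def prune(components, min_area=2, aspect_range=(0.1, 10.0)):
    """Heuristic filter: drop components too small or oddly shaped,
    returning bounding boxes (x, y, width, height) of survivors."""
    kept = []
    for comp in components:
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
        if len(comp) >= min_area and aspect_range[0] <= bw / bh <= aspect_range[1]:
            kept.append((min(xs), min(ys), bw, bh))
    return kept
```

Analyzing components in isolation like this is exactly where such heuristics break down: isolated non-text blobs can look character-like, which motivates the contextual (CRF-based) filtering proposed in this paper.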
Although the existing methods have reported promising
localization performance, there still remain several problems to
solve. For region-based methods, the speed is relatively slow and the performance is sensitive to text alignment orientation. On the other hand, CC-based methods cannot segment text components accurately without prior knowledge of text position and scale. Moreover, designing a fast and reliable connected component analyzer is difficult, since many non-text components are easily confused with text when analyzed individually.
1057-7149/$26.00 © 2011 IEEE