800 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 3, MARCH 2011
A Hybrid Approach to Detect and Localize Texts in
Natural Scene Images
Yi-Feng Pan, Xinwen Hou, and Cheng-Lin Liu, Senior Member, IEEE
Abstract—Text detection and localization in natural scene images is important for content-based image analysis. This problem is challenging due to the complex background, non-uniform illumination, and variations of text font, size and line orientation. In this paper, we present a hybrid approach to robustly detect and localize texts in natural scene images. A text region detector is designed to estimate the text existing confidence and scale information in an image pyramid, which help segment candidate text components by local binarization. To efficiently filter out the non-text components, a conditional random field (CRF) model considering unary component properties and binary contextual component relationships with supervised parameter learning is proposed. Finally, text components are grouped into text lines/words with a learning-based energy minimization method. Since all three stages are learning-based, there are very few parameters requiring manual tuning. Experimental results evaluated on the ICDAR 2005 competition dataset show that our approach yields higher precision and recall performance compared with state-of-the-art methods. We also evaluated our approach on a multilingual image dataset with promising results.
Index Terms—Conditional random field (CRF), connected component analysis (CCA), text detection, text localization.
I. INTRODUCTION
WITH the increasing use of digital image capturing devices, such as digital cameras, mobile phones and PDAs, content-based image analysis techniques have received intensive attention in recent years. Among all the contents in images, text information has inspired great interest, since it can be easily understood by both humans and computers, and finds wide applications such as license plate reading, sign detection and translation, mobile text recognition, content-based web image search, and so on [19]. Jung et al. [13] define an integrated image text information extraction (TIE) system (shown in Fig. 1) with four stages: text detection, text localization, text extraction and enhancement, and recognition. Among these stages, text detection and localization, bounded by the dashed line in Fig. 1, are critical to the overall system performance. In the
last decade, many methods (as surveyed in [13], [19], [47]) have been proposed to address image and video text detection and localization problems, and some of them have achieved impressive results for specific applications. However, fast and accurate text detection and localization in natural scene images is still a challenge due to the variations of text font, size, color and alignment orientation, and is often affected by complex background, illumination changes, image distortion and degradation.

Fig. 1. Architecture of a TIE system [13].

Manuscript received July 06, 2009; revised March 31, 2010; accepted August 07, 2010. Date of publication September 02, 2010; date of current version February 18, 2011. This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 60775004 and Grant 60825301. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Margaret Cheney.

The authors are with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China (e-mail: yfpan@nlpr.ia.ac.cn; xwhou@nlpr.ia.ac.cn; liucl@nlpr.ia.ac.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2010.2070803
The existing methods of text detection and localization can
be roughly categorized into two groups: region-based and
connected component (CC)-based. Region-based methods
attempt to detect and localize text regions by texture analysis.
Generally, a feature vector extracted from each local region is fed into a classifier to estimate the likelihood of text. Neighboring text regions are then merged to generate text blocks. Because text regions have textural properties distinct from those of non-text regions, these methods can detect and localize texts accurately even in noisy images. On the other hand, CC-based methods directly segment candidate text components by edge detection or color clustering. The non-text components are then pruned with heuristic rules or classifiers. Since the number of segmented candidate components is relatively small, CC-based methods have lower computational cost, and the located text components can be directly used for recognition.
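The CC-based pipeline described above (segment candidate components, then prune non-text ones with heuristic rules) can be sketched as follows. This is an illustrative toy, not the authors' method: the global threshold, 8-connectivity, and the area/aspect-ratio limits are assumptions chosen for clarity, and a practical system would use a more robust segmentation and learned classifiers instead of hand-set rules.

```python
# Toy CC-based text candidate extraction: binarize, find connected
# components by BFS flood fill, prune components with implausible
# area or aspect ratio. Pure Python for self-containment.
from collections import deque

def binarize(gray, threshold=128):
    """Dark-on-light text: mark pixels below the threshold as foreground."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def connected_components(binary):
    """8-connected components via BFS; returns a list of pixel lists."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(comp)
    return components

def prune(components, min_area=2, aspect_range=(0.1, 10.0)):
    """Heuristic filter: drop components too small or oddly shaped,
    returning bounding boxes (x, y, width, height) of survivors."""
    kept = []
    for comp in components:
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
        if len(comp) >= min_area and aspect_range[0] <= bw / bh <= aspect_range[1]:
            kept.append((min(xs), min(ys), bw, bh))
    return kept
```

Analyzing components in isolation like this is exactly where such heuristics break down: isolated non-text blobs can look character-like, which motivates the contextual (CRF-based) filtering proposed in this paper.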
Although the existing methods have reported promising
localization performance, there still remain several problems to
solve. For region-based methods, the speed is relatively slow and the performance is sensitive to text alignment orientation. On the other hand, CC-based methods cannot segment text components accurately without prior knowledge of text position and scale. Moreover, designing a fast and reliable connected component analyzer is difficult, since many non-text components are easily confused with text when analyzed individually.
1057-7149/$26.00 © 2011 IEEE