Character Extraction in Web Image for Text Recognition
Bolan Su (1,2,*), Shijian Lu (2,+), Trung Quy Phan (1) and Chew Lim Tan (1,*)
(1) Department of Computer Science, School of Computing, National University of Singapore
Computing 1, 13 Computing Drive, Singapore 117417
(2) Department of Computer Vision and Image Understanding, Institute for Infocomm Research
1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
(*) {subolan,phanquyt,tancl}@comp.nus.edu.sg, (+) slu@i2r.a-star.edu.sg
Abstract
Images containing text are frequently used on the Internet for
different purposes. Automatic recognition of text from
web images plays an important role in the extraction and
retrieval of web information. However, web images
usually have low resolution and contain artifacts and special
effects, which makes word recognition a challenging task
even after the text has been localized. In this paper, we
propose a robust text recognition technique to efficiently
convert web images into text format. The proposed
technique first uses L0 norm smoothing to
increase the edge contrast of the input web image. The
image is then binarized on each color channel, and
connected component analysis follows to identify
possible character components. Finally, the character
candidates are recognized by an OCR engine after
skew correction. Extensive experiments have been conducted
on the latest ICDAR 2011 robust reading competition
dataset for born-digital text. The experimental
results show the superior performance of our proposed
technique.
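As a rough illustration of the two middle stages named in the abstract (per-channel binarization followed by connected component analysis), the sketch below may help. It is not the authors' implementation: the abstract does not say which binarization method is used, so Otsu's threshold is swapped in as an assumption, bright text on a dark background is assumed, and L0 smoothing, skew correction and OCR are left out. All function names and the `demo_image` helper are hypothetical.

```python
from collections import deque


def otsu_threshold(values):
    """Otsu threshold for 8-bit intensities; pixels > t count as foreground."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    # Default of 255 means a constant channel produces no foreground at all.
    best_t, best_var = 255, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize_channel(image, channel):
    """Binarize one color channel of an image given as rows of (r, g, b) tuples."""
    values = [px[channel] for row in image for px in row]
    t = otsu_threshold(values)
    return [[1 if px[channel] > t else 0 for px in row] for row in image]


def connected_components(mask):
    """8-connected components as (size, (top, left, bottom, right)) tuples."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                size += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append((size, (top, left, bottom, right)))
    return comps


def demo_image():
    """Hypothetical 4x6 test image: two red blobs on a dark background."""
    bg, fg = (10, 10, 10), (200, 10, 10)
    layout = ["......", ".##..#", ".#...#", "......"]
    return [[fg if ch == "#" else bg for ch in row] for row in layout]


# Candidate character components found in the red channel of the demo image.
candidates = connected_components(binarize_channel(demo_image(), 0))
```

In a fuller pipeline, the components from all three channels would presumably be filtered by size and shape before being passed to OCR; that filtering is not shown here because the abstract gives no criteria for it.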
1. Introduction
The number of images on the Internet has increased tremendously
in recent years. Many of these images contain text
information that cannot be found elsewhere on the
web pages [2]. Recognizing the textual information
within web images is very helpful for a better
understanding of the contents of web pages. As images
with embedded text are used on the Internet for different
purposes, text recognition in web images can be applied
in different kinds of applications, such as web page
indexing and retrieval and web page content filtering [3]. It will
become even more important as the textual information
within web images plays a growing role
with future network development.
Figure 1. Some low quality web image examples, panels (a)-(d).
Many techniques have been proposed for text extraction
and recognition in videos and natural scene
images [6, 10], but far fewer efforts have been reported
on the recognition of text within web images
[3, 5]. Compared with other images, web images
are often more susceptible to certain specific image
degradations, including low resolution and small size for
faster network transmission, computer-generated
character artifacts, and special effects added to make
images more attractive. As a result, the techniques developed
for video and natural scene images often fail to produce
satisfactory results when they are directly applied
to web images.
The latest Robust Reading Competition in Born-Digital
Images (Web and Email), held under the framework
of the International Conference on Document Analysis
and Recognition (ICDAR) 2011 [3], shows current
research progress in this area. The contest consists
of three tasks, i.e., text localization, text segmentation
and word recognition in web images. The third task,
word recognition, aims to convert the textual information from
bitmap format to ASCII format where the text regions
21st International Conference on Pattern Recognition (ICPR 2012)
November 11-15, 2012. Tsukuba, Japan
978-4-9906441-1-6 ©2012 IAPR 3042