Character Extraction in Web Image for Text Recognition

Bolan Su 1,2,*, Shijian Lu 2,+, Trung Quy Phan 1 and Chew Lim Tan 1,*

1 Department of Computer Science, School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417
2 Department of Computer Vision and Image Understanding, Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
* {subolan,phanquyt,tancl}@comp.nus.edu.sg, + slu@i2r.a-star.edu.sg

Abstract

Images with text are frequently used on the Internet for different purposes. Automatic recognition of text from web images plays an important role in the extraction and retrieval of web information. However, web images are usually of low resolution and contain artifacts and special effects, which makes word recognition a challenging task even after the text has been localized. In this paper, we propose a robust text recognition technique to efficiently convert web images into text format. The proposed technique first uses L0 norm smoothing to increase the edge contrast of the input web images. The images are then binarized on each color channel. Connected component analysis follows to identify the possible character components. Finally, the character candidates are recognized by the OCR engine after skew correction. Extensive experiments have been conducted on the latest ICDAR 2011 robust reading competition dataset for born-digital text. The experimental results show the superior performance of our proposed technique.

1. Introduction

The number of images on the Internet has increased tremendously in recent years. Many of these images contain text information that cannot be found elsewhere in the web pages [2]. Recognizing the textual information within web images is very helpful for a better understanding of the contents of web pages.
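To make two stages of the pipeline summarized in the abstract more concrete (per-channel binarization followed by connected component analysis), a minimal Python sketch is given here. It is an illustration under stated assumptions, not the implementation used in this paper: Otsu's method is swapped in as the thresholding rule, text is assumed to be brighter than its background, components are 4-connected, and all function names are hypothetical. L0 norm smoothing, skew correction and the OCR step are omitted.

```python
from collections import deque


def otsu_threshold(channel):
    """Otsu threshold of a 2-D list of 8-bit intensities (assumed thresholding rule)."""
    hist = [0] * 256
    for row in channel:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with intensity <= t
    sum0 = 0  # sum of those intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(channel, threshold):
    """Mark pixels brighter than the threshold as foreground (assumes bright text)."""
    return [[1 if v > threshold else 0 for v in row] for row in channel]


def connected_components(mask):
    """Bounding boxes (top, left, bottom, right) of 4-connected foreground regions."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def candidate_boxes_per_channel(image):
    """image: dict mapping a channel name to a 2-D list of intensities."""
    return {name: connected_components(binarize(ch, otsu_threshold(ch)))
            for name, ch in image.items()}
```

Calling `candidate_boxes_per_channel({"r": red, "g": green, "b": blue})` on three 2-D intensity lists returns one list of candidate bounding boxes per channel. How candidates from different channels are filtered or merged is not covered by this sketch.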
As these images with embedded text are used on the Internet for different purposes, text recognition in web images can be applied to different kinds of applications, such as web page indexing and retrieval and web page content filtering [3]. It will become even more important as the textual information within web images contributes more and more with future network development.

Figure 1. Some low-quality web image examples, panels (a)-(d).

Many techniques have been proposed for text extraction and recognition in videos and natural scene images [6, 10], but far fewer efforts have been reported for the recognition of text within web images [3, 5]. Compared with other images, web images are often more susceptible to certain specific image degradations, including low resolution and small size for faster network transmission, computer-generated character artifacts, and special effects added to make the images more attractive. As a result, the techniques developed for video and natural scene images often fail to produce satisfactory results when they are applied directly to web images.

The latest Robust Reading Competition in Born-Digital Images (Web and Email), held under the framework of the International Conference on Document Analysis and Recognition (ICDAR) 2011 [3], shows the current research progress in this area. The contest consists of three tasks, i.e. text localization, text segmentation and word recognition in web images. The third task, word recognition, aims to convert the textual information from bitmap format to ASCII format where the text regions

21st International Conference on Pattern Recognition (ICPR 2012), November 11-15, 2012, Tsukuba, Japan. 978-4-9906441-1-6 ©2012 IAPR