A Robust Approach for Recognition of Text Embedded in Natural Scenes

Jing Zhang 1, Xilin Chen 2, Andreas Hanneman 2, Jie Yang 2, Alex Waibel 2
1 Mobile Technologies, LLC
2 Interactive Systems Labs, School of Computer Science, Carnegie Mellon University
jingzhang@computer.org, {xlchen, Hanneman, yang+, ahw}@cs.cmu.edu

Abstract

In this paper, we propose a robust approach for recognition of text embedded in natural scenes. Instead of using binary information as most other OCR systems do, we extract features directly from the intensity of an image. We use a local intensity normalization method to handle lighting variations effectively, then employ a Gabor transform to obtain local features and use LDA for feature selection and classification. The proposed method has been applied to a Chinese sign recognition task. The system can recognize a vocabulary of the 3755 Level 1 Chinese characters in the Chinese national standard character set GB2312-80 in various print fonts. We tested the system on 1630 test characters in sign images captured from natural scenes, and the recognition accuracy is 92.46%. We have integrated the system into our automatic Chinese sign translation system.

1. Introduction

We encounter large amounts of information embedded in natural scenes in our daily lives. Signs are good examples of objects in natural environments with high information content. A sign is an object that indicates the presence of a fact; it can be a displayed structure bearing letters or symbols, used to identify or advertise a place of business. Signs are everywhere in our lives. They make our lives easier when we are familiar with them, but they may pose problems or even danger when we are not. For example, a foreign tourist might not be able to understand a sign that specifies warnings or hazards. Automatic sign translation, in conjunction with spoken language translation, can help international tourists overcome these barriers.
A successful automatic sign translation system relies on three key technologies: sign detection, OCR (optical character recognition), and machine translation. OCR is one of the most successful areas in the pattern recognition field. For clearly segmented printed materials, state-of-the-art techniques offer virtually error-free OCR for several important alphabetic systems and their variants. However, error rates of OCR systems are still far from those of human readers in many applications, such as video OCR and license plate OCR. The gap between the two is exacerbated when the quality of the image is compromised, e.g., when the input comes from a video camera. Video OCR, which recognizes text from a video stream, was motivated by digital library and visual information retrieval tasks. Many video images contain text. This text can be part of the scene, or may be computer-generated text overlaid on the image (e.g., captions in broadcast news programs). Such text, especially subtitles in video, provides useful information for video indexing. In a video OCR task, text in the foreground is usually uniformly distributed, and its resolution can be enhanced using inter-frame information [7, 8, 11].

Compared with video OCR tasks, recognition of text embedded in natural scenes faces more challenges: text can vary in font, size, orientation, and position; it can be blurred by motion, lie in shadow, be occluded by other objects, or be distorted by slant and tilt. The image from a camera in an unconstrained environment can be very noisy.

Figure 1. The flow chart of the system. (Training: character database → intensity & size normalization → Gabor transformation → LDA feature selection. Recognition: input image → intensity & size normalization → Gabor transformation → feature transformation → classifier → recognition results.)

In this paper, we propose a robust approach for recognition of text embedded in natural scenes.
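The normalization, Gabor transformation, and LDA stages shown in Figure 1 can be sketched in a few lines of code. The following is a minimal, self-contained illustration, not the paper's implementation: it uses zero-mean/unit-variance scaling as a stand-in for the local intensity normalization, a small bank of real-valued Gabor filters with mean-absolute-response pooling, and a two-class Fisher discriminant in place of the full multi-class LDA. All function names, parameter values, and the toy two-class data are illustrative assumptions.

```python
import numpy as np

def intensity_normalize(img, eps=1e-6):
    # Zero-mean, unit-variance scaling: a simple stand-in for the
    # paper's local intensity normalization step.
    img = img.astype(np.float64)
    return (img - img.mean()) / (img.std() + eps)

def gabor_kernel(size, theta, freq, sigma):
    # Real (even) part of a 2-D Gabor filter at orientation theta.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * freq * x_theta)

def correlate_valid(img, kern):
    # Direct 'valid'-mode correlation (avoids a SciPy dependency).
    kh, kw = kern.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

def gabor_features(img, size=7, n_orient=4, freq=0.25):
    # Normalize, filter with an oriented Gabor bank, and pool the
    # mean absolute response of each filter into a feature vector.
    img = intensity_normalize(img)
    feats = []
    for k in range(n_orient):
        kern = gabor_kernel(size, np.pi * k / n_orient, freq, sigma=size / 4.0)
        feats.append(np.abs(correlate_valid(img, kern)).mean())
    return np.array(feats)

def fisher_lda(X0, X1):
    # Two-class Fisher discriminant: w = Sw^{-1} (m1 - m0), with a
    # midpoint threshold between the projected class means.
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0.T, bias=True) * len(X0) + np.cov(X1.T, bias=True) * len(X1)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m0)
    thresh = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
    return w, thresh

# Toy demo: distinguish horizontal-stroke from vertical-stroke "characters".
rng = np.random.default_rng(0)

def make_sample(vertical):
    img = rng.normal(0.0, 0.1, (16, 16))
    if vertical:
        img[:, 7:9] += 1.0   # bright vertical stroke
    else:
        img[7:9, :] += 1.0   # bright horizontal stroke
    return img

X0 = np.array([gabor_features(make_sample(False)) for _ in range(20)])
X1 = np.array([gabor_features(make_sample(True)) for _ in range(20)])
w, thresh = fisher_lda(X0, X1)
acc = np.mean([x @ w > thresh for x in X1] + [x @ w <= thresh for x in X0])
print(f"toy accuracy: {acc:.2f}")
```

In the real system, the Gabor responses themselves form a high-dimensional feature vector and LDA both selects discriminative dimensions and drives a multi-class classifier over the 3755-character vocabulary; the pooled features and two-class discriminant above only indicate the shape of the computation.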
Figure 1 illustrates the procedure of the proposed approach. Text embedded in a natural scene is captured by a camera. The method of character segmentation has been introduced previously [2], so this paper focuses only on character recognition. Instead of using only binary information as most other OCR systems do, we extract features for recognition directly from the intensity of the image [10]. The motivation is to avoid potential information loss during the binarization process, which is irreversible. We utilize a local intensity normalization method to effectively handle luminance variations of the captured