Incremental Detection of Text on Road Signs from Video with Application to a Driving Assistant System

Wen Wu 1, Xilin Chen 2, Jie Yang 1,2
1 Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, U.S.A.
2 Human Computer Interaction Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, U.S.A.
{wenwu, xlchen, yang+}@cs.cmu.edu

ABSTRACT
This paper proposes a fast and robust framework for incrementally detecting text on road signs from natural scene video. The framework makes two main contributions. First, it applies a divide-and-conquer strategy to decompose the original task into two sub-tasks: localization of road signs and detection of text. The algorithms for the two sub-tasks are smoothly incorporated into a unified framework through a real-time tracking algorithm. Second, the framework provides a novel way to detect text from video by integrating 2D features in each video frame (e.g., color, edges, texture) with 3D information available in a video sequence (e.g., object structure). The feasibility of the proposed framework has been evaluated on video sequences captured from a moving vehicle. The framework can be applied to a driving assistant system and to other tasks of text detection from video.

Categories and Subject Descriptors
I.2.10 [Vision and Scene Understanding]: Video analysis; I.4.8 [Scene Analysis]: Color, Motion, Object recognition, Tracking.

General Terms
Algorithms, Design, Experimentation, Performance.

Keywords
Incremental text detection, road sign, natural scene video, driving assistant system.

1. INTRODUCTION
Automatic detection of text from video is an essential task for many multimedia applications such as video indexing, video understanding, and content-based video retrieval. Extensive research efforts have been directed to the detection, segmentation, and recognition of text from still images and video [1, 2, 7, 9, 10, 12, 18, 20].
In this paper, we focus on the task of automatically detecting text on road signs, with application to a driving assistant system. Text on road signs carries much useful information for driving: it provides guidance for navigation, describes the current traffic situation, defines right-of-way, warns about potential risks, and permits or prohibits travel in certain directions. Automatic detection of text on road signs can help keep a driver aware of the traffic situation and the surrounding environment by highlighting and recalling signs that are ahead and/or have been passed [8]. The system can also read out the text on road signs with a synthesized voice, which is especially useful for elderly drivers with weak visual acuity. Such a multimedia system can reduce a driver's cognitive load and enhance driving safety. Furthermore, it can be combined with other navigation and protection devices, e.g., an electronic map tool.

There are two essential requirements for the proposed framework to improve the safety and efficiency of driving: 1) detecting text on road signs in real time, and 2) achieving high detection accuracy with a low false hit rate. The application scenario is that a video camera is mounted on a moving vehicle to capture the scene in front of the vehicle. The system attempts to detect text on road signs from the video input and to assist the driver in maneuvering through traffic.

Correctly detecting text on road signs poses many challenges. First, video images are relatively low-resolution and noisy. Both the background and foreground of a road sign can be very complex and can change frequently in video. Lighting conditions are uncontrollable due to time and weather variations. Second, the appearance of text can vary with many factors, e.g., font, size, and color. Text can also move fast in video and be blurred by motion or occluded by other objects. Third, text can be distorted by the slant, tilt, and shape of signs.
In addition to the horizontal left-to-right orientation, other orientations include vertical, circularly wrapped around another object, and even mixed orientations within the same text area.

To address these difficulties, we propose a novel framework that incrementally detects text on road signs from video. The proposed framework takes full advantage of spatio-temporal information in video and fuses partial information for detecting text from frame to frame. The framework employs a two-step strategy: 1) locate road signs before detecting text, via a plane classification model that uses features such as discriminative points and color; and 2) detect text within the candidate road sign areas and then fuse the detection results with the help of a feature-based tracker.

The concrete steps of the framework are as follows. A set of discriminative points is found in each video frame. These selected points are then clustered based on local region analysis. Next, a vertical plane criterion is applied to verify road sign areas in video by recovering the orientations of candidate planes. The sign localization step thereby reduces the number of false positives caused by “text-like” areas. A multi-scale text detection algorithm is then used to locate text lines within candidate road sign areas. If a text line is detected, a minimum-bounding rectangle (MBR) is fitted to cover it and the previously selected points inside the MBR are tracked.
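The last two steps — fitting an MBR around a detected text line and fusing detections of the same line across frames — can be sketched in a minimal NumPy-only form. The function names and the IoU-based matching threshold below are illustrative assumptions for exposition, not the exact criteria used in the paper:

```python
import numpy as np

def fit_mbr(points: np.ndarray) -> tuple:
    """Axis-aligned minimum-bounding rectangle (x0, y0, x1, y1) of 2-D points."""
    (x0, y0), (x1, y1) = points.min(axis=0), points.max(axis=0)
    return float(x0), float(y0), float(x1), float(y1)

def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two rectangles: a simple stand-in for
    deciding whether a new detection matches an already-tracked text line."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def fuse(tracked: tuple, detected: tuple, thresh: float = 0.5) -> tuple:
    """Merge a per-frame detection into the tracked MBR when the overlap is
    large enough; otherwise treat the detection as a new text line."""
    if iou(tracked, detected) >= thresh:
        return (min(tracked[0], detected[0]), min(tracked[1], detected[1]),
                max(tracked[2], detected[2]), max(tracked[3], detected[3]))
    return detected
```

In the full system the points inside the MBR are supplied by the feature-based tracker from frame to frame rather than re-detected independently in each frame.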
Proceedings of ACM Multimedia 2004, New York, NY, October 10–16, 2004, p. 852.