Application of Planar Motion Segmentation for Scene Text Extraction

Tarak Gandhi, Rangachar Kasturi and Sameer Antani
Department of Computer Science and Engineering
The Pennsylvania State University
University Park, PA 16801
{gandhi, kasturi, antani}@cse.psu.edu

Abstract

This paper explores an approach for extracting scene text from a sequence of images with relative motion between the camera and the scene. It is assumed that the scene text lies on planar surfaces, whereas the other features are likely to be at random depths or undergoing independent motion. The motion model parameters of these planar surfaces are estimated using gradient-based methods and multiple-motion segmentation. The equations of the planar surfaces, as well as the camera motion parameters, are extracted by combining the motion models of multiple planar surfaces. This approach is expected to improve the reliability and robustness of the estimates, which are used to perform perspective correction on the individual surfaces. Perspective correction can lead to improvement in OCR performance. This work could be useful for detecting road signs and billboards from a moving vehicle.

1 Introduction

There is a considerable amount of text occurring in video that is a useful source of information. The text that occurs naturally in the 3-D scene being imaged is called scene text. Scene text can have any orientation, and its image will be distorted by perspective projection in addition to being subject to the illumination conditions of the scene and susceptible to partial occlusion by other objects. There has been very little research on extracting scene text from general-purpose video. The research that most closely resembles this work is on recognition of vehicle license plates [4, 5]. However, these approaches make restrictive assumptions about the text occurring in the scene. Scene text typically exists on a planar surface in a 3-D scene.
As the camera or the object moves, the motion of the text features should satisfy planar motion in 3-D. This research exploits this property to separate text features from features due to other objects, which are likely to be at different random depths and thus do not satisfy the planar constraint. A sequence of images can be used to segment different planar surfaces in the image, determine the model parameters, and remove outliers corresponding to clutter that does not fit any such surface or is in motion with respect to these surfaces. The model parameters, along with their estimated covariances, can be used to determine the camera motion in terms of the linear and angular velocity, and the scene structure in terms of the plane normal equations. Since the camera motion parameters are the same for all planar surfaces, these parameters, as well as the plane normals of multiple planar surfaces, are combined using linear and non-linear methods. Using the estimated plane normals, the perspective effect of the camera on the characters can be compensated. This step would improve the accuracy of Optical Character Recognition (OCR).

2 Planar Motion Model

Let $P = (X, Y, Z)^T$ be the 3-D coordinates of a point in the camera coordinate system, in which the $Z$ axis is the optical axis of the sensor. The perspective projection of the point in the image plane, with the focal length normalized to unity, is given by:

$$x = \frac{X}{Z}, \qquad y = \frac{Y}{Z} \qquad (1)$$

Let the relative motion between the camera and the scene be modeled by a translational velocity $T = (T_x, T_y, T_z)^T$ and a rotational velocity $\Omega = (\omega_x, \omega_y, \omega_z)^T$. If the point lies on a planar surface with a normal along vector $N = (N_x, N_y, N_z)^T$ with an equation of $N^T P = 1$, the theoretical image motion $(u, v)$ of such a point can be written as:

$$u = (x T_z - T_x)(N_x x + N_y y + N_z) + \omega_x x y - \omega_y (1 + x^2) + \omega_z y$$
$$v = (y T_z - T_y)(N_x x + N_y y + N_z) + \omega_x (1 + y^2) - \omega_y x y - \omega_z x \qquad (2)$$

Proceedings of the International Conference on Pattern Recognition (ICPR'00) 1051-4651/00 $10.00 © 2000 IEEE
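The planar motion model of Section 2 can be evaluated directly once the translation, rotation, and plane normal are given. The sketch below is illustrative only (the function and variable names are ours, not from the paper) and assumes the unit focal length used in the projection equation:

```python
# Illustrative sketch of the planar motion (flow) model; names are
# hypothetical and the focal length is assumed to be 1, as in Eq. (1).

def planar_flow(x, y, T, omega, N):
    """Image velocity (u, v) at image point (x, y) for a 3-D point on
    the plane N^T P = 1, under camera translation T = (Tx, Ty, Tz)
    and rotation omega = (wx, wy, wz)."""
    Tx, Ty, Tz = T
    wx, wy, wz = omega
    Nx, Ny, Nz = N
    # The plane equation N^T P = 1 gives the inverse depth at (x, y):
    # 1/Z = Nx*x + Ny*y + Nz
    inv_Z = Nx * x + Ny * y + Nz
    u = (x * Tz - Tx) * inv_Z + wx * x * y - wy * (1 + x ** 2) + wz * y
    v = (y * Tz - Ty) * inv_Z + wx * (1 + y ** 2) - wy * x * y - wz * x
    return u, v

# Pure forward translation toward a fronto-parallel plane at Z = 1
# produces a radially expanding flow field:
u, v = planar_flow(0.5, 0.5, T=(0, 0, 1), omega=(0, 0, 0), N=(0, 0, 1))
# u == 0.5, v == 0.5
```

Note that expanding the products in the model yields a flow that is quadratic in the image coordinates, with coefficients that mix $T$ and $N$; this coupling is one reason the paper combines the motion models of multiple planar surfaces, which share the same camera motion, to separate the two.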