International Journal of Computer Applications (0975-8887) Volume 91 - No. 9, April 2014

Text Extraction from Scene Images through Color Image Segmentation and Statistical Distributions

Ranjit Ghoshal
St. Thomas’ College of Engg. & Tech.
Kolkata-700023

Bibhas Chandra Dhara
Jadavpur University
Kolkata-700032

ABSTRACT
This article proposes a scheme for automatic extraction of text from scene images. We first apply a color image segmentation procedure, based on statistical features, to the RGB scene image. The segmentation separates out connected components (CCs) that are homogeneous in color and brightness, and we assume these CCs include the text components. The prime intention of this article is therefore to inspect these CCs in order to identify the possible text components. A number of shape-based features are defined that distinguish between text and non-text components. During learning, the distributions of these features are considered independently and approximated by parametric distribution families; for each feature, the best-fitting distribution is selected using a likelihood criterion. The class (text or non-text) score is the product of the corresponding feature distributions. Consequently, during testing, a CC is assigned to the class that produces the highest score. Our experiments are on the ICDAR 2011 Born Digital Dataset, and we have obtained satisfactory performance in distinguishing between text and non-text.

Keywords: Scene Image, Color Image Segmentation, Connected Component, Statistical Distributions.

1. INTRODUCTION
Automatic recognition of text portions in a natural scene image is useful to blind people and to foreigners facing a language barrier. Such a recognition methodology must also include the extraction of text portions from the scene images.
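The classification scheme summarized above can be made concrete with a small sketch: each feature of each class is fitted by the parametric family with the highest likelihood, and a CC is assigned to the class whose product of fitted densities (equivalently, sum of log-densities) is largest. The following numpy sketch is only an illustration under stated assumptions: the feature data is synthetic, and only Gaussian and exponential candidate families are used (the paper's actual families are not specified in this excerpt).

```python
import numpy as np

# Candidate parametric families: each fit returns a log-pdf closure.
def fit_normal(x):
    mu, sigma = x.mean(), x.std() + 1e-9
    return lambda v: -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

def fit_exponential(x):
    lam = 1.0 / (x.mean() + 1e-9)
    return lambda v: np.log(lam) - lam * v

FAMILIES = [fit_normal, fit_exponential]

def best_fit(x):
    """Select the family with the highest total log-likelihood on x."""
    return max((fit(x) for fit in FAMILIES), key=lambda lp: lp(x).sum())

def train(features_by_class):
    """features_by_class: {class name: array of shape (samples, features)}."""
    return {c: [best_fit(X[:, j]) for j in range(X.shape[1])]
            for c, X in features_by_class.items()}

def classify(model, v):
    """Class score = product of per-feature densities, i.e. sum of log-pdfs."""
    return max(model, key=lambda c: sum(lp(v[j]) for j, lp in enumerate(model[c])))

# Synthetic two-feature training data for the two classes (illustrative only).
rng = np.random.default_rng(0)
train_data = {"text":     rng.normal([5.0, 2.0], 0.5, size=(200, 2)),
              "non-text": rng.normal([1.0, 8.0], 0.5, size=(200, 2))}
model = train(train_data)
print(classify(model, np.array([4.8, 2.1])))   # a text-like sample
```

Working in log-space avoids underflow when many feature densities are multiplied, while preserving the argmax over classes.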
Moreover, segmentation of such text portions has a crucial impact on document processing, content-based image retrieval, robotics and intelligent transport systems. With the growing popularity of image capturing devices such as digital cameras, mobile phones and PDAs, digital images are nowadays easily available, and the extraction and recognition of text from scene images captured by such devices is a challenging problem. There have been several studies on text segmentation in the last few years. Wu et al. [9] use a local threshold method to segment text from gray image blocks containing text. Considering that text in images and videos is always colorful, Tsai et al. [8] develop a threshold method using intensity and saturation features to segment text in color document images. Lienhart et al. [4] and Sobottka et al. [7] use color clustering algorithms for text segmentation. In recent years, Jung et al. [3] employed a multi-layer perceptron classifier to discriminate between text and non-text pixels: a sliding window scans the whole image and serves as the input to a neural network, and high-probability areas of the resulting probability map are considered candidate text regions. Wavelet transforms have also been applied to text segmentation. In this context, Gllavata et al. [2] considered wavelet transform and K-means based texture analysis for text detection, and Saoi et al. [6] improved the method of Gllavata et al. [2] by applying the wavelet transform to the R, G and B channels of the input color image separately. More recently, Bhattacharya et al. [1] proposed a scheme based on the analysis of connected components (CCs) for the extraction of Devanagari and Bangla text from camera-captured scene images, along with a few criteria for robust filtering of text components. In this article we first apply fuzzy c-means based clustering on the color image.
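The fuzzy c-means step just mentioned can be sketched in a few lines of numpy. This is a generic, self-contained implementation of the standard FCM update rules, not the paper's code: the synthetic two-color "pixel" data, the fuzzifier m = 2, the iteration count and the deterministic initialization are all illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100):
    """Standard fuzzy c-means. X: (n_samples, n_features). Returns (centers, U)."""
    # Simple deterministic init: evenly spaced samples as initial centers.
    centers = X[:: max(1, len(X) // c)][:c].astype(float)
    for _ in range(n_iter):
        # Distances of every sample to every center, shape (n_samples, c).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard FCM membership update: u_ij ∝ d_ij^(-2/(m-1)), rows sum to 1.
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
        # Center update with fuzzified memberships.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U

# Synthetic "pixels": two well-separated color clusters in normalized RGB.
rng = np.random.default_rng(1)
pixels = np.vstack([rng.normal([0.9, 0.1, 0.1], 0.02, (100, 3)),
                    rng.normal([0.1, 0.1, 0.9], 0.02, (100, 3))])
centers, U = fuzzy_c_means(pixels, c=2)
labels = U.argmax(axis=1)   # hard assignment per pixel → connected components later
```

In the paper's pipeline, each cluster's hard-assigned pixels would then be grouped into connected components for the text/non-text analysis.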
With the assumption that text portions are homogeneous in color and lightness, different clusters may contain text portions as different connected components. The next step is the study of these connected components: we define some features that are used to distinguish between text and non-text. We consider text identification as a two-class problem, in which each class (i.e. text and non-text) is approximated by a combination of feature distributions. In the testing phase, we compare the score of each CC against these two classes, and the CC is assigned to the class with the highest score. Concerning the database, we use the public ICDAR 2011 Born Digital Dataset.

2. COLOR IMAGE SEGMENTATION
Color image segmentation is the first step of our text extraction, and the fuzzy c-means algorithm is used for this purpose. Before applying fuzzy c-means we extract some features from the normalized RGB image. Consider a pixel p_i of the image. Then p_i can be described by the tuple (r_i, g_i, b_i), i.e. the normalized R, G and B values. Besides these three color values, we take another two statistical features. Each of the statistical features considered here responds differently to different properties of text. In the following we describe each statistical feature.

Statistical Feature 1 (s): Let H be a gray-level histogram over a 7x7 window. The variance of H at each pixel is used to measure local information. It is defined as:

s = Σ_{j=1}^{N} (H(j) − H̄)²,   (1)

where N is the number of histogram bins and H̄ is the mean of H.
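Equation (1) can be computed directly: for every pixel, take the 7x7 neighborhood, build its gray-level histogram H, and sum the squared deviations of the bins from the histogram mean. A minimal, unoptimized numpy sketch follows; the choice of 8 bins over the 0-255 gray range and the edge-replication padding are assumptions for illustration, since this excerpt does not specify them.

```python
import numpy as np

def local_histogram_variance(gray, win=7, n_bins=8):
    """Per-pixel feature s = sum_j (H(j) - mean(H))^2, Eq. (1)."""
    h = win // 2
    padded = np.pad(gray, h, mode="edge")        # replicate borders (assumption)
    out = np.zeros(gray.shape, dtype=float)
    for y in range(gray.shape[0]):
        for x in range(gray.shape[1]):
            window = padded[y:y + win, x:x + win]
            # Gray-level histogram H of the 7x7 window.
            H, _ = np.histogram(window, bins=n_bins, range=(0, 256))
            out[y, x] = np.sum((H - H.mean()) ** 2)
    return out

# Toy image: a vertical step edge between two uniform regions.
gray = np.zeros((16, 16), dtype=np.uint8)
gray[:, 8:] = 255
s = local_histogram_variance(gray)
```

Note that a uniform window concentrates all 49 counts in one bin, so this feature measures how peaked the local gray-level distribution is, not the pixel-intensity variance itself.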