Curvature Feature Distribution based Classification of Indian Scripts from Document Images

Gurav Sharma *
Multimedia Laboratory, Indian Institute of Technology Delhi
grvsharma@gmail.com

Ritu Garg
Multimedia Laboratory, Indian Institute of Technology Delhi
ritu2721a@gmail.com

Santanu Chaudhury
Multimedia Laboratory, Indian Institute of Technology Delhi
schaudhury@gmail.com

* GS was a Masters student at IIT Delhi. RG is Project Scientist at Multimedia Lab, IIT Delhi. SC is Schlumberger Chair Professor in Electrical Engg. Department at IIT Delhi.

ABSTRACT
We present a framework for classifying text document images by script. We work in the domain of Indian scripts, which exhibit high inter-script similarity. Indian scripts have characteristic curvature distributions that help in discriminating them visually. We use edge-direction-based features to capture the distribution of curvature, and a recently proposed feature selection algorithm to obtain the most discriminating curvature features. We automatically form a hierarchy based on statistical distances between the script models. The hierarchy allows us to group similar scripts at one level and then focus on classifying between the similar scripts at the next level, which improves accuracy. We report experiments and results on a large set of about 3400 images.

Categories and Subject Descriptors
I.7.0 [Computing Methodologies]: Document And Text Processing—General; I.5.4 [Computing Methodologies]: Pattern Recognition—Application

General Terms
Text Document Image Classification System

Keywords
Indic script image identification, statistical modeling

1. INTRODUCTION
In this paper we present a framework to address the problem of script classification from document images. We create
statistical models of each script class based on a set of training images. The features used are motivated by the need to capture the curvature distributions of the scripts. A recently proposed feature selection algorithm gives us the most discriminating features and also helps reduce the dimension of the feature space. Our algorithm is fast in practice with good accuracy. We also incorporate a hierarchy in the model, based on statistical distances between the models for the scripts. This groups similar scripts at one level (similarity here is in terms of the statistical models, and we find the resulting groups to be visually acceptable in experiments) and then allows us to focus on discriminating between them at the next level. Since training at the lower level focuses on a smaller number of scripts instead of all the scripts, it leads to better performance.

Automatic script identification is an important step towards many high-level tasks. For example, it can be used to manage large document image collections by sorting them by script, as a preprocessing step in character recognition systems, and for search and retrieval in document image databases. The domain of Indian scripts has its own distinct characteristics. Indian scripts can be visually discriminated by observing the curliness of the script. While some of the scripts have a Shiro-rekha (a horizontal line at the top of the word) along with many dominant vertical strokes, e.g. Fig.
1(a), some have predominantly curved symbols with very few straight lines, e.g. Fig. 1(b). We also consider English as one of the script classes; it has dominant straight lines with a certain amount of curves, which makes its overall curvature distribution different. We use this observation as motivation to work with edge-direction-based features that capture the distribution of curvature in the scripts. By the same argument, not all curvature directions will be equally discriminating for classification. To obtain the most discriminating curvature directions, we employ a recently proposed information-theoretic feature selection algorithm [10]. This algorithm exploits not only the dependence of individual feature values on the class label (captured using a mutual-information-based formulation) but also the dependence between multiple features observed together and the class label.

The paper is organized as follows. First we give a brief survey of related work in Sec. 2. We then discuss our method in detail in Sec. 3, describing each part: the features used in Sec. 3.1, the feature selection algorithm in Sec. 3.2, the statistical model used for classification in Sec. 3.3 and the hierarchical framework in Sec. 3.4. We then show the experiments we conducted to validate our framework in