Script Identification from Indian Documents

Gopal Datt Joshi, Saurabh Garg, and Jayanthi Sivaswamy

Centre for Visual Information Technology, IIIT Hyderabad, India
gopal@research.iiit.ac.in, jsivaswamy@iiit.ac.in

Abstract. Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and selecting a script-specific OCR in a multilingual environment. In this paper, we present a scheme to identify different Indian scripts from a document image. The scheme employs hierarchical classification, which uses features consistent with human perception. These features are extracted from the responses of a multi-channel log-Gabor filter bank, designed at an optimal scale and multiple orientations. In the first stage, the classifier groups the scripts into five major classes using global features. In the next stage, a sub-classification is performed based on script-specific features. All features are extracted globally from a given text block, so no complex and reliable segmentation of the document image into lines and characters is required. The proposed scheme is therefore efficient and can be used in many practical applications that require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to skew generated in the process of scanning and relatively insensitive to changes in font size. The proposed system achieves an overall classification accuracy of 97.11% on a large testing data set. These results serve to establish the utility of a global approach to the classification of scripts.

1 Introduction

The amount of multimedia data captured and stored is increasing rapidly with advances in computer technology. Such data include multi-lingual documents.
For example, museums store images of old, fragile documents that have scientific, historical, or artistic value and are written in different scripts; these images are typically held in large databases. Document analysis systems that help process these stored images are of interest both for efficient archival and for providing access to researchers. Script identification is a key step in document image analysis, especially when the environment is multi-script and multi-lingual. An automatic script identification scheme is useful to (i) sort document images, (ii) help in selecting appropriate script-specific OCRs, and (iii) search online archives of document images for those containing a particular script.

Existing script classification approaches fall into two broad categories, namely, local and global approaches. The local approaches analyse a list

H. Bunke and A.L. Spitz (Eds.): DAS 2006, LNCS 3872, pp. 255–267, 2006.
© Springer-Verlag Berlin Heidelberg 2006