Script Identification from Indian Documents

Gopal Datt Joshi, Saurabh Garg, and Jayanthi Sivaswamy

Centre for Visual Information Technology, IIIT Hyderabad, India
gopal@research.iiit.ac.in, jsivaswamy@iiit.ac.in

Abstract. Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and selecting a script-specific OCR in a multilingual environment. In this paper, we present a scheme to identify different Indian scripts from a document image. The scheme employs hierarchical classification using features consistent with human perception. These features are extracted from the responses of a multi-channel log-Gabor filter bank designed at an optimal scale and multiple orientations. In the first stage, the classifier groups the scripts into five major classes using global features. In the next stage, a sub-classification is performed based on script-specific features. All features are extracted globally from a given text block, so no complex and reliable segmentation of the document image into lines and characters is required. The proposed scheme is therefore efficient and can be used in many practical applications that require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to the skew generated during scanning and relatively insensitive to changes in font size. The proposed system achieves an overall classification accuracy of 97.11% on a large testing data set. These results serve to establish the utility of a global approach to the classification of scripts.

1 Introduction

The amount of multimedia data captured and stored is increasing rapidly with advances in computer technology. Such data include multi-lingual documents.
For example, museums store images of all old, fragile documents that have scientific, historical or artistic value and are written in different scripts, typically in large databases. Document analysis systems that help process these stored images are of interest both for efficient archival and for providing access to various researchers. Script identification is a key step in document image analysis, especially when the environment is multi-script and multi-lingual. An automatic script identification scheme is useful to (i) sort document images, (ii) help in selecting appropriate script-specific OCRs and (iii) search online archives of document images for those containing a particular script.

H. Bunke and A.L. Spitz (Eds.): DAS 2006, LNCS 3872, pp. 255–267, 2006. © Springer-Verlag Berlin Heidelberg 2006

Existing script classification approaches can be classified into two broad categories, namely, local and global approaches. The local approaches analyse a list