Cepstral Domain Teager Energy for Identifying Perceptually Similar Languages

Hemant A. Patil 1 and T.K. Basu 2

1 Dhirubhai Ambani Institute of Information and Communication Technology, DA-IICT, Gandhinagar, Gujarat, India
hemant_patil@daiict.ac.in
2 Department of Electrical Engineering, Indian Institute of Technology, IIT Kharagpur, West Bengal, India
tkb@ee.iitkgp.ernet.in

Abstract. Language Identification (LID) refers to the task of identifying an unknown language from the test utterances. In this paper, a new feature set, viz., T-MFCC, is developed by amalgamating the Teager Energy Operator (TEO) and the well-known Mel frequency cepstral coefficients (MFCC). The effectiveness of the newly derived feature set is demonstrated for identifying perceptually similar Indian languages such as Hindi and Urdu. A modified structure of the polynomial classifier with 2nd and 3rd order approximation has been used for the LID problem. The results have been compared with the state-of-the-art feature set, viz., MFCC, and T-MFCC is found to be effective (an average jump of 21.66%) in the majority of the cases. This may be due to the fact that T-MFCC represents the combined effect of airflow properties in the vocal tract (which are known to be language and speaker dependent) and the human perception process for hearing.

1 Introduction

Language Identification (LID) refers to the task of identifying an unknown language from the test utterances. LID applications fall into two main categories: pre-processing for machine understanding systems and pre-processing for human listeners. In the first category, an LID system could be run in advance of the speech recognizer. In the second, LID might be used to route an incoming telephone call to a human switchboard operator fluent in the corresponding language [6]. Several techniques, such as spectral, prosody, phoneme, and word-level approaches, have been proposed in the literature for the LID problem.
In this paper, we adopt a spectral-based approach [5] and show the effectiveness of the newly derived feature set, viz., Teager Energy based Mel Frequency Cepstral Coefficients (T-MFCC), for identification of perceptually similar Indian languages, viz., Hindi and Urdu.

A. Ghosh, R.K. De, and S.K. Pal (Eds.): PReMI 2007, LNCS 4815, pp. 455–462, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Data Collection and Corpus Design

A database of 180 speakers (60 in each of Marathi, Hindi and Urdu) is created from different states of India, viz., Maharashtra, Uttar Pradesh and West Bengal