Cepstral Domain Teager Energy for Identifying Perceptually Similar Languages

Hemant A. Patil 1 and T.K. Basu 2

1 Dhirubhai Ambani Institute of Information and Communication Technology, DA-IICT, Gandhinagar, Gujarat, India
hemant_patil@daiict.ac.in
2 Department of Electrical Engineering, Indian Institute of Technology, IIT Kharagpur, West Bengal, India
tkb@ee.iitkgp.ernet.in

Abstract. Language Identification (LID) refers to the task of identifying an unknown language from the test utterances. In this paper, a new feature set, viz., T-MFCC, is developed by amalgamating the Teager Energy Operator (TEO) and the well-known Mel frequency cepstral coefficients (MFCC). The effectiveness of the newly derived feature set is demonstrated for identifying perceptually similar Indian languages such as Hindi and Urdu. A modified structure of the polynomial classifier of 2nd and 3rd order approximation has been used for the LID problem. The results have been compared with those of the state-of-the-art feature set, viz., MFCC, and T-MFCC is found to be effective (an average jump of 21.66%) in the majority of the cases. This may be due to the fact that T-MFCC represents the combined effect of airflow properties in the vocal tract (which are known to be language and speaker dependent) and the human perception process for hearing.

1 Introduction

Language Identification (LID) refers to the task of identifying an unknown language from the test utterances. LID applications fall into two main categories: pre-processing for machine understanding systems and pre-processing for human listeners. In the first category, an LID system could be run in advance of the speech recognizer. In the second, LID might be used to route an incoming telephone call to a human switchboard operator fluent in the corresponding language [6]. Several techniques, such as spectral, prosody, phoneme, and word-level approaches, have been proposed in the literature for the LID problem.
In this paper, we adopt the spectral-based approach [5] and show the effectiveness of the newly derived feature set, viz., Teager Energy based Mel Frequency Cepstral Coefficients (T-MFCC), for identification of perceptually similar Indian languages, viz., Hindi and Urdu.

A. Ghosh, R.K. De, and S.K. Pal (Eds.): PReMI 2007, LNCS 4815, pp. 455–462, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Data Collection and Corpus Design

A database of 180 speakers (60 in each of Marathi, Hindi and Urdu) is created from different states of India, viz., Maharashtra, Uttar Pradesh and West Bengal