Issues in developing LVCSR System for Dravidian Languages: An exhaustive case study for Tamil

G. Bharadwaja Kumar
Department of Computing Science & Engineering, Vellore Institute of Technology, Chennai Campus, Chennai - 600127, India

Melvin Jose Johnson Premkumar
Department of Computer Science, Madras Institute of Technology, Anna University, Chennai - 600044, India

ABSTRACT
Research in the area of Large Vocabulary Continuous Speech Recognition (LVCSR) for Indian languages has not seen the level of advancement achieved for English, since large-scale speech and language corpora are scarce even today. Tamil is one of the four major Dravidian languages spoken in southern India. One of its characteristics is that it is morphologically very rich, a quality that poses a great challenge for developing LVCSR systems. In this paper, we analyze a Tamil corpus of 10 million words and present the results of a type-token analysis that demonstrates the morphological richness of Tamil. We describe a grapheme-to-phoneme (G2P) mapping system for Tamil that achieves an accuracy of 99.56%. We show the impact of important parameters such as absolute beam width, language weight, number of Gaussians and number of senones on speech recognition accuracy for a limited vocabulary (3k). We also present the results of a large open-vocabulary, speaker-independent speech recognition task for vocabulary sizes of 30k, 60k and 100k. The Out-Of-Vocabulary (OOV) rates are 20.2%, 15.8% and 12.8% respectively, and the corresponding accuracies are 43.59%, 47.11% and 43.52%.

General Terms: Computer Science, Artificial Intelligence

Keywords: Speech Recognition, Tamil, Sphinx, Large Vocabulary

1. INTRODUCTION

Recently there has been growing interest in ASR for Indian languages. Initial work on large-vocabulary speech recognition started with Hindi in the early years of the previous decade.
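The type-token ratio and OOV rate referred to in the abstract can be sketched as follows. This is a minimal illustration only: the toy word lists below are invented for the example and are not drawn from the 10-million-word corpus or the experimental vocabularies.

```python
# Sketch: type-token and OOV statistics of the kind reported in the
# abstract. The toy word lists below are illustrative only.

def type_token_ratio(tokens):
    """Unique word forms (types) per running word (token).
    Morphologically rich languages such as Tamil show a high ratio,
    since one lemma surfaces as many inflected forms."""
    return len(set(tokens)) / len(tokens)

def oov_rate(vocabulary, test_tokens):
    """Fraction of test tokens absent from the recognizer's vocabulary."""
    vocab = set(vocabulary)
    oov = sum(1 for t in test_tokens if t not in vocab)
    return oov / len(test_tokens)

corpus = ["naan", "veettukku", "ponen", "naan", "palliyil", "irunthen"]
print(round(type_token_ratio(corpus), 2))  # 5 types over 6 tokens -> 0.83

vocab = ["naan", "veettukku", "ponen"]
test = ["naan", "palliyil", "ponen", "irunthen"]
print(round(oov_rate(vocab, test), 2))     # 2 of 4 test tokens OOV -> 0.5
```

On real corpora, a high type-token ratio directly inflates the OOV rate for any fixed recognizer vocabulary, which is why the 30k-100k vocabularies reported above still leave double-digit OOV rates.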
In [1], the authors conducted large-vocabulary continuous speech recognition experiments in Hindi using the IBM ViaVoice speech recognizer. For a vocabulary size of 65,000 words, the system gives a word accuracy of 75% to 95%. In [2], large-vocabulary speech recognition experiments for three languages, namely Marathi, Telugu and Tamil, were conducted in different environments, namely landline and cellphone. The vocabulary sizes used in these experiments vary from 14,000 to 26,000 words. They obtained word error rates of about 20.7%, 19.4% and 15.4% over landline data and 23.6%, 17.6% and 18.3% over cellphone data for Marathi, Tamil and Telugu respectively. [3] used the Hidden Markov Model Toolkit for Bengali continuous speech recognition and obtained an average recognition rate of 76.33% for male speakers and 52.34% for female speakers. In [4], the authors investigate the effect of sharing acoustic models across Tamil and English to model the acoustic space of both languages effectively, without having to model each language separately. They conjectured that this reduces the computational cost of the search engine, since a single acoustic model serves multiple languages. They obtained word recognition accuracies of 61.61% and 64.42% for Tamil without and with adaptation respectively. In [5], the authors carried out experiments based on word-level and triphone models for Tamil speech recognition and achieved 88% accuracy over limited data. They also tried context-independent syllable models for Tamil speech recognition [6], which under-performed compared to context-dependent phone models.

There have been some attempts to build acoustic models at the syllable level for Indian languages. In [7], the authors proposed a group-delay-based algorithm to automatically segment and label a continuous speech signal into syllable-like units for Indian languages.
The syllable recognition performance is about 42.6% and 39.94% for Tamil and Telugu respectively. A new feature extraction technique proposed by them, which uses features extracted at multiple frame sizes and frame rates, improves recognition performance to 48.7% and 45.36% for Tamil and Telugu respectively. In [8], an algorithm for segmentation-based speech recognition was presented. This approach segments words from the speech signal and then characters from the words. Neural networks trained with the back-propagation algorithm were used to identify the segmented characters. In [9], the authors used a modified version of the text-independent phoneme segmentation algorithm proposed by Guido Aversano for their speech recognition experiments. In [10], the authors analyzed the effect of an enhanced morpheme-based trigram model with Katz back-off smoothing compared with word-based language models (LMs). The word error rates for word-based trigram models in the news and politics domains are 13.8% and 25.04%, compared to 12.9% and 23.9% for morpheme-based trigram models. Although many experiments have been conducted to explore conventional approaches like phoneme-based models [6] and