An Approach to Mixed Language Automatic Speech Recognition

Kiran Kumar Bhuvanagiri, Sunil Kumar Kopparapu
TCS Innovation Labs - Mumbai, Yantra Park, Pokharan Road 2, Thane (West), Maharashtra, INDIA
{KiranKumar.Bhuvanagiri,SunilKumar.Kopparapu}@TCS.Com

Abstract

The use of mixed language in day-to-day spoken speech is becoming common and is increasingly accepted as syntactically correct. However, recognition of mixed language spoken speech is a challenge for a speech recognition engine. Though sparse, there have been studies on how to enable recognition of mixed language spoken speech. At one extreme is the use of acoustic models covering the complete phone set of the mixed language; at the other extreme is the use of a language identification module followed by a language dependent speech recognition engine. Each of these has its own implications. In this paper, we approach the problem of mixed language recognition by constraining ourselves to readily available resources and show that by (a) suitably modifying the language model to accommodate mixed language and (b) constructing a suitable pronunciation dictionary, one can achieve good recognition of mixed language spoken speech.

1. Introduction

A mixed language arises through the fusion of two or more, usually distinct, source languages, normally in situations of thorough bilingualism, such that it is not possible to classify the resulting language as belonging to either of the language families that were its source [17], [1], [2]. With urbanization and the geographic mobility of people, the ability to converse simultaneously in many languages is becoming common. A very large population uses mixed language in everyday conversation without actually being aware of its usage, especially the young urban population.
Though mixed language is defined as a mixture of two distinct languages without regard to which language is mixed into which, at least in the Indian context the native language is the primary language and the non-native language (usually English) is the mixed, or secondary, language. The primary language can be loosely defined as the language spoken in the majority within the mixed language: a given sentence contains a majority of words from that language and a relatively smaller number of words from the secondary language. One can observe that the words uttered in the secondary language are often keywords, foreign words, or phrases that have colloquial acceptance. As a result, the language changes very frequently within a spoken sentence. Recognition of mixed language speech therefore requires, in our opinion, an entirely different approach.

Consider a human agent based inquiry service in a metropolitan city which has to cater to people speaking different languages. In such a scenario, the agent needs to be able to converse (understand and reply) in multiple languages, which is very unlikely. A possible solution is to ascertain the language of the speaker and then direct the call to an agent who can converse expertly in that language. Similarly, a speech solution for multiple languages can be built by developing a separate recognition engine for each language: having identified the language of the speaker, the speaker could be directed to that language specific recognition engine. Clearly, though this system can address multiple languages, it cannot work in a scenario where people use mixed language speech, even if one knew the specific mix of languages in use, because the language segments are short and the changes very frequent. Recently there has been an increased interest in mixed language recognition (for example [2], [3]), although the work in the literature is restricted to a mix of Mandarin and Taiwanese.
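The loose "majority of words" definition of the primary language given above can be illustrated with a minimal sketch. This is not part of the paper's system; the word lists below are hypothetical stand-ins for real per-language lexicons, and the example sentence is a toy Hindi-English mix chosen for illustration.

```python
# Toy sketch (assumption, not the paper's method): determine the primary
# language of a mixed-language sentence by simple word-majority counting.
# HINDI_WORDS and ENGLISH_WORDS are hypothetical miniature lexicons.
HINDI_WORDS = {"mujhe", "ek", "chahiye", "kal", "subah"}
ENGLISH_WORDS = {"ticket", "train", "express", "platform"}

def primary_language(sentence: str) -> str:
    """Return the language contributing the majority of words in the sentence."""
    words = sentence.lower().split()
    hindi = sum(w in HINDI_WORDS for w in words)
    english = sum(w in ENGLISH_WORDS for w in words)
    return "hindi" if hindi >= english else "english"

# "mujhe ek ticket chahiye" has three Hindi words and one English keyword,
# so Hindi is the primary language and English the secondary one.
print(primary_language("mujhe ek ticket chahiye"))  # -> hindi
```

In real mixed-language speech the secondary-language words are typically exactly the keywords the application cares about, which is why a simple routing-by-language scheme, unlike this per-sentence majority count, breaks down.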
Work in mixed language speech recognition is thus in its nascent stages, and to the best of our knowledge there is no work reported in the literature for an India specific language mix. There are two major frameworks for building mixed language automatic speech recognition (ML-ASR): the multi-pass framework and the one-pass framework. Typically, in a multi-pass ML-ASR, the exact instances in spoken speech where