Text Based Language Identification System for Indian Languages Following Devanagiri Script

Indhuja K, Indu M, Sreejith C
M.Tech Computational Linguistics
Department of Computer Science and Engineering
Government Engineering College Sreekrishnapuram, Palakkad, India

P. C. Reghu Raj
Professor
Department of Computer Science and Engineering
Government Engineering College Sreekrishnapuram, Palakkad, India

Abstract: Text based language identification is the task of automatically recognizing the language of a given text or document. It is more difficult to discriminate between languages within a language family than between languages across families. In this paper, we investigate the performance of statistical measures for text based language identification, with an emphasis on five languages used in India that are written in the Devanagari script: Hindi, Sanskrit, Marathi, Nepali and Bhojpuri. The proposed system uses n-grams as features for classification. Language identification is an important pre-processing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India there is wide scope for automatic language identification, since it would be a vital step in bridging the digital divide between the Indian masses and the world.

Keywords: Devanagari Script, Multilingual Computing, Natural Language Processing, n-gram Statistics, Text Based Language Identification

I. INTRODUCTION

Language identification (LID) is an important problem in the field of Natural Language Processing (NLP). With the current spread of the internet, text is available in a number of languages other than English. Automatic treatment of these texts for any purpose requiring NLP, such as indexing or querying, first requires identifying the language. This may seem an elementary and simple task for humans in the real world, but it is difficult for a machine, primarily because different scripts are made up of differently shaped patterns that produce different character sets.
LID is of special significance for a multilingual country like India. A large number of languages are used in India, of which twenty-two have been given constitutional recognition and are considered major languages [14]. In most cases, frequent code switching and code mixing are also observed. If multilingual documents could be segmented language-wise, it would be very useful both for exploring linguistic phenomena such as code switching and code mixing, and for processing each segment computationally in the appropriate way. Identifying the language of a given small piece of text is therefore an important problem in the Indian context.

Devanagari is one of the most widely used and adopted writing systems in the world. It is used for writing languages such as Sanskrit, Hindi, Marathi, Nepali, Konkani, Punjabi and many other languages and dialects.

One of the popular methods for language identification is the n-gram based method. It uses letter n-grams, which represent the frequency of occurrence of the various n-letter combinations in a particular language. In n-gram based methods for text based LID, frequency statistics of n-gram occurrences are used as features for classification. The advantage of n-grams over other methods is that no linguistic knowledge needs to be gathered to construct a classifier. n-gram methods are simple, and their accuracy increases as n increases. Long character strings contain more n-grams, from which statistical measures can be calculated. The number of n-grams in a character string is l - n + 1, where l is the length of the string.

Our objective is to build a text based language identification system for Indian languages that follow the Devanagari script. This paper is organized as follows. Section 2 presents a detailed literature survey that helps to formulate the problem. In Section 3, an n-gram model is proposed for identifying the given language pairs.
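The n-gram count l - n + 1 given in the introduction can be checked with a minimal sketch. The function names `char_ngrams` and `ngram_frequencies` are hypothetical and not taken from the paper, and the sketch assumes that string length is measured in Unicode code points, so a Devanagari vowel sign or virama counts as one character. The paper does not state how its system segments characters.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order.

    Assumption: characters are Unicode code points, as in Python's str.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when len(text) < n, so short strings yield no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text, n):
    """Count how often each character n-gram occurs in `text`."""
    return Counter(char_ngrams(text, n))


word = "नमस्ते"  # illustrative input, not from the paper's corpus
bigrams = char_ngrams(word, 2)

# The number of n-grams equals l - n + 1 whenever l >= n.
assert len(bigrams) == len(word) - 2 + 1

print(bigrams)
print(ngram_frequencies(word, 2).most_common(3))
```

Frequency tables of this kind, built per language from training text, are the sort of features the n-gram approach relies on. How those features are compared or classified is not shown here.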
The experimental details and the results obtained are presented in Section 4. Conclusions are given in Section 5. The last section lists the references.

II. LITERATURE SURVEY

A lot of research has been carried out in this field, and there has been significant progress in this area over the last decade. Methods of language identification in practice include the Naive Bayes classifier, Centric method, Support Vector Machine (SVM), Neural Networks, Markov Model etc. Here we discuss some recent studies carried out in the field of language identification.

Decision trees, Hidden Markov Models, Neural Networks and SVMs are tools from a more conventional pattern recognition background. Though it may be expected that these classifiers would prove more accurate in the task, published results demonstrate that it is still difficult to outperform the simpler methods.

International Journal of Engineering Research & Technology (IJERT), Vol. 3 Issue 4, April - 2014, ISSN: 2278-0181, www.ijert.org, IJERTV3IS040389

In n-gram based methods for text-based LID,