LANGUAGE CORPORA: PRESENT INDIAN NEED

NILADRI SEKHAR DASH
Indian Statistical Institute, Kolkata
Email: niladri@isical.ac.in

ABSTRACT

Corpora have proved their value in both linguistics and language technology. Information obtained from corpora has challenged intuition-based language study, since intuitive observations are found inadequate when compared with findings from corpora. However, the value of corpora is not yet acknowledged in India, although in recent times some sporadic attempts have been made to design corpora in Indian languages. We argue here for initiating large-scale projects to develop corpora of various types in Indian languages, not only to contribute to research in language technology, but also to provide reliable language resources for the benefit of the people of the country. We plead for the generation of the specific types of corpus required for designing tools and systems for language technology, linguistics research, and education.

1. INTRODUCTION

The utilisation of language corpora in language technology and linguistics research is well established. However, this practice has yet to take hold in India. Here language technology is in its infancy, linguistic activities follow a traditional mode, and language education follows a centuries-old pedagogic process. None of these pays any attention to language corpora.

As a multilingual country, India is a linguistic giant. It preserves a large number of living languages of various ethnic and linguistic communities. Due to the lack of corpora, these languages suffer from a scarcity of language technology. The generation of corpora could enhance language education to increase the literacy rate, save endangered Indian languages from extinction, and protect languages that have lost relevance against the overwhelming aggression of English. Corpora could help these languages survive the battle against linguistic imperialism. In addition, corpora would supply statistically reliable information to help them regain their lost ground.

In this paper, we seek to draw attention to the fact that India, in comparison to advanced countries, lags far behind in corpus generation, language technology development, corpus-based language research, and education. This becomes more painful when we realise that some Indian languages (e.g. Hindi, Bangla, Urdu, Telugu, Malayalam, and Tamil) have far more speakers than the languages of some advanced countries. We believe that designing tools and systems for language technology in India can best be achieved after the generation of corpora in Indian languages. Therefore, we propose the generation of written and speech corpora in all Indian languages. This proposal is equally relevant to the languages spoken in the other South Asian countries (Bangladesh, Sri Lanka, Pakistan, Maldives, Nepal, and Bhutan).

2. EARLY ATTEMPTS

The first corpus of an Indian language is the Kolhapur Corpus of Indian English (KCIE), which was designed by Prof. S.V. Shastri and his colleagues at Shivaji University, Kolhapur, India, in 1988. It contains approximately one million words of Indian English drawn from materials published in the year 1978. It was built to support a comparative study of American, British, and Indian English. It has ably demonstrated the independent identity of Indian English, which is rich in Indian vocabulary and