Corpora Creation for Indian Language Technologies – The ILCI Project

Akanksha Bansal, Esha Banerjee and Girish Nath Jha
Jawaharlal Nehru University
New Delhi, India
{akanksha.bansal15, esha.jnu, girishjha}@gmail.com

Abstract

This paper presents an overview of corpus classification and development in electronic format for 16 language pairs, with Hindi as the source language. In a multilingual country like India, the major thrust in language technology lies in providing inter-communication services and direct information access in one’s own language. As a result, language technology in India has seen major developments over the last decade in machine translation and speech synthesis systems. As research deepens, the need for a high-quality standardised corpus is emerging as a primary challenge. To address this need, the Government of India has initiated a mega project called the Indian Languages Corpora Initiative (ILCI) to collect parallel annotated corpora in 17 scheduled languages of the Indian Constitution. The project is currently in its second phase, in which it aims to collect 8,50,000 parallel annotated sentences in 17 Indian languages in the domains of Entertainment and Agriculture. Together with the 6,00,000 parallel sentences collected in Phase 1 in the domains of Health and Tourism (Choudhary & Jha, 2011), the corpus being developed is one of the largest known parallel annotated corpora for any Indian language to date. This phase will ultimately also see the development of chunking standards for processing the annotated corpus.

Keywords: Corpus creation, Natural Language Processing, Indian language technologies, ILCI, Domain Classification

1. Introduction

India is home to over 780[1] languages from five language families (Indo-Aryan, Dravidian, Tibeto-Burman, Austro-Asiatic and Andamanese (Abbi 2001)), with the possibility of a sixth family, Great Andamanese (Abbi 2008), being added, and as such presents rich linguistic diversity.
The Eighth Schedule of the Indian Constitution lists 22 languages (called Scheduled languages) which are representative of the state languages and are part of the Official Languages Commission. Hindi, spoken by over 40% of the nation's population[2], enjoys the status of the official language of India, with English as an associate official language[3]. The Indian Languages Corpora Initiative (ILCI) is an initiative of the Technology Development for Indian Languages (TDIL)[4] programme, funded by the Department of Electronics and Information Technology (DeitY)[5], Government of India, to help develop language technology applications for people in their own language within a multilingual society like India. Since Unicode standards for Indian languages were introduced, many texts have been converted into electronic format, but a large repository of texts is still unavailable for the creation of parallel corpora.

[1] Based on an article published by The Hindu (July 22, 2013) on the People’s Linguistic Survey of India (PLSI) conducted by G.N. Devy, initiated by Bhasha Research and Publication Centre.
[2] Census of India 2001.
[3] The Official Languages Act, 1963.
[4] Official website: tdil.mit.gov.in
[5] deity.gov.in/content/rd-indian-language-technology

Much of the corpus creation work in India has been for short-term project goals, and the only corpora available in the public domain are the EMILLE parallel corpus (Baker, P. et al., 2004) and the English-Indian languages parallel corpus developed by CIIL, Mysore (available with LDC-IL). Therefore, the ILCI corpus developed in Phase 1 of the project, with 9,600,000 words in 12 languages, stands as the largest parallel annotated corpus available to date.

2. ILCI project

The ILCI project proposes to build a multilingual parallel corpus in 23 scheduled languages, including English, with Hindi as the source language.
The project began in 2010 and completed its first phase in April 2012, with parallel corpora annotated for part of speech created in 12 languages for two domains: Health and Tourism (Narayan, 2011). Translation was done manually, and annotation was done semi-automatically by linguists for each language. For annotation, the part-of-speech tagset standardized by the Bureau of Indian Standards[6] (2010) was followed. Java-based tools aided linguists in efficient translation and annotation of the corpus. The corpus is freely available for download for research purposes[7].

[6] www.bis.org.in
[7] www.tdil.org

The project is currently running in its second phase, with data being collected in 17 languages in the domains of Agriculture and Entertainment. With 25,000 sentences in every language in each domain, a total of 8,50,000 parallel annotated sentences are expected to be collected at the end of the second