OPEN FORUM

Shrinking digital gap through automatic generation of WordNet for Indian languages

Amita Jain · Devendra K. Tayal · Sunny Rai

Received: 16 April 2014 / Accepted: 4 July 2014
© Springer-Verlag London 2014

Abstract Hindi ranks fourth in the world by number of speakers. In spite of that, it has < 0.1 % presence on the web owing to a lack of competent lexical resources, a key reason behind the digital gap caused by the language barrier among the Indian masses. Following in the footsteps of the renowned lexical resource English WordNet, 18 Indian languages have begun building WordNets under the Indo WordNet project. India is a multilingual country with around 122 languages and 234 mother tongues. Many Indian languages still do not have any reliable lexical resource, and the coverage of numerous WordNets in progress is still far from the average value of 25,792. The tedious manual process and high cost are the major reasons behind the unsatisfactory coverage and slow progress. In this paper, we discuss the socio-cultural and economic impact of providing Internet accessibility and present an approach for the automatic generation of WordNets to tackle the lack of competent lexical resources. Problems such as accuracy, association of language-specific gloss/example and incorrect back-translations, which arise when deviating from the traditional approach of compilation by lexicographers, are resolved by utilising the Wikipedia editions available for Indian languages.

Keywords Computational lexicon · Indian languages · Wikipedia · WordNet · Statistical methods

1 Introduction

The expansion of the Internet has interconnected the socio-economic environment of the world and redefined the concept of global culture. Unlike the traditional office model, the Internet provides instant information flow, access to remote markets and reduced cost. The contribution of the Internet to the world's economy has reached 70 % of global GDP, a staggering value in comparison with the GDPs of even advanced economies (Manyika and Roxburgh 2011).
At present, India, the second most populous country in the world, has a mere 13 % Internet penetration. Issues such as speed, affordability, lack of e-skills and unavailability of functional infrastructure are the reasons most often cited for the digital divide among the masses. However, the most fundamental and neglected reason behind the inability to access the web is a lack of familiarity with the interface language. Two out of every three persons in the world do not have access to the Internet, and 4.3 million of them are residents of developing countries, belonging to the poorest strata of society with sparse acquaintance with English. India, the second largest English-speaking nation, has only 10 % of its population as English speakers. Hindi ranks fourth worldwide with 260 million speakers, i.e. just after English. It is the most spoken language in India, with more than 41 % of India's population as speakers. In spite of that, 56.1 % of websites on the Internet are in English, whereas the percentage of websites with content in Hindi is even < 0.1 %.¹

A. Jain (corresponding author) · S. Rai
Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, Govt. of NCT of Delhi, Delhi, India
e-mail: amitajain@aiactr.ac.in

D. K. Tayal
Department of Computer Science and Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India

¹ Content languages survey on 28th April, 2014: www.w3techs.com/technologies/overview/content_language/all

AI & Soc
DOI 10.1007/s00146-014-0548-5

According to