OPEN FORUM
Shrinking digital gap through automatic generation of WordNet
for Indian languages
Amita Jain · Devendra K. Tayal · Sunny Rai
Received: 16 April 2014 / Accepted: 4 July 2014
© Springer-Verlag London 2014
Abstract Hindi ranks fourth in the world by number of
speakers. In spite of that, it has less than 0.1 % presence on the web
owing to a lack of competent lexical resources, a key reason
behind the digital gap caused by the language barrier among the Indian
masses. In the footsteps of the renowned lexical resource
English WordNet, 18 Indian languages have begun building
WordNets under the project Indo WordNet. India is a
multilingual country with around 122 languages and 234
mother tongues. Many Indian languages still do not have
any reliable lexical resource, and the coverage of numerous
WordNets in progress is still far short of the average value of
25,792. The tedious manual process and high cost are the
major reasons behind the unsatisfactory coverage and limping
progress. In this paper, we discuss the socio-cultural and
economic impact of providing Internet accessibility and
present an approach for the automatic generation of
WordNets to tackle the lack of competent lexical resources.
Problems such as accuracy, the association of language-specific
glosses/examples and incorrect back-translations, which
arise when deviating from the traditional approach of compilation
by lexicographers, are resolved by utilising the
Wikipedia editions available for Indian languages.
Keywords Computational lexicon · Indian
languages · Wikipedia · WordNet · Statistical
methods
1 Introduction
The expansion of the Internet has interconnected the socio-economic
environment of the world and redefined the concept
of global culture. Unlike the traditional office model, the Internet
provides instant information flow, access to remote markets
and reduced cost. The contribution of the Internet to the world's
economy has reached 70 % of global GDP, a staggering
value in comparison to the GDPs of even advanced economies
(Manyika and Roxburgh 2011). At present, India, the second
most populous country in the world, has a mere 13 % Internet
penetration. Issues such as speed, affordability, lack of e-skills
and unavailability of functional infrastructure are the reasons
most often cited for the digital divide among the
masses. However, the most fundamental and neglected reason
behind the inability to access the web is a lack of familiarity with the
interface language. Two out of every three persons in the
world do not have access to the Internet, and 4.3 billion of them
are residents of developing countries, belonging to the poorest
strata of society with sparse familiarity with English. India,
the second-largest English-speaking nation, has only 10 % of its
population as English speakers. Hindi ranks fourth worldwide
with 260 million speakers, i.e. just after English. It is the most
spoken language in India, with more than 41 % of India's
population as speakers. In spite of that, 56.1 % of websites on the
Internet are in English, whereas the percentage of
websites with content in Hindi is below 0.1 %.¹ According to
A. Jain (✉) · S. Rai
Department of Computer Science and Engineering, Ambedkar
Institute of Advanced Communication Technologies and
Research, Govt. of NCT of Delhi, Delhi, India
e-mail: amitajain@aiactr.ac.in
D. K. Tayal
Department of Computer Science and Engineering, Indira
Gandhi Delhi Technical University for Women, Delhi, India
1 Content languages survey on 28th April, 2014: www.w3techs.com/technologies/overview/content_language/all
AI & Soc
DOI 10.1007/s00146-014-0548-5