JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 2, ISSUE 1, JULY 2010
© 2010 JCSE
http://sites.google.com/site/jcseuk/
Identification of Proverbs in Hindi Text Corpus
and their Translation into Punjabi
Brahmaleen K. Sidhu, Arjan Singh and Vishal Goyal
Abstract— Hindi, the official language of India, is spoken by over 500 million people worldwide. Punjabi is an Indo-Aryan
language spoken by inhabitants of the historical Punjab region in Pakistan and northwestern India. Punjabi, the official
language of the Indian state of Punjab, is spoken as a native language by over 2.85% of Indians. This paper describes an
approach to searching for proverbs in a Hindi text corpus, followed by their translation and transliteration into Punjabi.
Inflected forms of proverbs are also identified.
Index Terms— Computational Linguistics, Hindi Proverbs, Machine Translation System, Natural Language Processing,
Transliteration.
—————————— ——————————
1 INTRODUCTION
Both Hindi and Punjabi originated from Sanskrit, which is
one of the oldest languages. In terms of number of speakers,
Hindi is the third most widely spoken language and Punjabi
is the twelfth most widely spoken language in the world [21].
Hindi is spoken and used by people all over India. Punjabi
is mostly used in northern India and in some areas of
Pakistan, as well as in the UK, Canada and the USA. Hindi is
written in the Devanagari script and Punjabi in the
Gurmukhi script.
A proverb, also called a byword, adage or nayword, is
defined as a short, concrete saying that is often
repeated. Such statements usually express a truth of
some kind, which may be philosophical, spiritual or
practical. The word proverb is said to originate from the
Latin word proverbium, meaning concrete statement.
Proverbs may be defined as words that, collocated together,
become fossilized and fixed over time [1]. Within the area
of corpus linguistics, collocation is defined as a sequence
of words or terms which co-occur more often than would be
expected by chance [2].
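The idea of words co-occurring "more often than would be expected by chance" can be made concrete with pointwise mutual information (PMI), one common association measure. The following is a minimal sketch under stated assumptions, not the method used in this paper: the function name pmi_scores and the romanized toy token list are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return the PMI of every adjacent word pair found in `tokens`.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    A positive score means the pair occurs together more often than
    it would if the two words were independent.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus (romanized), far too small for real statistics.
toy_tokens = "door ke dhol suhavane lagte hain aur door ke dhol bajte hain".split()
for pair, score in sorted(pmi_scores(toy_tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On a realistic corpus, word pairs with high PMI and sufficient frequency are candidates for fixed expressions; a proverb would appear as a longer run of such strongly associated words.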
Proverbs can be classified into various types, such as
the metaphorical proverb, the maxim and the aphorism. A
proverb that describes a basic rule of conduct is known as
a maxim. If a proverb is distinguished by particularly good
phrasing, it may be known as an aphorism [3]. Many writers
use proverbs to enhance their work, or simply to make what
they wish to say more concrete.
A proverb is a short pithy saying in general use, held
to embody a general truth. Proverbs are also called popular
sayings. In Hindi, a proverb is called Kahavat or Kahawat.
Some of the most popular Hindi proverbs, with their
meanings, are as follows:
बंदर क्या जाने अदरक का स्वाद
Bandar kya jaane adrak ka swaad
English: What does a monkey know of the taste of ginger?
Meaning: Someone who can't understand can't appreciate.
दूर के ढोल सुहावने लगते हैं
Door ke dhol suhavane lagte hain
English: The drums sound better at a distance.
Meaning: We tend to like the things we don't have.
एक और एक ग्यारह होते हैं
Ek aur ek gyarah hote hain
English: One and one make eleven.
Meaning: There is strength in unity.
This paper describes research work aimed at automatically
identifying proverbs in a Hindi text corpus. The search
procedure is followed by translation of the proverbs into
Punjabi, i.e., interpreting their meaning and producing an
equivalent proverb that communicates the same message. The
proverbs are also transliterated into Punjabi.
Transliteration is the representation of words in the
corresponding characters of another alphabet. This problem
comes under the category of Natural Language Processing.
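Character-level transliteration from Devanagari to Gurmukhi can be illustrated with a small sketch. It relies on the fact that the Unicode Devanagari block (U+0900 to U+097F) and Gurmukhi block (U+0A00 to U+0A7F) share a parallel layout, so many characters map by a fixed offset. This is only an approximation under that assumption, not the transliteration procedure of this paper: the function name transliterate is hypothetical, and a practical system would need further rules (for example for nasalization signs and conjuncts).

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0100  # the Gurmukhi block starts at U+0A00


def transliterate(text):
    """Naively map Devanagari characters to Gurmukhi by code point offset.

    Characters outside the Devanagari block, and Devanagari characters
    whose shifted code point is unassigned in Gurmukhi, are left as-is.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + GURMUKHI_OFFSET)
            # unicodedata.name returns the default ("") for unassigned code points.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)


# The first word of the second example proverb, "door" (far).
print(transliterate("दूर"))
```

Because the mapping is purely positional, the output is a transliteration of the Hindi words and not a translation; producing the equivalent Punjabi proverb is a separate step.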
Natural language processing (NLP) is a field of computer
science and linguistics concerned with the interactions
between computers and human (natural) languages.
Natural language generation systems convert information
from computer databases into readable human language.
NLP has significant overlap with the field of computational
linguistics, and is often considered a sub-field of
artificial intelligence. The term natural language is used
to distinguish human languages (such as English, Hindi,
or Punjabi) from formal or computer languages (such as
C++, Java or LISP) [4]. Natural language recognition
seems to require extensive knowledge about the outside
world and the ability to manipulate it. The definition of
'understanding' is one of the major problems in natural-
————————————————
Brahmaleen K. Sidhu is with the Punjabi University, Patiala, Punjab,
India.
Arjan Singh is with the Baba Banda Singh Bahadur College of Engineering,
Fatehgarh Sahib, Punjab, India.
Vishal Goyal is with the Punjabi University, Patiala, Punjab, India.