JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 2, ISSUE 1, JULY 2010
© 2010 JCSE http://sites.google.com/site/jcseuk/

Identification of Proverbs in Hindi Text Corpus and their Translation into Punjabi

Brahmaleen K. Sidhu, Arjan Singh and Vishal Goyal

Abstract: Hindi, the official language of India, is spoken by over 500 million people around the world. Punjabi is an Indo-Aryan language spoken by inhabitants of the historical Punjab region in Pakistan and north-western India. Punjabi, the official language of the Indian state of Punjab, is spoken as a native language by over 2.85% of Indians. This paper describes an approach to searching for proverbs in a Hindi text corpus, followed by their translation and transliteration into Punjabi. The inflected forms of proverbs are also identified.

Index Terms: Computational Linguistics, Hindi Proverbs, Machine Translation System, Natural Language Processing, Transliteration.

1 INTRODUCTION

Both Hindi and Punjabi originated from Sanskrit, one of the oldest languages. In terms of speakers, Hindi is the third most widely spoken language and Punjabi the twelfth most widely spoken language in the world [21]. Hindi is spoken throughout India. Punjabi is used mostly in northern India and in some areas of Pakistan, as well as in the UK, Canada and the USA. The script of Hindi is Devanagari and that of Punjabi is Gurmukhi.

A proverb, also called a byword, adage or nayword, is defined as a short, concrete saying that is often repeated. Such statements usually express a truth of some kind, which may be philosophical, spiritual or practical. The word proverb is said to originate from the Latin word proverbium, meaning concrete statement. Proverbs may be defined as words that, collocated together, become fossilized and fixed over time [1].
Within corpus linguistics, a collocation is defined as a sequence of words or terms that co-occur more often than would be expected by chance [2]. Proverbs can be classified into various types, such as the metaphorical proverb, the maxim and the aphorism. A proverb that describes a basic rule of conduct is known as a maxim. If a proverb is distinguished by particularly good phrasing, it may be known as an aphorism [3]. Many writers use proverbs to enhance their work, or simply to make concrete what they wish to say.

A proverb is a short, pithy saying in general use, held to embody a general truth. Proverbs are also called popular sayings. In Hindi a proverb is called a Kahavat (or Kahawat). Some of the most popular Hindi proverbs, with their meanings, are as follows:

बंदर क्या जाने अदरक का स्वाद
Bandar kya jaane adrak ka swaad
English: What does a monkey know of the taste of ginger?
Meaning: Someone who can't understand can't appreciate.

दूर के ढोल सुहावने लगते हैं
Door ke dhol suhavane lagte hain
English: The drums sound better at a distance.
Meaning: We tend to like the things we don't have.

एक और एक ग्यारह होते हैं
Ek aur ek gyarah hote hain
English: One and one make eleven.
Meaning: There is strength in unity.

This paper describes research aimed at automatically identifying proverbs in a Hindi text corpus. The search procedure is followed by translation of the proverbs into Punjabi, i.e., interpreting their meaning and producing an equivalent proverb that communicates the same message. The proverbs are also transliterated into Punjabi. Transliteration is the representation of words in the corresponding characters of another alphabet. This problem falls under the category of Natural Language Processing.

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.
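To make the notion of transliteration defined above concrete, the following minimal sketch maps Devanagari characters to Gurmukhi characters one by one. It is an illustration only and is not the method used in this paper. It assumes that the Unicode Devanagari block (U+0900 to U+097F) and the Gurmukhi block (U+0A00 to U+0A7F) are laid out in parallel, so that a fixed code point offset gives the corresponding character; the function name naive_transliterate is hypothetical.

```python
import unicodedata

# Assumption: the Devanagari and Gurmukhi Unicode blocks are laid out in
# parallel, so a fixed code point offset maps a character to its counterpart.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0A00 - 0x0900


def naive_transliterate(text: str) -> str:
    """Hypothetical helper: character-by-character Devanagari to Gurmukhi.

    Characters outside the Devanagari block, and Devanagari characters whose
    shifted code point is not an assigned Gurmukhi character, are left as-is.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + GURMUKHI_OFFSET)
            # unicodedata.name returns the default ("") for unassigned code points.
            if unicodedata.name(candidate, "").startswith("GURMUKHI"):
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(naive_transliterate("एक और एक ग्यारह होते हैं"))
```

A purely positional mapping such as this ignores spelling conventions that differ between the two scripts, so a practical system would presumably need additional rules for characters that have no direct Gurmukhi counterpart and for conjuncts and vowel signs.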
Natural language generation systems convert information from computer databases into readable human language. NLP overlaps significantly with the field of computational linguistics and is often considered a sub-field of artificial intelligence. The term natural language is used to distinguish human languages (such as English, Hindi or Punjabi) from formal or computer languages (such as C++, Java or LISP) [4]. Natural language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of 'understanding' is one of the major problems in natural-

Brahmaleen K. Sidhu is with the Punjabi University, Patiala, Punjab, India.
Arjan Singh is with the Baba Banda Singh Bahadur College of Engineering, Fatehgarh Sahib, Punjab, India.
Vishal Goyal is with the Punjabi University, Patiala, Punjab, India.