International Journal of Advanced Research in Computer Science and Software Engineering
Volume 4, Issue 5, May 2014, ISSN: 2277 128X
Research Paper
Available online at: www.ijarcsse.com
© 2014, IJARCSSE All Rights Reserved Page | 1060

Enhanced Version of Punjabi Stemmer Using Synset

Garima Joshi
(CSE Dept.) Lovely Professional University, Phagwara-144411, Punjab

Kamal Deep Garg
Assistant Professor (CSE Dept.), Lovely Professional University, Phagwara-144411, Punjab

Abstract- Stemming is the automatic removal of affixes from a word without performing a complete morphological analysis. In this process, words that share the same stem are reduced to a common form. Stemming is the very first phase of any information retrieval task such as text summarization or word sense disambiguation. It is also used in search engine optimization to reduce query processing time. This paper presents the Enhanced Punjabi Stemmer, which is based on a hybridization of the two major algorithms used in Punjabi stemmers so far: the look up table algorithm and the rule based algorithm for suffix removal. A synset approach is also incorporated in the stemmer so that it returns the list of words that share the same meaning with the valid stem word. A large database will be used to improve the accuracy of the stemmer.

Index Terms - Stemmer, Synset, Disambiguation, Suffix Removal, Punjabi Stemmer

I. INTRODUCTION

Stemming is defined as the process of reducing an inflected word to its stem, base or root form. Any Natural Language Processing system requires a stemmer at the very first stage. The basic goal of any stemmer is to standardize words by reducing them to their base word. Stemming reduces inflected words to their root forms, which are referred to as stems; for example, stemmer, stemming and stemmers are all conflated to the single root word stem.
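As a minimal illustration of this conflation, the following Python sketch strips a few English suffixes and then undoes consonant doubling. The suffix list and the function name toy_stem are assumptions made only for this example; the sketch is not the Lovins or Porter algorithm, and it is not the stemmer proposed in this paper.

```python
# Toy suffix list, checked longest first (an assumption for illustration only).
SUFFIXES = ("ers", "ing", "er", "s")
VOWELS = "aeiou"


def toy_stem(word: str) -> str:
    """Reduce a few English inflections to a shared stem (toy example)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but keep at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling left behind by the suffix, e.g. "stemm" -> "stem".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]
    return word


for w in ("stem", "stemmer", "stemming", "stemmers"):
    print(w, "->", toy_stem(w))
```

All four inputs map to the same stem, which is the behaviour a retrieval system relies on when it matches a query term against inflected forms in a document.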
Stemmers are available for many languages such as English, French and Arabic, and in the last few years they have been successfully developed for many Indian languages such as Punjabi, Hindi, Marathi and Bengali. The first paper on stemming was published in 1968 by Julie Beth Lovins. A later stemmer was written by Martin Porter and published in July 1980 (Willett, P. (2006)). This stemmer was very widely used and became the standard algorithm for English stemming.

Most stemming algorithms fall into three categories: affix removal algorithms, statistical algorithms and mixed algorithms. Affix removal stemmers apply a set of rules to each word so as to remove the known prefixes or suffixes. The first such algorithm was given by J.B. Lovins in 1968. Later, more affix removal algorithms were suggested. Porter's algorithm, published in 1980, became the most frequently used, and the stemming framework Snowball was also developed by Porter.

Stemmers can be broadly classified into two types: language dependent stemmers and language independent stemmers.

Language Dependent Stemmers: Stemmers that are language dependent are made for a specific language and are applicable only to the language for which they are designed. For example, MAULIK is an effective stemmer designed only for the Hindi language, so it is a language dependent stemmer.

Language Independent Stemmers: Language independent stemmers are those which do not depend on a specific language. They are designed to work across languages, i.e. they can stem words in any language. For example, successor variety algorithms are language independent.

II. BACKGROUND AND RELATED WORK

The earliest English stemmer was developed by Julie Beth Lovins in 1968. The Porter stemming algorithm (Martin Porter, 1980), which was published later, is the most widely used algorithm for English stemming.
These stemmers are rule based and are best suited for less inflectional languages like English.

(Mudassar M. Majgaonker, 2010) proposed and evaluated a rule based stemmer for the Marathi language. The rule based approach uses a set of suffix removal rules together with an unsupervised approach based on n-gram splitting, which automatically learns suffixes from words extracted from Marathi text. The maximum accuracy achieved by this stemmer is 82.5%. [6]

(Vishal Gupta et al., 2011) proposed a basic Punjabi stemmer for proper nouns and proper names. The stemming process is rule based, with different rules defined for different suffixes. When the suffix of a word matches an already defined suffix rule, the corresponding rule is fired: the suffix is removed and, if required, substituted to obtain the stem word. [10]

(Upendra Mishra, 2012) proposed an effective stemmer for Hindi named “Maulik”. This stemmer is based on a hybrid approach that combines the suffix removal algorithm and the brute force algorithm in order to assist the task of information retrieval. [7]
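The rule firing and the hybrid idea described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the two lookup entries, the two suffix rules and the name hybrid_stem are hypothetical examples chosen for illustration and are not taken from the cited stemmers, and validating each candidate stem against the lookup table is one possible way to combine the two approaches, not necessarily the one used in those works.

```python
# Lookup table of valid stems (hypothetical sample entries: "girl", "boy").
LOOKUP = {"ਕੁੜੀ", "ਮੁੰਡਾ"}

# (suffix, replacement) rules, longest suffix first. Illustrative assumptions only.
RULES = [
    ("ੀਆਂ", "ੀ"),  # remove the suffix and substitute, e.g. ਕੁੜੀਆਂ -> ਕੁੜੀ
    ("ੇ", "ਾ"),    # e.g. ਮੁੰਡੇ -> ਮੁੰਡਾ
]


def hybrid_stem(word: str) -> str:
    """Return a stem using a lookup table first, then suffix rules."""
    # Step 1: lookup table. A word that is already a known stem is returned as is.
    if word in LOOKUP:
        return word
    # Step 2: rule based suffix removal with optional substitution.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only if it is a valid stem in the lookup table.
            if candidate in LOOKUP:
                return candidate
    # No rule produced a valid stem, so the word is returned unchanged.
    return word


print(hybrid_stem("ਕੁੜੀਆਂ"))
print(hybrid_stem("ਮੁੰਡੇ"))
```

Checking the candidate against the lookup table trades recall for precision: a rule cannot produce a stem that is not a known word, but words missing from the table are left unstemmed, which is one reason a large database matters for accuracy.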