© 2014, IJARCSSE All Rights Reserved Page | 1060
Volume 4, Issue 5, May 2014 ISSN: 2277 128X
International Journal of Advanced Research in
Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com
Enhanced Version of Punjabi Stemmer Using Synset
Garima Joshi, Lovely Professional University (CSE Dept.), Phagwara-144411, Punjab
Kamal Deep Garg, Assistant Professor (CSE Dept.), Lovely Professional University, Phagwara-144411, Punjab
Abstract- Stemming is the automatic removal of affixes from a word without performing a complete
morphological analysis. In this process, words that share the same stem are reduced to a common form. Stemming is
the first phase of many information retrieval tasks, such as text summarization and word sense disambiguation.
It is also used in search engine optimization to reduce query processing time. This paper presents an
Enhanced Punjabi Stemmer based on a hybridization of the two major algorithms used in Punjabi stemmers so
far: the lookup table algorithm and the rule based algorithm for suffix removal. A synset approach is also
incorporated so that the stemmer returns the list of words that share the same meaning as the valid stem word. A
large database will be used to improve the accuracy of the stemmer.
Index Terms - Stemmer, Synset, Disambiguation, Suffix Removal, Punjabi Stemmer
I. INTRODUCTION
Stemming is defined as the process of reducing an inflected word to its stem, base or root form. Any Natural Language
Processing system requires a stemmer at the very first stage. The basic goal of a stemmer is to standardize words
by reducing them to their base form. For example, stemmer, stemming and stemmers are all conflated to the single
root word stem. Stemmers are available for many languages such as English, French and Arabic, and in the last few
years they have been successfully developed for many Indian languages such as Punjabi, Hindi, Marathi and Bengali.
The first paper on stemming was published in 1968 by Julie Beth Lovins. A later stemmer, written by Martin Porter,
was published in July 1980 (Willett, P., 2006). It was very widely used and became the standard algorithm for
English stemming. Most stemming algorithms fall into three categories: affix removal, statistical and mixed
algorithms. Affix removal stemmers apply a set of rules to each word in order to remove known prefixes or suffixes.
Lovins' 1968 algorithm was the first of this kind, and further affix removal algorithms were suggested later.
Porter's algorithm became the most frequently used of them, and Porter also developed the stemming framework
Snowball.
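The affix removal idea described above can be sketched in a few lines of Python. This is only a minimal illustration: the suffix list, the minimum stem length and the function name strip_suffix are assumptions made for this example and are not the rules of the Lovins or Porter algorithms.

```python
# Toy suffix list (an assumption for illustration), longest suffixes first
# so that "ers" is tried before "er" and "s".
SUFFIXES = ("ers", "ing", "er", "s")
MIN_STEM_LEN = 3  # do not strip if fewer than 3 letters would remain


def strip_suffix(word):
    """Remove the first matching suffix and undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            stem = word[:-len(suffix)]
            # Undo a doubled final consonant, e.g. "stemm" -> "stem".
            if (len(stem) > MIN_STEM_LEN and stem[-1] == stem[-2]
                    and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word  # no rule matched; the word is returned unchanged


for w in ("stemmer", "stemming", "stemmers"):
    print(w, "->", strip_suffix(w))  # each word is reduced to "stem"
```

Such a small rule set will over-stem or under-stem many words; practical stemmers use far larger rule sets with conditions attached to each rule.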
Stemmers can be broadly classified into two types:
Language Dependent Stemmers
Language Independent Stemmers
Language Dependent Stemmers: A language dependent stemmer is built for a specific language and is applicable
only to the language for which it is designed.
For e.g.: MAULIK is an effective stemmer designed only for the Hindi language, so it is a language dependent
stemmer.
Language Independent Stemmers: Language independent stemmers do not depend on a specific language. They are
designed so that they can stem words of any language.
For e.g.: Successor variety algorithms are language independent.
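To show why a successor variety approach needs no language-specific rules, the following Python sketch counts, for every prefix of a word, how many distinct letters follow that prefix in a vocabulary, and cuts the word at the first peak. The function names, the small English vocabulary and the first-peak cut-off are assumptions chosen for illustration. Other segmentation criteria exist.

```python
def successor_varieties(word, vocabulary):
    """Count, for each proper prefix of `word`, the distinct letters that
    follow that prefix anywhere in `vocabulary`."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in vocabulary
                      if len(w) > i and w.startswith(prefix)}
        counts.append(len(successors))
    return counts


def stem_at_first_peak(word, vocabulary):
    """Cut `word` after the first prefix whose successor variety is higher
    than that of both neighbouring prefixes; return the word unchanged if
    no such peak exists."""
    counts = successor_varieties(word, vocabulary)
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            return word[:i + 1]  # counts[i] belongs to the prefix of length i + 1
    return word


# Toy vocabulary (an assumption for this example).
VOCABULARY = ["able", "ape", "beatable", "fixable", "read", "readable",
              "reading", "reads", "red", "rope", "ripe"]

print(successor_varieties("readable", VOCABULARY))  # [3, 2, 1, 3, 1, 1, 1]
print(stem_at_first_peak("readable", VOCABULARY))   # read
```

The only input is a word list, so the same code could be run on a vocabulary of any language. The quality of the result depends on the size of that vocabulary.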
II. BACKGROUND AND RELATED WORK
The earliest English stemmer was developed by Julie Beth Lovins in 1968. The Porter stemming algorithm (Martin
Porter, 1980), published later, is the most widely used algorithm for English stemming. These stemmers are
rule based and are best suited to less inflected languages like English.
(Mudassar M. Majgaonker, 2010) proposed and evaluated a rule based stemmer for the Marathi language. The rule
based approach uses a set of suffix removal rules along with an unsupervised approach based on n-gram splitting,
which automatically learns suffixes from words extracted from Marathi text. The maximum accuracy achieved for
this stemmer is 82.5%. [6]
(Vishal Gupta et al., 2011) proposed a basic Punjabi stemmer for nouns and proper names. Stemming follows a
rule based approach in which different rules are defined for different suffixes. When the suffix of a word
matches a defined suffix rule, that rule is fired: the suffix is removed and, if required, substituted to
obtain the stem word. [10]
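The match, remove and substitute steps described above can be sketched as follows. The two rules and the function name apply_rules are assumptions chosen for illustration from common Punjabi plural forms. They are not the rule set of the cited stemmer.

```python
# Each rule is (suffix, replacement); an empty replacement would mean plain
# removal. Longer suffixes come first so that they are matched first.
# Illustrative rules only (an assumption, not the rules of the cited paper).
RULES = [
    ("ੀਆਂ", "ੀ"),  # e.g. ਕੁੜੀਆਂ (girls) -> ਕੁੜੀ (girl)
    ("ੇ", "ਾ"),     # e.g. ਮੁੰਡੇ (boys) -> ਮੁੰਡਾ (boy)
]


def apply_rules(word, rules=RULES):
    """Fire the first rule whose suffix matches the end of `word`."""
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            # Remove the matched suffix, then append its substitute.
            return word[:-len(suffix)] + replacement
    return word  # no rule fired; the word is treated as its own stem


for w in ("ਕੁੜੀਆਂ", "ਮੁੰਡੇ", "ਘਰ"):
    print(w, "->", apply_rules(w))
```

The order of the rules matters: if a short suffix were tested before a longer one that contains it, the longer rule would never fire.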
(Upendra Mishra, 2012) proposed an effective stemmer for Hindi named “MAULIK”. This stemmer is based on
a hybrid approach that combines the suffix removal algorithm with the brute force algorithm in order to assist the
task of information retrieval. [7]
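A hybrid of this kind can be sketched as a table lookup followed by a suffix removal fallback. The sketch below uses a toy English table and suffix list purely as assumptions for illustration. It is not the MAULIK implementation, and the name hybrid_stem is hypothetical.

```python
# Brute force part: a table that maps known word forms directly to stems.
# Toy entries (an assumption for this example).
LOOKUP = {"went": "go", "mice": "mouse", "better": "good"}

# Suffix removal part: used only when the word is not in the table.
SUFFIXES = ("ing", "ed", "s")


def hybrid_stem(word, lookup=LOOKUP, suffixes=SUFFIXES, min_len=3):
    """Return the table entry if present, else strip a known suffix."""
    word = word.lower()
    if word in lookup:
        return lookup[word]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[:-len(suffix)]
    return word  # neither part applies; keep the word as it is


for w in ("went", "walked", "walking", "is"):
    print(w, "->", hybrid_stem(w))
```

The table handles irregular forms that no suffix rule can derive, while the rules cover regular forms that are missing from the table. This is the usual motivation for combining the two.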