Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011, pages 35–39, Chiang Mai, Thailand, November 8, 2011.

Punjabi Language Stemmer for nouns and proper names

Vishal Gupta
Assistant Professor, UIET, Panjab University Chandigarh
Vishal_gupta100@yahoo.co.in

Gurpreet Singh Lehal
Professor, Department of Computer Science, Punjabi University Patiala
gslehal@yahoo.com

Abstract

This paper concentrates on stemming of Punjabi nouns and proper names. The purpose of stemming is to obtain the stem or radix of words that are not found in the dictionary. If the stemmed word is present in the dictionary, it is a genuine word; otherwise it may be a proper name or an invalid word. In Punjabi stemming for nouns and proper names, an attempt is made to obtain the stem or radix of a Punjabi word, which is then checked against a Punjabi noun and proper name dictionary. An in-depth analysis of a Punjabi news corpus was carried out, and various possible noun suffixes were identified, such as ੀਆਂ īāṃ, ਿਆਂ iāṃ, ੂਆਂ ūāṃ, ਾਂ āṃ and ੀਏ īē, from which the rules for noun and proper name stemming were generated. The Punjabi stemmer for nouns and proper names is applied to Punjabi text summarization. The efficiency of the Punjabi noun and proper name stemmer is 87.37%.

1 Introduction

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not itself a valid root. A stemmer for English, for example, should identify the string cats (and possibly catlike, catty, etc.) as based on the root cat, and stemmer, stemming and stemmed as based on stem. A stemming algorithm reduces the words fishing, fished, fish and fisher to the root word fish.
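As a rough illustration of the idea, the conflation of fishing, fished and fisher to fish, combined with the dictionary check described in the abstract, can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the suffix list, the toy dictionary and the function names `strip_suffix` and `stem_and_check` are hypothetical and are not the rules or lexicon used by the stemmer in this paper.

```python
# Illustrative sketch only: the suffix list and the dictionary are toy
# assumptions, not the rules or the lexicon used in the paper.
SUFFIXES = ["ing", "ed", "er", "s"]    # hypothetical English suffixes
DICTIONARY = {"fish", "cat", "stem"}   # hypothetical noun dictionary


def strip_suffix(word, suffixes=SUFFIXES):
    """Remove the longest matching suffix, keeping at least two characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def stem_and_check(word, dictionary=DICTIONARY):
    """Return (stem, found); found tells whether the stem is a dictionary word."""
    if word in dictionary:
        # Words already in the dictionary need no stemming.
        return word, True
    stem = strip_suffix(word)
    return stem, stem in dictionary


if __name__ == "__main__":
    for w in ["fishing", "fished", "fish", "fisher", "Lovins"]:
        print(w, stem_and_check(w))
```

With these toy inputs, fishing, fished and fisher all map to the dictionary word fish, while Lovins yields a stem that is absent from the dictionary, which is the case the abstract treats as a possible proper name or invalid word. Plain suffix stripping of this kind does not handle spelling changes such as the doubled consonant in stemming.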
Stemming is an operation that conflates morphologically similar terms into a single term without performing complete morphological analysis. Stemming (Haidar et al., 2006) is used in information retrieval systems to improve performance. Additionally, this operation reduces the number of terms in the information retrieval system, thus decreasing the size of the index files.

In Punjabi language stemming (Mandeep et al., 2009) for nouns and proper names, an attempt is made to obtain the stem or radix of a Punjabi word, which is then checked against the Punjabi noun morph and the proper names list. An in-depth analysis of a Punjabi news corpus was carried out, and various possible noun suffixes were identified, such as ੀਆਂ īāṃ, ਿਆਂ iāṃ, ੂਆਂ ūāṃ, ਾਂ āṃ and ੀਏ īē, from which the rules for noun and proper name stemming were generated. The Punjabi stemmer for nouns and proper names is applied to Punjabi text summarization. Text summarization is the process of condensing the source text into a shorter version. Sentences that contain Punjabi nouns or proper names are considered important.

2 Background and Related Work

The earliest English stemmer was developed by Julie Beth Lovins in 1968. The Porter stemming algorithm (Porter, 1980), which was published later, is perhaps the most widely used algorithm for English stemming. Both of these stemmers are rule based and are best suited to less inflectional languages such as English. Goldsmith (2001) proposed an algorithm for learning the morphology of a language based on the minimum description length (MDL) framework, which focuses on representing the data as compactly as possible. Creutz (2005) uses a probabilistic maximum a posteriori (MAP) formulation for morpheme segmentation.

Not much work has been reported on stemming for Indian languages compared with English and other European languages.
The earliest work, reported by Ramanathan and Rao (2003), used a hand-crafted suffix list and performed longest-match stripping to build a Hindi stemmer. Majumder et al. (2007) developed a statistical approach, YASS (Yet Another Suffix Stripper), which uses clustering based on string distance measures and requires no linguis-