International Journal of Computer Applications (0975 – 8887) Volume 94 – No 13, May 2014

An Improved Rule based Iterative Affix Stripping Stemmer for Tamil Language using K-Mean Clustering

M.Kasthuri, Asst.Professor, Dept. of Com.Science, Bishop Heber College (Autonomous), Tiruchirappalli, Tamil Nadu, India
S.Britto Ramesh Kumar, Asst.Professor, Dept. of Com.Science, St.Joseph’s College (Autonomous), Tiruchirappalli, Tamil Nadu, India

ABSTRACT
Stemming is an important step in many Information Retrieval (IR) and Natural Language Processing (NLP) tasks. It is usually done by removing any attached suffixes and prefixes (affixes) from index terms before the term is actually assigned to the index. Stemming is a pre-processing step in Text Mining applications and a basic requirement in areas such as computational linguistics and information retrieval, where it is used to improve recall performance. This paper proposes an improved rule based iterative affix stripping algorithm that obtains the stemmed Tamil word in fewer computational steps. Further, the K-Means clustering algorithm is utilized to cluster the stemmed Tamil words in order to improve the performance of Tamil language Information Retrieval and Extraction. The experimental analysis clearly shows that the words stemmed after clustering give better results than the words stemmed before clustering.

Keywords
Tamil morphology; Transliteration; Tamil stemmer; Improved affix stemmer; Natural Language Processing

1. INTRODUCTION
A stemmer is a process that attempts to map a derived form of a word to its root. For example, words such as tests, tested and testing are all reduced to the stem word “test”. Stemming plays an important role in improving the performance of an Information Retrieval System [1]. For example, when the user enters the query word computing, the user most likely wants to retrieve documents containing the terms computer and computation as well.
Thus, a stemmer improves recall performance and also reduces the size of the index: since many terms are mapped to one, the morphological variants of a word need not all be indexed. This is especially true of a morphologically rich language like Tamil, where a single word may take many forms. The aim of the stemming algorithm is to ensure that related words are mapped to a common stem. In recent years, stemmers have been developed and evaluated for various Indian languages such as Hindi, Gujarati, Punjabi, Bengali, Urdu, Marathi, Malayalam and Kannada [1-7]. This paper proposes an improved iterative rule based affix stripping stemmer for the Tamil language combined with the K-Means clustering technique. An overview of the proposed model is shown in Figure 1.

2. RELATED WORK
Stemmers were first developed for the English language, and demand from the research community for stemmers for other languages grew afterwards. Such studies on Indian languages, however, are quite limited. The earliest reported work is by Ramanathan and Rao [1], who used longest match stripping to build a Hindi stemmer. Juhi Ameta et al. developed a light stemmer for the Gujarati language [2] in 2011 that removes inflectional and derivational endings. Similar research work then started for other languages such as Bengali [4], Urdu [5], Malayalam [7] and Punjabi [3]. Vivek Anandan Ramachandran and Ilango Krishnamurthi [15] published an iterative suffix stripping stemming algorithm for Tamil. Steinbach et al. compared document clustering techniques [17] with the aim of improving English document clustering for use in information retrieval systems. In 2013, M.Thangaraju et al. and Dr.R.Manavalan reported work on suffix stripping stemming with clustering analysis [16, 17].
The present work, in contrast, reports the research experience of developing an improved iterative rule based affix stripping Tamil stemmer combined with a clustering technique.

3. TAMIL LANGUAGE
Tamil is a Dravidian language spoken predominantly by the Tamil people of the Indian subcontinent. Tamil words have more derivational forms than English words. A Tamil word consists of a stem word attached to zero or more derivational prefixes and zero or one suffix, which together form a word. Tamil is a morphologically rich language and therefore has a very large number of inflectional forms. Normally, most Tamil words have more than one morphological suffix; the number of suffixes ranges from 3 to 13. Tamil is an agglutinative language: one or more affixes are attached to the Tamil lexical root word. Most Tamil affixes are suffixes. Tamil suffixes can be derivational or inflectional. Derivational suffixes change either the part of speech of the word or its meaning. Inflectional suffixes are attached at the end of the root word. The proposed stemming algorithm for Tamil is used to strip the extra constituents attached to a word as prefixes and suffixes, and to map the word to a stem corresponding to the root word.

4. PRE-PROCESSING
When setting up an Information Retrieval System (IRS), it has been traditional to discard stop words during indexing as a pre-processing step. The stop word list interacts with the stemming algorithm in various ways. The stemming algorithm can itself be used to detect and remove stop words. Alternatively, stop words can be removed before the stemming algorithm is applied. A stemming algorithm is a process of linguistic normalization that strips unwanted constituents attached as a prefix or suffix to the stem word. This research work mainly deals with the problem of plural resolution in the Tamil language.
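
The pre-processing order described above (stop word removal first, then iterative stripping) can be illustrated with a minimal Python sketch. It is not the proposed algorithm: the stop word list, the single plural rule, the minimum stem length and the function names `strip_suffixes` and `preprocess` are all assumptions made for illustration, and only suffixes are handled, not prefixes.

```python
# Illustrative sketch only. The stop word list and the rule table below are
# hypothetical and deliberately tiny; they are not the lists used in the paper.
STOP_WORDS = {"ஒரு", "மற்றும்", "இந்த"}   # "a/one", "and", "this"
SUFFIX_RULES = ["கள்"]                    # Tamil plural marker only
MIN_STEM_LENGTH = 2                       # assumed guard, counted in code points


def strip_suffixes(word, rules=SUFFIX_RULES, min_len=MIN_STEM_LENGTH):
    """Repeatedly strip the longest matching suffix until no rule applies."""
    ordered = sorted(rules, key=len, reverse=True)  # try longest rules first
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Strip only if enough of the word remains to serve as a stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                changed = True
                break  # restart the scan on the shortened word
    return word


def preprocess(tokens):
    """Discard stop words, then stem every remaining token."""
    return [strip_suffixes(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess(["இந்த", "நாய்கள்", "மற்றும்", "வீடுகள்"]))
```

In this sketch the stop words are dropped before stemming, and the plural forms நாய்கள் (dogs) and வீடுகள் (houses) are reduced to நாய் and வீடு. Plain string stripping does not undo sound changes at the stem boundary, for example மரங்கள் (trees) from மரம், so a fuller rule set would need additional handling for such forms.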