Details on Stemming in the Language Modeling Framework

James Allan and Giridhar Kumaran
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003, USA
{allan,giridhar}@cs.umass.edu

CIIR Technical Report No. IR-289

ABSTRACT

We incorporate stemming into the language modeling framework. The work is suggested by the notion that stemming increases the number of word occurrences used to estimate the probability of a word (by including the members of its stem class). As such, stemming can be viewed as a type of smoothing of probability estimates. We show that this view of stemming leads to a simple incorporation of ideas from corpus-based stemming. We also present two generative models of stemming: the first generates terms and then variant stems; the second generates stem classes and then a member. All models are evaluated empirically, though there is little difference between the various forms of stemming.

1. INTRODUCTION

Stemming is the process of collapsing words into their morphological root. For example, the terms addicted, addicting, addiction, addictions, addictive, and addicts might all be conflated to their stem, addict. In information retrieval (IR) systems, stemming serves one or both of two goals:

- efficiency: limiting the number of unique words reduces the size of an IR system's dictionary, can improve compression rates [11], and so on.

- effectiveness: in theory, stemming improves a system's recall of relevant material, since documents that contain morphological variants of query words have a good chance of also being relevant.

Over the years, numerous studies and countless classroom projects have explored the effectiveness of stemming from almost every angle imaginable: should stemming be used at all, how can stemmers be improved, what advantages do different stemmers provide, how can stemming be done in new languages, and so on.

In this study, we step back somewhat from that style of experiment and explore the question of what precisely stemming accomplishes. We are motivated by an observation, first expressed to us by Jay Ponte at a May/June 2001 workshop on language modeling for information retrieval, that stemming can be viewed as a form of smoothing, that is, as a way of improving statistical estimates. If the observation is correct, then it may make sense to incorporate stemming directly into a language model rather than treating it as an external process to be ignored or used as a pre-processing step. Further, viewing stemming in this light may illuminate some of its properties or suggest alternate ways that stemming could be used.
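To make the smoothing intuition concrete, consider the following minimal sketch (our illustration, not the paper's implementation; the toy stem class, document, and function names are invented for exposition). It contrasts a maximum-likelihood estimate of a word's probability in a document with an estimate that pools counts over the word's stem class.

```python
from collections import Counter

# Hypothetical stem class: every surface form maps to its stem.
# A real system would derive such classes with a stemmer (e.g., Porter's).
STEM = {
    "addict": "addict", "addicted": "addict", "addicting": "addict",
    "addiction": "addict", "addictions": "addict",
    "addictive": "addict", "addicts": "addict",
}

def p_mle(word, doc):
    """Maximum-likelihood estimate of p(word | doc) from exact word counts."""
    return Counter(doc)[word] / len(doc)

def p_stem_class(word, doc):
    """Estimate p(word | doc) by pooling counts over word's stem class.

    Equivalently, both the query word and the document tokens are mapped
    to their stems before counting.  A variant that never occurs in the
    document can still receive nonzero probability, which is the sense
    in which stemming acts as a form of smoothing."""
    target = STEM.get(word, word)
    hits = sum(1 for token in doc if STEM.get(token, token) == target)
    return hits / len(doc)

doc = "the addicts sought treatment for their addiction".split()
print(p_mle("addicted", doc))         # 0.0    -- the exact form is absent
print(p_stem_class("addicted", doc))  # 0.2857 -- 2 of 7 tokens are in the class
```

Whether the pooled estimate actually improves retrieval is precisely the empirical question the experiments below take up.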
In this study, we tackle that problem, first by converting classical stemming into a language modeling framework and then by showing that, used that way, it really does look like other types of smoothing. This view of stemming will suggest an obvious extension that begs to be merged with ideas from corpus-based stemming.

Once the idea of stemming is embedded in the language modeling framework, other ways of including it start suggesting themselves. We will briefly touch on two ways of viewing stemming as a generative process rather than merely a technique for improving statistical estimates.

The focus of this work is on a different way to view stemming. However, we will also report on a series of experiments that evaluate the effectiveness of the different models along the way. The experiments will show modest, but rarely statistically significant, improvements in comparison to the simplest form of stemming. All forms of stemming will result in better accuracy than omitting stemming.

The following section reviews related work in stemming. In Section 3 we briefly review some ideas from language modeling, probability estimation, and smoothing that are central to this paper. We then describe in Section 4 the experimental setup in which we carried out our empirical validations. The core of the paper starts in Section 5, where we incorporate stemming into the language modeling framework; in Section 6 we briefly flirt with the idea of partial stemming that doing so suggests. We then show in Section 7 how stemming can be treated as a form of smoothing, which leads to the idea, discussed in Section 8, of allowing different words to contribute differently to the smoothing of a word's probability. In Section 9 we switch gears to develop and evaluate two generative models of stemming, one of which is similar in spirit to a translation model. We summarize our findings in Section 10.

2. PREVIOUS RESEARCH