Statistical vs. Rule-Based Stemming for Monolingual French Retrieval

Prasenjit Majumder 1, Mandar Mitra 1, Kalyankumar Datta 2
1 CVPR Unit, Indian Statistical Institute, Kolkata
2 Dept. of EE, Jadavpur University, Kolkata
{prasenjit t,mandar}@isical.ac.in, kalyandatta@debesh.wb.nic.in

Abstract

This paper describes our approach to the 2006 Adhoc Monolingual Information Retrieval run for French. The goal of our experiment was to compare the performance of a proposed statistical stemmer with that of a rule-based stemmer, specifically the French version of Porter’s stemmer. The statistical stemming approach is based on lexicon clustering, using a novel string distance measure. We submitted three official runs, besides a baseline run that uses no stemming. The results show that stemming significantly improves retrieval performance (as expected) by about 9-10%, and the performance of the statistical stemmer is comparable with that of the rule-based stemmer.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Performance, Experimentation

Keywords

statistical stemming, string distance, clustering, Porter’s algorithm, monolingual information retrieval

1 Introduction

We have recently been experimenting with languages that have not been studied much from the IR perspective. These languages are typically resource-poor, in the sense that few language resources or tools are available for them. As a specific example, no comprehensive stemming algorithms are available for these languages. The stemmers that are available for more widely studied languages (e.g. English) usually make use of an extensive set of linguistic rules. Rule-based stemmers for most resource-poor languages are either unavailable or lack comprehensive coverage.
In earlier work, therefore, we have looked at the problem of stemming for such resource-poor languages, and proposed a stemming approach that is based on purely unsupervised clustering techniques.