Data Fusion for Effective European Monolingual Information Retrieval

Jacques Savoy

Institut interfacultaire d’informatique, Université de Neuchâtel, Pierre-à-Mazel 7, 2001 Neuchâtel, Switzerland
Jacques.Savoy@unine.ch

Abstract. For our fourth participation in the CLEF evaluation campaigns, our first objective was to propose an effective and general stopword list and a light stemming procedure for the Portuguese language. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in the Finnish and Russian languages. Finally, based on the Z-score method we suggested a data fusion strategy intended to improve monolingual searches in various European languages.

1 Introduction

Making use of experiments we carried out in previous years [1], [2], we are now participating in the French, Finnish, Russian and Portuguese monolingual tasks without relying on dictionaries. Moreover, the IR approaches suggested are fully automatic and used freely available resources. This paper describes the information retrieval models we used in the monolingual tracks and is organized as follows: Section 2 describes our general approach to building stopword lists and stemmers for use with languages other than English. Section 3 evaluates two probabilistic models and five vector-space schemes using five different languages. Section 4 describes and evaluates various data fusion operators that will hopefully improve retrieval effectiveness. Finally, Section 5 depicts our official runs and presents a broad failure analysis.

2 Stopword Lists and Stemming Procedures

In order to define general stopword lists, we first created a list of the top 200 most frequent words found in the various languages, from which some words were removed (e.g., Roma, police, minister, Chirac).
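As a rough illustration of this first step, the following Python sketch counts word forms over a collection, keeps the most frequent ones, and then drops the words flagged as content-bearing. The function name `candidate_stopwords`, the tokenization rule, and the `content_words` argument are assumptions made for illustration only; the text does not specify how the frequency list was computed or how the removed words were selected.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_k=200, content_words=()):
    """Return the top_k most frequent word forms, minus flagged content words.

    Hypothetical helper: tokenization and parameter names are assumptions,
    not the procedure actually used for the published lists.
    """
    counts = Counter()
    for text in documents:
        # Lowercase, then keep runs of letters only (Unicode-aware).
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    excluded = {word.lower() for word in content_words}
    # Take the top_k forms first, then remove the content words from them.
    return [word for word, _ in counts.most_common(top_k) if word not in excluded]


docs = [
    "The police said the minister is in Roma.",
    "The minister and the police met.",
]
print(candidate_stopwords(docs, top_k=3, content_words=["police", "minister"]))
```

With a real collection, `top_k` would be 200 as in the text, and `content_words` would hold entries such as Roma, police, minister or Chirac.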
To this list of very frequent words, we added articles, pronouns, prepositions, conjunctions and very frequently occurring verb forms (e.g., to be, is, has, etc.). We created a new list for the Portuguese language, adding it to last year’s stopword lists [2] (these lists are available at www.unine.ch/info/clef/). For English we used the list provided by the SMART system (571 words), while for the other European languages, our

Published in Lecture Notes in Computer Science 3491, 233-244, 2005, which should be used for any reference to this work.