Does dictionary based bilingual retrieval work in a non-normalized index?

Eija Airio *, Kimmo Kettunen

University of Tampere, Department of Information Studies, Kanslerinrinne 1, FIN-33014, Finland

Article info

Article history:
Received 23 May 2008
Received in revised form 9 May 2009
Accepted 17 May 2009
Available online 13 June 2009

Keywords:
Bilingual retrieval
Non-normalized index
Word form generation
S-gramming

Abstract

Many operational IR indexes are non-normalized, i.e. no lemmatization or stemming techniques, etc. have been employed in indexing. This poses a challenge for dictionary-based cross-language retrieval (CLIR), because translations are mostly lemmas. In this study, we face the challenge of dictionary-based CLIR in a non-normalized index. We test two optional approaches: FCG (Frequent Case Generation) and s-gramming. The idea of FCG is to automatically generate the most frequent inflected forms for a given lemma. FCG has been tested in monolingual retrieval and has been shown to be a good method for inflected retrieval, especially for highly inflected languages. S-gramming is an approximate string matching technique (an extension of n-gramming). The language pairs in our tests were English–Finnish, English–Swedish, Swedish–Finnish and Finnish–Swedish. Both our approaches performed quite well, but the results varied depending on the language pair. S-gramming and FCG performed quite equally in all the other language pairs except Finnish–Swedish, where s-gramming outperformed FCG.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Cross-language retrieval, CLIR, is retrieval across languages: the query language differs from the document language. The query language is called the source language and the document language the target language. When there is only one target language, we are dealing with bilingual IR.
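To make the s-gramming idea from the abstract more concrete, the following minimal Python sketch matches a translated lemma against inflected index terms. It is an illustration under stated assumptions, not the configuration used in this study: it assumes s-grams are character digrams whose two characters are separated by a fixed number of skipped characters (zero skips gives ordinary bigrams), it compares grams only within the same skip class, and it scores candidates with a Jaccard-style overlap. The function names `s_grams`, `similarity` and `best_matches`, the skip setting `(0, 1)`, and the toy term list are all hypothetical.

```python
def s_grams(word, skips=(0, 1)):
    """Return the set of (k, digram) pairs for `word`.

    For each k in `skips`, a digram is built from two characters that
    have exactly k characters between them. k = 0 yields ordinary
    adjacent bigrams. Assumption: this is one common way to define
    s-grams; the paper's exact settings are not given in this section.
    """
    grams = set()
    for k in skips:
        for i in range(len(word) - k - 1):
            grams.add((k, word[i] + word[i + k + 1]))
    return grams


def similarity(a, b, skips=(0, 1)):
    """Jaccard-style overlap of the s-gram sets of two strings."""
    grams_a = s_grams(a, skips)
    grams_b = s_grams(b, skips)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_matches(lemma, index_terms, top=3):
    """Rank index terms by s-gram similarity to a lemma, best first."""
    ranked = sorted(index_terms,
                    key=lambda term: similarity(lemma, term),
                    reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    # Toy non-normalized index: inflected Finnish forms, no lemmas.
    index_terms = ["talossa", "kissa", "talon", "auto"]
    # "talo" (house) stands in for a lemma produced by a dictionary.
    print(best_matches("talo", index_terms, top=2))
```

In this toy index the lemma "talo" itself never occurs, yet its inflected forms share most of its s-grams and therefore rank above unrelated words. FCG would approach the same problem from the other side, by generating forms such as "talon" and "talossa" from the lemma and adding them to the query.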
There are two basic alternatives in CLIR: either the queries are translated into the target language(s), or the documents are translated into the source language. The first alternative is simpler and more popular than the latter (see Kishida, 2005).

There are various translation approaches. The most common are the dictionary-based approach, the machine translation (MT) approach, and the corpus-based approach. The dictionary-based approach is based on a machine readable translation dictionary. It is quite popular in CLIR research, because it is simple and cheap. Machine translation is also simple, because it is possible to input the whole query in an MT system. Queries are typically short, however, or they are just sets of terms, and thus there is often not enough context for an MT system to perform competent translation (Airio, 2008). There are free MT systems, for example Altavista Babelfish (see http://babelfish.altavista.com/), but Finnish and Swedish are not included in many free systems. Corpus-based translation is based on parallel or comparable corpora and is thus restricted to the topical area of the corpus (Kishida, 2005).

The traditional answer for word form variation is normalization: document words are normalized before indexing, and query words are normalized accordingly. The two common normalization methods are lemmatization and stemming. For morphologically rich languages, like Finnish, German and Slovenian, normalization is vital: retrieval in a normalized index with normalized queries gives statistically significantly better results than inflected retrieval in an inflected (non-normalized) index (see Airio, 2006; Braschler & Ripplinger, 2004; Hollink, Kamps, Monz, & De Rijke, 2004; Popovic & Willet, 1992). According to many studies, normalization is not so important for morphologically simple languages like English

0306-4573/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2009.05.006
* Corresponding author.
Tel.: +358 50 3086896; fax: +358 3 3551 6560.
E-mail addresses: eija.airio@uta.fi (E. Airio), kimmo.kettunen@uta.fi (K. Kettunen).

Information Processing and Management 45 (2009) 703–713
Journal homepage: www.elsevier.com/locate/infoproman