Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval

Turid Hedlund*, Ari Pirkola, Kalervo Järvelin

Department of Information Studies, University of Tampere, PO Box 607, FIN-33101, Tampere, Finland

Received 21 January 2000; accepted 12 April 2000

Information Processing and Management 37 (2001) 147–161
0306-4573/00/$ - see front matter
PII: S0306-4573(00)00024-8
www.elsevier.com/locate/infoproman

* Corresponding author: Swedish School of Economics and Business Administration Library, PO Box 479, FIN-00101 Helsinki, Finland. Tel.: +358-9-43133378; fax: +358-9-43133425. E-mail address: turid.hedlund@shh.fi (T. Hedlund).

Abstract

This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is known poorly from the IR perspective. This paper shows that Swedish has unique features, in particular gender features, the use of fogemorphemes in the formation of compound words, and a high frequency of homographic words. Especially in dictionary-based CLIR, correct word normalization and compound splitting are essential. It was shown in this study, however, that publicly available morphological analysis tools used for normalization and compound splitting have pitfalls that might decrease the effectiveness of IR and CLIR. A comparative study was performed to test the degree of lexical ambiguity in Swedish, Finnish and English. The results suggest that part-of-speech tagging might be useful in Swedish IR due to the high frequency of homographic words. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Text retrieval; Cross-language information retrieval; Swedish language; Natural language processing

1. Introduction

Our society depends on written communication, which today often is created and stored in