87$&/,5±*HQHUDO4XHU\7UDQVODWLRQ)UDPHZRUN IRU6HYHUDO/DQJXDJH3DLUV +HLNNL.HVNXVWDOR 'HSDUWPHQWRI,QIRUPDWLRQ6WXGLHV 8QLYHUVLW\RI7DPSHUH )LQODQG Г KHLNNLNHVNXVWDOR#XWDIL 7XULG+HGOXQG 'HSDUWPHQWRI,QIRUPDWLRQ6WXGLHV 8QLYHUVLW\RI7DPSHUH )LQODQG Г KHGOXQG#VKKIL (LMD$LULR 'HSDUWPHQWRI,QIRUPDWLRQ6WXGLHV 8QLYHUVLW\RI7DPSHUH )LQODQG Г HLMDDLULR#XWDIL 1. SYSTEM DESCRIPTION The UTACLIR query translation process was first designed for the CLEF 2000 campaign [2], followed by an improved version for the CLEF 2001 [3]. The present implementation constitutes a completely revised version of the software. The process input is a structured or unstructured source language query together with codes expressing the source and target languages. The output is the translated target query. The solution is based on building a tree data structure consisting of three levels: (1) the original source keys, (2) the processed source keys, and (3) the translated and processed target keys. Key processing is based on key type [2][3]. The following six key types are recognized in case of non-compound source languages. In case the source key is recognized by morphological software: (i) keys producing only such basic forms which all are stop words, (ii) keys producing at least one translatable basic form, (iii) keys producing no translatable basic forms. In case the source key is not recognised: (iv) the key itself is a stop word, (v) the key is translatable, (vi) the key is untranslatable. For compound languages, a 7th key type is added: (vii) untranslatable but splittable source key. In the target query formation we support key normalization, synonym structuring of translation equivalents, and phrase-based structuring of target phrases [5]. For untranslatable keys, fuzzy matching techniques [4] are utilized to select keys from the database index. The system operates on Solaris workstation. 
The system is programmed in C and consists of a library archive containing general and resource-specific functions. General functions are called to implement the translation service. New resource-specific functions (e.g. interfaces to local dictionaries) and new general functions (e.g. new ways to structure the target query) can be added by writing functions that satisfy the function prototype definitions given by the general system framework. Morphological analysis and word translation are performed using the library archives of the morphological software TWOL (4 languages) and the GlobalDix dictionary (over 300 language pairs).

The present approach has already been put into practice as part of the European Union funded CLARITY project [6]. Future work includes developing effective transitive query translation methods [1].

Categories & Subject Descriptors: H.3.3. INFORMATION STORAGE AND RETRIEVAL

General Terms: Design, Experimentation, Languages

Keywords: Query translation, CLIR, cross-lingual information retrieval, dictionary based retrieval

2. ACKNOWLEDGEMENTS

ENGTWOL (Morphological Transducer Lexicon Description of English): Copyright © 1989-1992 Atro Voutilainen and Juha Heikkilä.

FINTWOL (Morphological Description of Finnish): Copyright © Kimmo Koskenniemi and Lingsoft Oy. 1983-1993.

GERTWOL (Morphological Transducer Lexicon Description of German): Copyright © 1997 Kimmo Koskenniemi and Lingsoft, Inc.

GlobalDix and MOT Dictionary software was used for automatic word-by-word translations. Copyright © 1998-2002 Kielikone Oy, Finland.

SWETWOL (Morphological Transducer Lexicon Description of Swedish): Copyright © 1998 Fred Karlsson and Lingsoft, Inc.

TWOL-R (Run-Time Two-Level Program): Copyright © Kimmo Koskenniemi and Lingsoft Oy. 1983-1992.

3. REFERENCES

[1] Gollins T. and Sanderson M. (2001) Improving Cross Language Information Retrieval with Triangulated Translation. In SIGIR ’01 (New Orleans, USA, 2001), ACM Press, 90-95.
[2] Hedlund T, Keskustalo H, Pirkola A, Sepponen M & Järvelin K, 2001. Bilingual tests with Swedish, Finnish and German queries: dealing with morphology, compound words and query structure. In CLEF 2000 (Lisbon, Portugal, 2001), Springer, 210-223.

[3] Hedlund T, Keskustalo H, Pirkola A, Airio E & Järvelin K, 2002. UTACLIR@CLEF 2001 – Effects of Compound Splitting and N-gram Techniques. To appear in CLEF 2001 Workshop proceedings.

[4] Pirkola A, Keskustalo H, Leppänen E, Känsälä A-P & Järvelin K, 2002. Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants. Information Research, 7(2). [Available at: http://InformationR.net/ir/7-2/paper126.html].

[5] Pirkola A, Hedlund T, Keskustalo H & Järvelin K, 2001. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, 4(3/4), 209-230.

[6] Project description of Clarity. [Available at: http://clarity.shef.ac.uk/].

Copyright is held by the author/owner(s).
SIGIR’02, August 11-15, 2002, Tampere, Finland.
ACM 1-58113-561-0/02/0008.