UTACLIR – General Query Translation Framework
for Several Language Pairs
Heikki Keskustalo
Department of Information Studies
University of Tampere
Finland
heikki.keskustalo@uta.fi
Turid Hedlund
Department of Information Studies
University of Tampere
Finland
hedlund@shh.fi
Eija Airio
Department of Information Studies
University of Tampere
Finland
eija.airio@uta.fi
1. SYSTEM DESCRIPTION
The UTACLIR query translation process was first designed for the
CLEF 2000 campaign [2], followed by an improved version for
CLEF 2001 [3]. The present implementation is a
completely revised version of the software. The process input is a
structured or unstructured source language query together with
codes expressing the source and target languages. The output is the
translated target query. The solution is based on building a tree data
structure consisting of three levels: (1) the original source keys, (2)
the processed source keys, and (3) the translated and processed
target keys. Key processing is based on key type [2][3]. For
non-compound source languages, six key types are recognized.
If the source key is recognized by the morphological software:
(i) keys whose basic forms are all stop words, (ii) keys producing
at least one translatable basic form, (iii) keys producing no
translatable basic forms. If the source key is not recognized:
(iv) the key itself is a stop word, (v) the key is translatable,
(vi) the key is untranslatable. For compound languages, a seventh
key type is added: (vii) untranslatable but splittable source key.
In the target query formation we support
key normalization, synonym structuring of translation equivalents,
and phrase-based structuring of target phrases [5]. For
untranslatable keys, fuzzy matching techniques [4] are utilized to
select keys from the database index. The system runs on a Solaris
workstation. It is programmed in C and consists of a library archive
containing general and resource specific functions. General
functions are called to implement the translation service. Resource
specific functions (e.g. interfaces to local dictionaries) and general
functions (e.g. new ways to structure the target query) can be added
by writing functions satisfying function prototype definitions given
by the general system framework. Morphological analysis and
word translation are performed using the library archives of the
TWOL morphological software (4 languages) and the GlobalDix
dictionary (over 300 language pairs). The present approach has
already been put into practice as part of the European Union funded
CLARITY project [6]. Future development includes developing
effective transitive query translation methods [1].
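The key type classification described above can be sketched in C as follows. This is only an illustrative sketch, not the UTACLIR source code: the names `key_type`, `key_analysis` and `classify_key` are hypothetical, the order in which the conditions are tested is an assumption, and so is the choice to apply type (vii) to any untranslatable key of a compound language, whether or not it was recognized by the morphological software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The seven key types listed in the text (enumerator names are hypothetical). */
typedef enum {
    KEY_RECOGNIZED_STOP_WORDS = 1,    /* (i)   all basic forms are stop words     */
    KEY_RECOGNIZED_TRANSLATABLE,      /* (ii)  at least one translatable form     */
    KEY_RECOGNIZED_UNTRANSLATABLE,    /* (iii) no translatable basic forms        */
    KEY_UNRECOGNIZED_STOP_WORD,       /* (iv)  the key itself is a stop word      */
    KEY_UNRECOGNIZED_TRANSLATABLE,    /* (v)   the key itself is translatable     */
    KEY_UNRECOGNIZED_UNTRANSLATABLE,  /* (vi)  the key itself is untranslatable   */
    KEY_UNTRANSLATABLE_SPLITTABLE     /* (vii) compound languages only            */
} key_type;

/* Hypothetical summary of what morphological analysis, stop word lookup and
 * dictionary lookup reported for one source key. */
typedef struct {
    bool   recognized;            /* morphological software produced basic forms */
    size_t n_basic_forms;         /* number of basic forms, if recognized        */
    size_t n_stop_forms;          /* basic forms that are stop words             */
    size_t n_translatable_forms;  /* basic forms found in the dictionary         */
    bool   key_is_stop_word;      /* used only when the key is not recognized    */
    bool   key_is_translatable;   /* used only when the key is not recognized    */
    bool   key_is_splittable;     /* compound splitting found components         */
} key_analysis;

/* Map one analysed source key to a key type. */
key_type classify_key(const key_analysis *a, bool compound_language)
{
    assert(a != NULL);

    if (a->recognized) {
        if (a->n_basic_forms > 0 && a->n_stop_forms == a->n_basic_forms)
            return KEY_RECOGNIZED_STOP_WORDS;
        if (a->n_translatable_forms > 0)
            return KEY_RECOGNIZED_TRANSLATABLE;
        if (compound_language && a->key_is_splittable)
            return KEY_UNTRANSLATABLE_SPLITTABLE;   /* assumed placement of (vii) */
        return KEY_RECOGNIZED_UNTRANSLATABLE;
    }

    if (a->key_is_stop_word)
        return KEY_UNRECOGNIZED_STOP_WORD;
    if (a->key_is_translatable)
        return KEY_UNRECOGNIZED_TRANSLATABLE;
    if (compound_language && a->key_is_splittable)
        return KEY_UNTRANSLATABLE_SPLITTABLE;       /* assumed placement of (vii) */
    return KEY_UNRECOGNIZED_UNTRANSLATABLE;
}
```

A caller would fill in one `key_analysis` per source key and dispatch on the returned type, for example sending type (vi) keys to fuzzy matching against the database index and type (vii) keys to component-wise translation.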
Categories & Subject Descriptors: H.3.3. INFORMATION
STORAGE AND RETRIEVAL
General Terms: Design, Experimentation, Languages
Keywords: Query translation, CLIR, cross-lingual information
retrieval, dictionary based retrieval
2. ACKNOWLEDGEMENTS
ENGTWOL (Morphological Transducer Lexicon Description of
English): Copyright © 1989-1992 Atro Voutilainen and Juha
Heikkilä. FINTWOL (Morphological Description of Finnish):
Copyright © Kimmo Koskenniemi and Lingsoft Oy. 1983-1993.
GERTWOL (Morphological Transducer Lexicon Description of
German): Copyright © 1997 Kimmo Koskenniemi and Lingsoft,
Inc. GlobalDix and MOT Dictionary software was used for
automatic word-by-word translations. Copyright © 1998-2002
Kielikone Oy, Finland. SWETWOL (Morphological Transducer
Lexicon Description of Swedish): Copyright © 1998 Fred Karlsson
and Lingsoft, Inc. TWOL-R (Run-Time Two-Level Program):
Copyright © Kimmo Koskenniemi and Lingsoft Oy. 1983-1992.
3. REFERENCES
[1] Gollins T & Sanderson M, 2001. Improving Cross
Language Information Retrieval with Triangulated Translation.
In SIGIR ’01 (New Orleans, USA, 2001), ACM Press, 90-95.
[2] Hedlund T, Keskustalo H, Pirkola A, Sepponen M & Järvelin
K, 2001. Bilingual tests with Swedish, Finnish and German
queries: dealing with morphology, compound words and query
structure. In CLEF 2000 (Lisbon, Portugal, 2001), Springer,
210-223.
[3] Hedlund T, Keskustalo H, Pirkola A, Airio E & Järvelin K,
2002. UTACLIR@CLEF 2001 – Effects of Compound
Splitting and N-gram Techniques. To appear in CLEF 2001
Workshop proceedings.
[4] Pirkola A, Keskustalo H, Leppänen E, Känsälä A-P &
Järvelin K, 2002. Targeted s-gram matching: a novel n-gram
matching technique for cross- and monolingual word form
variants. Information Research, 7 (2) [Available at
http://InformationR.net/ir/7-2/paper126.html].
[5] Pirkola A, Hedlund T, Keskustalo H & Järvelin K, 2001.
Dictionary-Based Cross-Language Information Retrieval:
Problems, Methods, and Research Findings. Information
Retrieval, 4(3/4), pp. 209-230.
[6] Project description of Clarity. [Available at:
http://clarity.shef.ac.uk/].
Copyright is held by the author/owner(s).
SIGIR’02, August 11-15, 2002, Tampere, Finland.
ACM 1-58113-561-0/02/0008.