Thomson Legal and Regulatory at CLEF 2001: monolingual and bilingual experiments

Hugo Molina-Salgado, Isabelle Moulinier, Mark Knutson, Elizabeth Lund, Kirat Sekhon
TLR, 610 Opperman Drive, Eagan, MN 55123, USA
Isabelle.Moulinier@westgroup.com

Thomson Legal and Regulatory participated in the monolingual track for all five languages and in the bilingual track with Spanish-English runs. Our monolingual runs for Dutch, Spanish and Italian use settings and rules derived from our runs in French and German last year. Our bilingual runs compared merging strategies for query translation resources.

Introduction

Thomson Legal and Regulatory (TLR) participated in CLEF-2001 with two goals: to reuse rules and settings within a family of languages for monolingual retrieval, and to begin our work on bilingual retrieval.

In our monolingual runs, we considered Dutch and German as one family of languages, while French, Spanish and Italian formed another. For each language, we used the parameters derived from our CLEF-2000 runs for the reference language of its family: German for the first family and French for the second. In addition, we investigated the use of phrases for French and Spanish document retrieval.

Our first attempt at the bilingual track was from Spanish queries to English documents. In that task, we experimented with combining various resources for query translation. Our submitted runs used similarity thesauri and a machine-readable dictionary to translate a Spanish query into a single English query. We also compared our official runs with the merging of individual runs, one per translation resource.

In this paper, we briefly present our search engine and the settings common to all experiments. Then, we discuss our bilingual effort. Finally, we describe our participation in the monolingual track.

General System Description

The WIN system is a full-text natural language search engine, and corresponds to TLR/West Group’s implementation of the inference network retrieval model.
While based on the same retrieval model as the INQUERY system [BCC93], WIN has evolved separately and focused on the retrieval of legal material in large collections in a commercial environment that supports both Boolean and natural language searches [Tur94]. WIN has also been modified to support non-English document retrieval. This included localization of tokenization rules (for instance, handling elision for French and Italian) and stemming. Stemming of non-English terms is performed using a third-party toolkit, the LinguistX platform®, commercialized by Inxight¹. A variant of the Porter stemmer is used for English.

¹ Information can be found at http://www.inxight.com/products_sp/linguistx/index.html.

The WIN engine supports various strategies for computing term beliefs and document scores. We used a standard tf-idf scheme for computing term beliefs in all our runs. Among all the scoring variants, we retained document scoring and portion scoring. Document scoring assigns a score to the document as a whole. This was used in our bilingual runs. Portion scoring finds the best dynamic portion in a document and combines the score of the best portion with the score of the whole document. Portion scoring can be considered an approximation of paragraph scoring (we used this setting last year for French) when documents have no paragraphs. We used portion scoring in our monolingual runs.

A WIN query consists of concepts extracted from natural language text. Normal WIN query processing eliminates stopwords and noise phrases (or introductory phrases), recognizes phrases or other important concepts for special handling, and detects misspellings. Many of the concepts ordinarily recognized by WIN are specific to English documents, in particular within the legal domain. Query processing in WIN