A Compact Data Structure for Searchable Translation Memories

Chris Callison-Burch, Colin Bannard
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW
{chris,colin}@linearb.co.uk

Josh Schroeder
Linear B Ltd.
39 B Cumberland Street
Edinburgh EH3 6RA
josh@linearb.co.uk

Abstract

In this paper we describe searchable translation memories, which allow translators to search their archives for possible translations of phrases. We describe how statistical machine translation can be used to align subsentential units in a translation memory, and rank them by their probability. We detail a data structure that allows for memory-efficient storage of the index. We evaluate the accuracy of translations retrieved from a searchable translation memory built from 50,000 sentence pairs, and find a precision of 86.6% for the top ranked translations.

1 Introduction

The work of any translator or translation agency contains significant amounts of repetition, and translation archives are consequently a vital asset. Current translation memory systems provide a valuable means for translators to exploit this resource in order to increase productivity and to ensure consistency. Existing translation memory systems work by retrieving the translation of full sentences that are exactly or approximately matched in a database of a translator's past work (Trujillo, 1999). Translation memories provide facilities for the automatic alignment of sentences and paragraph units (Gale and Church, 1993; Kay and Röscheisen, 1993), but aligning subsentential units is usually an involved, manual process.

Matching on the sentence-level is a rather severe restriction which means that only very limited re-use is made of the information contained within a translation archive. A translator will frequently use phrases, words and other subsentential strings that s/he has translated before.
However, unless these are contained as a whole unit within the database, conventional translation memory systems are unable to retrieve translations for them.

This paper describes a search tool which allows more flexible information retrieval than sentence-level matching. The usefulness of a translation database might be greatly increased if it could be easily searched, for example by returning focused translations when a user queries it with a single phrase. This paper describes tools which offer precisely that facility. We present searchable translation memories which allow Google-style searching of translation archives.

Figure 1 illustrates the use of the technology. The figure shows example results of querying a searchable translation memory built from French and English portions of the proceedings of the European Parliament. The user has typed the search phrase west bank, and similar to a parallel concordancer (Barlow, 2004), the system has returned a list of sentences that the phrase occurs in. However, unlike a concordancer, the searchable translation memory picks out those phrases which constitute the likely translations of the phrase (cisjordanie, territoires de cisjordanie, rive ouest, and rive gauche du jourdain), groups retrieved sentences by these translations, and ranks the groups according to their probability.

There are two primary technical challenges for searchable translation memories. The first is the ability to index a translation memory so that it contains the correspondences between translated words and phrases across the two languages. For this we