In Journal of Information Science and Technology, Romanian Academy, Vol. 4, No. 3, 2001

Computational bilingual lexicography: automatic extraction of translation dictionaries

Dan Tufiş and Ana-Maria Barbu
RACAI-Romanian Academy Center for Artificial Intelligence
13, "13 Septembrie", RO-74311, Bucharest, 5, Romania
{tufis, abarbu}@racai.ro

Abstract

The paper describes a simple but very effective approach to extracting translation equivalents from parallel corpora. We briefly present the multilingual parallel corpus used in our experiments and then describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall and processing time. The baseline algorithm was used to extract six bilingual lexicons and was evaluated on four of them. The second algorithm was evaluated only on the Romanian-English noun lexicon. An analysis of the missed or wrong translation equivalents revealed various factors, both intrinsic, due to the method, and extrinsic, due to the working data (accuracy of the pre-processing, quality of translation, bitext language relatedness). We conclude by discussing the merits and drawbacks of our method in comparison with other work and comment on further developments.

Keywords: alignment, bitext, bilingual dictionaries, evaluation, hapax-legomena, lemmatization, parallel corpora, tagging

1 Introduction

Automatic extraction of bilingual lexicons from parallel texts might seem a futile task, given that more and more bilingual lexicons are printed nowadays and can easily be turned into machine-readable lexicons. However, if one considers only the possibility of automatically enriching the currently available electronic lexicons with very limited manpower and lexicographic expertise, the task shows considerable potential.
Scientific and technological advancement in many domains is a constant source of new term coinage, so keeping up with multilingual lexicography in such areas is very difficult unless computational means are used. On the other hand, bilingual lexicons intended for automatic translation appear to be quite different from the corresponding printed lexicons, which are meant for human users. This marked difference is not really surprising. Traditional lexicography deals with translation equivalence (the underlying concept of bilingual lexicography) in an inherently discrete way. What is to be found in a printed dictionary or lexicon (bi- or multilingual) is just a set of general basic translations. In the case of specialised registers, general lexicons are usually not very useful.

A pair of texts that are translations of each other is called a parallel text or a bitext. Extracting bilingual dictionaries from a bitext is a process based on the notion of translation equivalence. In a given parallel text, the assumption is that the same meaning is linguistically expressed in two or more languages. Meaning identity between two or more representations