Developing Lexicon Databases for English to Sinhala Machine Translation

B. Hettige¹, A. S. Karunananda²
¹ Department of Statistics and Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura, Sri Lanka
² Faculty of Information Technology, University of Moratuwa, Sri Lanka
¹ budditha@yahoo.com, ² asoka@itfac.mrt.ac.lk

Abstract – Machine translation has been identified as one of the most challenging areas in computing. It has also been recognized that fully automatic translation by machines leaves semantic misinterpretations and various other limitations, owing to the limited richness of the lexicon databases in a machine translation system. Therefore, computer-assisted translation has been recognized as a promising first step in any machine translation project. This paper reports on the design and development of the lexicon database sub-system for an English to Sinhala Computer-Assisted Translation system. The system uses a customized version of the well-known WordNet as its English lexicon database, and three Sinhala dictionaries structured along the lines of the Japanese EDR system. SWI-Prolog has been used to develop the core of the translation system.

I INTRODUCTION

Machine Translation (MT) is the process of translating text from one natural language into another. In general, a machine translation system contains a source language morphological analyzer, a source language parser, a translator, a target language morphological analyzer, a target language parser, and several lexicon databases [8]. The translator translates a source language word into the target language. A machine translation system therefore needs a minimum of three dictionaries: the source language dictionary, the bilingual dictionary and the target language dictionary. The source language morphological analyzer needs the source language dictionary for morphological analysis.
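As a rough illustration of how the three dictionaries named above cooperate, the following Python sketch chains one lookup through each of them. It is not the paper's SWI-Prolog implementation: the dictionary contents, the function names `analyze` and `translate_word`, and the restriction to noun number are all assumptions made for this example.

```python
# Toy illustration of the three-dictionary arrangement. All entries and
# names here are hypothetical and are not taken from the paper's lexicon.

# Source language dictionary: English surface form -> (lemma, number).
ENGLISH_DICT = {
    "book": ("book", "singular"),
    "books": ("book", "plural"),
    "tree": ("tree", "singular"),
    "trees": ("tree", "plural"),
}

# Bilingual dictionary: English lemma -> Sinhala base word.
BILINGUAL_DICT = {
    "book": "පොත",
    "tree": "ගස",
}

# Target language dictionary: (Sinhala base word, number) -> surface form.
SINHALA_DICT = {
    ("පොත", "singular"): "පොත",
    ("පොත", "plural"): "පොත්",
    ("ගස", "singular"): "ගස",
    ("ගස", "plural"): "ගස්",
}


def analyze(word):
    """Morphological analysis: look the word up in the source dictionary."""
    return ENGLISH_DICT.get(word.lower())


def translate_word(word):
    """Translate one English noun, or return None if any lookup fails."""
    analysis = analyze(word)
    if analysis is None:
        return None
    lemma, number = analysis
    base = BILINGUAL_DICT.get(lemma)  # translator step
    if base is None:
        return None
    return SINHALA_DICT.get((base, number))  # morphological generation step


if __name__ == "__main__":
    for english in ("book", "books", "trees", "river"):
        print(english, "->", translate_word(english))
```

In this sketch a word missing from any of the three dictionaries yields `None`; a computer-assisted system could pass such cases to a human translator instead of guessing.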
The bilingual dictionary is used by the translator to translate the source language into the target language, and the target language morphological generator uses the target language dictionary to generate target language words. For English to Sinhala machine translation, the system thus needs an English dictionary, an English-Sinhala bilingual dictionary and a Sinhala dictionary.

However, pure machine translation is a complex task due to the syntactic, semantic and pragmatic concerns of natural languages. As such, a number of machine translation approaches are available to date, including Machine Aided Translation, Rule-based Translation, Statistical Translation, Example-based Translation and Knowledge-based Translation [24]. Machine Aided Translation is also called Computer-Assisted Translation (CAT) or computer-aided translation. In CAT, the task is shared between human and machine. Anusaaraka [1] is a popular machine-aided translation system for Indian languages that makes text in one Indian language accessible in another Indian language. Anglabharti [2] is another system that represents a machine-aided translation methodology specifically designed for translating English into Indian languages. Similarly, Anglahindi [3] translates English to Hindi using a machine-aided translation methodology. In the rule-based approach, transfer rules map source language representations to target language representations. PLOENG is an example of a rule-based machine translation system that translates English to Polish [16]. Statistical machine translation is a popular approach that generates alternative possible translations and selects the most probable one in the target language. This method needs a large corpus of the target language. For example, MANOS [27] is a statistical machine translation system. Example-based machine translation extends the idea of translation memories and reuses existing translation fragments.
METIS-II is an example-based machine translation system [28].

However, the structure of the lexical database varies from one machine translation approach to another. Machine-aided translation needs fewer lexical resources than the other approaches. A knowledge-based machine translation system needs more lexical resources, together with a suitable knowledge representation. Statistical machine translation needs lexical resources and a considerably large target language corpus. It is evident that all these approaches suffer from a lack of lexical and semantic information for producing a quality translation. Perhaps the best way to address this issue is human intervention in the post-editing phase of the translation process. This will not only improve the quality of the translation, but also avoid the limitations of the lexicon databases in a machine translation system.

This paper describes our research on the design and development of lexicon resources for an English to Sinhala Computer-Assisted Translation system. In this work, four dictionaries have been developed: an English-Sinhala bilingual dictionary and three Sinhala dictionaries, namely the Sinhala base dictionary, the Sinhala rule dictionary and the Sinhala concept dictionary. The entire set of lexicon resources has been implemented using SWI-Prolog. The rest of this paper is organized as follows. Section II describes the existing lexical database for machine