USAAR-DCU Hybrid Machine Translation System for ICON 2014

Santanu Pal¹, Ankit Srivastava², Sandipan Dandapat², Josef van Genabith¹, Andy Way²
¹ Universität des Saarlandes, Saarbrücken, Germany
² CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Ireland
{santanu.pal, josef.vangenabith}@uni-saarland.de
{asrivastava, sdandapat, away}@computing.dcu.ie

Abstract

In this paper, we describe the USAAR-DCU machine translation system submitted to the NLP Tools Contest of the International Conference on Natural Language Processing (ICON 2014). The shared task on statistical machine translation in Indian languages encompassed translating from five languages into Hindi in three different domains. Our best system achieved an overall BLEU score of 24.61 averaged over all language pairs and all domains. The main innovations are: (i) effective preprocessing and use of explicitly aligned bilingual terminology i.e. named entities, (ii) simple but effective hybridisation technique for using multiple knowledge sources. Our hybrid system potentially improves over the baseline statistical machine translation performance by incorporating additional knowledge sources such as the extracted bilingual named entities, translation memories, and phrase pairs induced from example-based methods. We report performance on three hybrid systems as well as results of a confusion network-based system combination that combines the best performance of each individual system within the multi-engine pipeline.

1 Introduction

In this paper, we present a joint Universität des Saarlandes (USAAR) and Dublin City University (DCU) submission for the Machine Translation (MT) tools contest at the International Conference on Natural Language Processing (ICON) 2014 using the Hybrid MT system framework. We participated in the generic translation shared task for the five language pairs, i.e.
Bengali–Hindi (BN–HI), English–Hindi (EN–HI), Marathi–Hindi (MR–HI), Tamil–Hindi (TA–HI), and Telugu–Hindi (TE–HI), across three different domains, namely health, tourism, and general.¹

¹ The general-domain MT system was trained by combining data from the health and tourism domains.

Recently, corpus-based MT has delivered increasingly better quality translations. Many approaches have been proposed in the last few decades, such as Translation Memory (TM) (Kay, 1980), Example-based Machine Translation (EBMT) (Carl and Way, 2006) and Statistical Machine Translation (SMT) (Koehn, 2010). Of these, in terms of large-scale evaluations, SMT is the most successful and efficient MT paradigm. The quality of SMT relies mainly on good-quality word alignment and optimal phrase pair estimation, both of which can be achieved by using large amounts of sentence-aligned parallel corpora. However, MT of low-resource language pairs (as is the case in this shared task) usually produces inferior quality translation.

Conventionally, TM systems store source and target language translation pairs so that translations previously created by human translators can be reused effectively. Conceptually speaking, EBMT is closely related to TM. The difference between the two approaches is that EBMT extracts translations of fragments from the translation model and combines them to produce a grammatically correct translation.

Each approach has its own method of acquiring and using translation knowledge from the parallel bilingual translation examples, along with its own advantages and disadvantages. The knowledge representation processes in EBMT and SMT use very different techniques to extract resources. The SMT phrases essentially operate on n-grams, rather than grammatical phrases as in