COMPARISON OF PART-OF-SPEECH TAGSET FOR IMPROVING ENGLISH-INDONESIAN STATISTICAL MACHINE TRANSLATION

Herry Sujaini*, Hammam Riza** and Arry Akhmad Arman***
* Faculty of Engineering, University of Tanjungpura, Indonesia
** The Agency for the Assessment and Application of Technology, Indonesia
*** School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia

ABSTRACT
Statistical machine translation (SMT) models are limited to mapping phrases or blocks of the source language to the target language without the use of linguistic information. Part-of-speech (PoS) information can be added as one linguistic feature to improve translation quality. The Indonesian PoS tagsets used in natural language processing are very diverse, so we ran experiments to determine which PoS tagset works best as additional linguistic information in SMT. This paper discusses various PoS tagsets as a feature in the SMT factored translation model, with experiments run in Moses and evaluated with BLEU. We use several PoS tagsets from computational linguistics studies in Indonesia. The experimental results show that Wicaksono's PoS tagset gives a better BLEU score than the other PoS tagsets. This will enable the improvement of English-Indonesian SMT as part of our participation in the network-based ASEAN-MT system.

Index Terms: Indonesian PoS tagset, Statistical Machine Translation, Moses, BLEU score

1. INTRODUCTION
Natural language processing can be used for various purposes, one of which is statistical machine translation (SMT). SMT is an approach to machine translation that relies on the concept of probability. Each pair of sentences (Sc, Tg) is assigned a probability P(Tg|Sc), interpreted as the probability that the SMT system produces the sentence Tg in the target language when given the sentence Sc in the source language.
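As a minimal sketch of the conditional probability P(Tg|Sc) described above, the following Python snippet estimates it by relative frequency from a handful of sentence pairs and picks the most probable target. The sentence pairs, the function names, and the whole-sentence estimate are assumptions made for illustration only; a real SMT system estimates probabilities over phrases from a large parallel corpus rather than over whole sentences.

```python
from collections import Counter, defaultdict

# Toy sentence-pair observations, invented for illustration only.
observed_pairs = [
    ("good morning", "selamat pagi"),
    ("good morning", "selamat pagi"),
    ("good morning", "pagi yang baik"),
    ("thank you", "terima kasih"),
]


def estimate_conditional(pairs):
    """Relative-frequency estimate: P(Tg | Sc) = count(Sc, Tg) / count(Sc)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(sc for sc, _ in pairs)
    table = defaultdict(dict)
    for (sc, tg), n in pair_counts.items():
        table[sc][tg] = n / source_counts[sc]
    return table


def best_translation(table, sc):
    """Return the target sentence with the highest estimated P(Tg | Sc)."""
    candidates = table[sc]
    return max(candidates, key=candidates.get)


p = estimate_conditional(observed_pairs)
choice = best_translation(p, "good morning")
```

For each source sentence the estimated values form a probability distribution over the observed targets, so they sum to one.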
Phrase-based translation models are limited to mapping text snippets without the use of additional linguistic information such as morphology, syntax, or semantics. Such additional information has proved valuable when integrated in pre-processing or post-processing steps. As with phrase-based models, factored translation models can be seen as a combination of several components (language models, reordering models, translation steps, generation steps). These components define one or more feature functions that are combined in a log-linear model:

\[ p(e \mid f) = \frac{1}{Z} \exp \left( \sum_{i=1}^{n} \lambda_i h_i(e, f) \right) \]

where Z is a normalization constant that is normally neglected in the implementation stage and \lambda_i is the weight of feature function h_i. To calculate the probability of a translation e of the input sentence f, every feature function h_i is evaluated.

Several works have shown that MT accuracy improves with additional features such as lemma, part of speech (PoS), gender, and others. This study focuses on the PoS tagset feature, specifically for Indonesian. PoS as a feature in machine translation can improve translation accuracy [1, 2, 3, 4, 5], but it is not known which PoS tagset is best suited to improving the accuracy of MT output. The PoS tagsets used in NLP research, particularly in MT systems, vary widely [6, 7, 8]. The open question is which PoS tagset is optimal for improving the accuracy of SMT output.

2. FACTORED TRANSLATION MODEL
Factored translation models integrate additional linguistic markup at the word level. Each type of additional word-level information is called a factor. See Figure 1 for an illustration of the type of information that can be useful for the translation process in an MT system. Translating lemma and morphological factors separately helps with sparse data problems in morphologically rich languages.
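The notion of a factor described above can be sketched in Python as follows. The sketch assumes a factored sentence in which each token carries three factors (surface form, lemma, PoS) joined by a vertical bar, the separator used in Moses factored corpora. The `FactoredWord` type, the helper names, the example sentence, and the tag labels PRP, VB and NN are illustrative assumptions and are not taken from any of the tagsets compared in this paper.

```python
from typing import List, NamedTuple


class FactoredWord(NamedTuple):
    """One word represented as a vector of factors (hypothetical layout)."""
    surface: str
    lemma: str
    pos: str


def parse_factored(sentence: str, sep: str = "|") -> List[FactoredWord]:
    """Split a factored sentence into words, each with three factors."""
    words = []
    for token in sentence.split():
        parts = token.split(sep)
        if len(parts) != 3:
            raise ValueError(f"expected 3 factors in token {token!r}")
        words.append(FactoredWord(*parts))
    return words


def project(words: List[FactoredWord], factor: str) -> List[str]:
    """Extract a single factor (e.g. 'pos') from every word."""
    return [getattr(w, factor) for w in words]


# Illustrative Indonesian sentence: "saya membaca buku" (I read a book).
example = "saya|saya|PRP membaca|baca|VB buku|buku|NN"
words = parse_factored(example)
pos_sequence = project(words, "pos")
lemma_sequence = project(words, "lemma")
```

Projecting one factor at a time mirrors how a factored model can translate or score lemmas and PoS tags separately from surface forms.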
Additional information such as PoS may be helpful in making reordering or grammatical coherence decisions. The presence of morphological features on the