International Journal on Asian Language Processing 21 (2): 57-70

Pitman-Yor Process-Based Language Models for Machine Translation

Tsuyoshi Okita, Andy Way
Dublin City University, CNGL / School of Computing, Glasnevin, Dublin 9, Ireland
{tokita, away}@computing.dcu.ie

Abstract

The hierarchical Pitman-Yor process-based smoothing method applied to language models was proposed by Goldwater and by Teh; its performance has been shown to be comparable with the modified Kneser-Ney method in terms of perplexity. Although this method was presented four years ago, no paper has reported that this language model actually improves translation quality in the context of Machine Translation (MT). This matters to the MT community because an improvement in perplexity does not always lead to an improvement in BLEU score; for example, the success of word alignment measured by Alignment Error Rate (AER) does not often lead to an improvement in BLEU. This paper reports, in the context of MT, that an improvement in perplexity really does lead to an improvement in BLEU score. It turns out that applying the Hierarchical Pitman-Yor Language Model (HPYLM) requires a minor change to the conventional decoding process. In addition, we propose a new Pitman-Yor process-based statistical smoothing method similar to the Good-Turing method, although its performance is inferior to that of HPYLM. In our experiments, HPYLM improved translation quality by 1.03 BLEU points absolute and 6% relative for 50k EN-JP, which was statistically significant.

Keywords: Statistical Machine Translation, statistical smoothing method, hierarchical Pitman-Yor process, language models, Kneser-Ney method, Chinese restaurant process.
1 Introduction

Statistical approaches and non-parametric Machine Learning methods estimate targeted statistical quantities either from the (true) posterior distributions in a Bayesian manner (Bishop, 2006), or from the underlying fixed but unknown (joint) distributions from which the training examples are assumed to be sampled, in a frequentist manner (Vapnik, 1998). In NLP (Natural Language Processing), such distributions are observed by simply counting (joint / conditional) events, such as c(w), c(w_0, w_1, w_2) and c(w_3 | w_1, w_2), where w denotes a word and c(·) denotes a function that counts events; since such quantities are discrete, it seems unlikely at first sight that such events could be counted incorrectly. However, it is a well-known fact in NLP that such counting methods are often unreliable if the size of the corpus is too small compared to the model complexity. Researchers in NLP often try to rectify such counting of (joint or conditional) events
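The count-based estimation described above, and the sparsity problem that motivates smoothing, can be sketched as follows. This is a toy illustration only, not the paper's method; the corpus and the helper name `mle` are hypothetical:

```python
from collections import Counter

# Toy corpus; in practice counts come from a large training corpus.
corpus = "the cat sat on the mat the cat ate".split()

# Event counts: c(w) for single words, c(w1, w2) for adjacent pairs.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Maximum-likelihood conditional estimate: p(w2 | w1) = c(w1, w2) / c(w1).
def mle(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(mle("cat", "the"))  # 2/3: "the cat" seen twice, "the" three times
print(mle("dog", "the"))  # 0.0: an unseen event gets zero probability
```

On a small corpus, any bigram not observed in training receives probability zero, even though it may well occur at test time; smoothing methods such as Kneser-Ney or the Pitman-Yor process redistribute probability mass to exactly these unseen events.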