EXPERIMENTS WITH RUSSIAN TO KAZAKH SENTENCE ALIGNMENT

Zhenisbek Assylbekov
Nazarbayev University, Astana, Kazakhstan
zhassylbekov@nu.edu.kz

Bagdat Myrzakhmetov, Aibek Makazhanov
National Laboratory Astana, Astana, Kazakhstan
bagdat.myrzakhmetov@nu.edu.kz, aibek.makazhanov@nu.edu.kz

Sentence alignment is the final step in building parallel corpora, and arguably the one with the greatest impact on the quality of the resulting corpus and on the accuracy of machine translation systems trained on it. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that applying all of the considered data processing techniques yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences.

Keywords: sentence alignment, sentence splitting, lemmatization, parallel corpus, Kazakh language

UDC 81’32

1. Introduction

Sentence alignment (SA) is the problem of identifying parallel sentences (pairs of sentences that are translations of each other) in a given pair of source and target documents, where the target document is assumed to be a translation of the source (the mutual translation assumption is also common). More formally, given a source document D_s and a target document D_t represented as lists of sentences S and T respectively, SA is the task of building a list of pairs P, where each pair p aligns zero or more (ideally one) source sentences to zero or more (ideally one) target sentences. Approaches based on sentence length correlations [1, 2], bilingual lexicons [3], and combinations of the two [4] have been proposed in the past to solve this problem in a sufficiently accurate and efficient manner.
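To make the length-based family of approaches [1, 2] concrete, the following is a minimal dynamic-programming sketch: spans of sentences are aligned so that their character-length ratios stay close to 1, with fixed penalties for non-1-1 patterns. The penalty values and the cost function are our own illustrative assumptions, not the statistical priors of the cited work.

```python
import math

# Allowed alignment patterns (source span size, target span size) and
# their penalties; 1-1 is preferred. Values are illustrative assumptions.
PATTERNS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}

def length_cost(src_chars, tgt_chars):
    """Penalize deviation of the target/source character-length ratio
    from 1; empty spans get a fixed penalty."""
    if src_chars == 0 or tgt_chars == 0:
        return 5.0
    return abs(math.log(tgt_chars / src_chars))

def align(src, tgt):
    """Align two lists of sentences by dynamic programming over
    cumulative costs; returns a list of (source_span, target_span) pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), penalty in PATTERNS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = sum(len(x) for x in src[i:ni])
                t = sum(len(x) for x in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s, t)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Backtrack from the full alignment to recover the chosen spans.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

With equally long sentence lists the sketch prefers 1-1 pairs, and merges two short source sentences into one long target sentence when the combined length ratio is better; real aligners replace the hand-set penalties with probabilities estimated from data.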
In this paper we do not offer a new solution to the problem, nor do we try to improve existing approaches. Our goal is to investigate what can be done to the input data (rather than to the methods) to improve the quality of SA. We begin by asking a few questions inspired directly by the definition of the problem and by the ways of solving it. First, the formal definition of the SA problem assumes that the documents to be aligned are already split into sentences. In practice, however, this is almost never the case, and one has to perform sentence splitting before SA. Assuming that one uses for this a statistical approach that requires training, e.g. the Punkt splitter [5], a question regarding the choice of training data arises: does it suffice to train the splitter on any data, or is it beneficial to train on a sample drawn from the target domain? Second, assuming one uses a lexicon-based approach to SA, should one bother to reduce typos and data sparsity in the input, and which lexicon should one use: an automatically induced or a handcrafted one? Lastly, after the sentences have been aligned, can we still increase the proportion of parallel pairs?

In an attempt to answer these questions, we propose to employ the following five data processing techniques: (i) domain-adapted sentence splitting; (ii) error correction; (iii) lemmatization (to reduce sparsity); (iv) use of handcrafted bilingual lexicons; (v) junk removal. The objective of this work is to assess the impact of the proposed data processing techniques on SA accuracy and to find the combination thereof that maximizes the quality of the parallel corpora produced by SA.

2. Data Collection

For our experiments we crawled three websites, akorda.kz, strategy2050.kz, and astana.gov.kz, using our own Python scripts to download only specific branches of these sites - mainly news and
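Returning to the sentence-splitting question raised above: a statistical splitter such as Punkt adapts to a domain largely by learning that domain's abbreviations from raw text, so that a period after "г." or "др." does not trigger a sentence break. The toy sketch below is not the actual Punkt algorithm; the frequency heuristic, the lowercase-follower test, and the threshold are our own simplifications for illustration.

```python
import re
from collections import Counter

def learn_abbreviations(domain_text, min_count=2):
    """Collect candidate abbreviations: period-final tokens that are
    frequently followed by a lowercase word, so the period is unlikely
    to end a sentence. A crude stand-in for Punkt's unsupervised stats."""
    counts = Counter()
    # \w matches Cyrillic in Python 3; [a-zа-яё] is a crude lowercase test.
    for m in re.finditer(r"(\w+)\.\s+([a-zа-яё])", domain_text):
        counts[m.group(1).lower()] += 1
    return {w for w, c in counts.items() if c >= min_count}

def split_sentences(text, abbrevs):
    """Split on '.', '!', '?' followed by whitespace, unless the period
    terminates a learned abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        end = m.start() + 1
        token = re.search(r"(\w+)\.$", text[start:end])
        if text[end - 1] == "." and token and token.group(1).lower() in abbrevs:
            continue  # period belongs to an abbreviation, keep going
        sentences.append(text[start:end].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Training the abbreviation set on in-domain Russian text keeps constructs like "2016 г. прошла" inside one sentence, which is the motivation behind domain-adapted splitting in technique (i).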