Acquis Communautaire Sentence Alignment using Support Vector Machines

Alexandru Ceauşu, Dan Ştefănescu, Dan Tufiş
Research Institute for Artificial Intelligence of the Romanian Academy
13, Calea 13 Septembrie, 050711, Bucharest
{alceausu, danstef, tufis}@racai.ro

Abstract
Sentence alignment is a task that requires not only accuracy, since errors can affect further processing, but also modest computational resources and language-pair independence. Although many implementations do not use translation equivalents because these are dependent on the language pair, this feature is a prerequisite for increased accuracy. The paper presents a hybrid sentence aligner that performs two alignment iterations. The first iteration is based mostly on sentence length; the second is based on a translation equivalents table estimated from the results of the first. The aligner uses a Support Vector Machine classifier to discriminate between positive and negative examples of sentence pairs.

1. Introduction
Sentence alignment is a prerequisite for any parallel corpus processing, and it has been shown that very good results can be obtained with practically no prior knowledge of the languages concerned. However, since sentence alignment errors may be detrimental to further processing, ensuring high sentence alignment accuracy is a continuous concern for many NLP practitioners. Sentence alignment is not characterized by accuracy alone. Language-pair independence is another important requirement that a sentence aligner must meet. In addition, many implementations emphasize their ability to run with little computational power. Our sentence aligner employs a Support Vector Machine classifier to discriminate between “good” and “bad” sentence pairs. The aligner was tested on selected pairs of languages from the recently released 20-language Acquis Communautaire parallel corpus (http://wt.jrc.it/lt/acquis/).

2. Related Work
One of the best-known algorithms for aligning parallel corpora (Gale and Church, 1991) is based on the lengths of sentences that are reciprocal translations; a very popular implementation is the Vanilla aligner (http://nl.ijs.si/telri/Vanilla/) by P. Danielsson and D. Ridings. Chen (1993) developed a method based on optimizing word translation probabilities that gives better results than the sentence-length approach, but it takes much longer to complete and requires more computing resources. Melamed (1996) also developed a method based on word translation equivalence and geometric mapping. Moore (2002) presents a hybrid approach with three stages. In the first stage, the algorithm uses length-based methods for sentence alignment. In the second stage, a translation equivalence table is estimated from the corpus aligned in the first stage, using a method based on IBM Model 1 (Brown, 1993). The final stage uses a combination of length-based methods and word correspondences to find 1-1 sentence alignments. The aligner has excellent precision for one-to-one alignments, as it was designed to acquire very accurate training data for machine translation experiments. A limitation of this aligner is that it was tested on only 10,000 sentence pairs (it cannot process more than 100,000 sentence pairs).

3. Features Selection
In the process of feature selection, any sentence pair can be characterized by a collection of scores, one for each feature. The alignment problem can therefore be reduced to a two-class classification task: discriminating between “good” and “bad” alignments. One of the best-performing formalisms for this task proves to be Vapnik’s Support Vector Machine (Vapnik, 1995).
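As background for the length-based stage mentioned above, the Gale-Church approach can be sketched as follows. The constants below (expected length ratio `C` and variance `S2`) are the commonly cited defaults from the Gale and Church work, not values taken from this paper, and the function names are illustrative:

```python
import math

# Illustrative sketch of the Gale-Church length statistic.
# C and S2 are the commonly cited default parameters (assumptions here):
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the character-length ratio

def length_score(len_src: int, len_tgt: int) -> float:
    """Return delta, a z-like statistic: values near 0 suggest that the
    two sentences are plausible reciprocal translations by length alone."""
    return (len_tgt - len_src * C) / math.sqrt(len_src * S2)

def match_probability(delta: float) -> float:
    """Two-tailed probability of observing a deviation at least |delta|
    under the normality assumption; higher means a more likely pair."""
    return math.erfc(abs(delta) / math.sqrt(2.0))
```

Under this model, a candidate pair whose lengths deviate strongly from the expected ratio receives a low probability, which is why length alone already gives a usable first-pass alignment signal.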
We used an out-of-the-box solution for Support Vector Machine (SVM) training and classification, LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm; Fan et al., 2005), with default parameters (C-SVC classification and a radial basis kernel function). The accuracy of the SVM model was evaluated (10-fold cross-validation) on five manually aligned files from the Acquis Communautaire corpus for the language pairs English-French, English-Italian, and English-Romanian. For each language-pair experiment we used approximately 1000 sentence pairs.

To train the SVM model, each sentence pair from the “gold standard” is characterized by a collection of scores on features such as translation equivalence, word length correlation, word rank correlation, etc. The examples of “bad” alignments were generated automatically from the gold standard by replacing one sentence in a correctly aligned pair with another sentence from the three-sentence vicinity. The replacement sentence was randomly selected from either the preceding or the following sentences. The SVM classifier’s performance increases considerably when it uses more highly discriminative features, while irrelevant features or those with little discriminative power negatively affect its accuracy. Therefore, in the process of feature selection we evaluated a series of features, of which the best performing are listed in Table 1. The non-word length correlation in Table 1 refers to language-independent non-lexical tokens such as punctuation,
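The negative-example generation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function and parameter names are invented, and for simplicity only the target side of each pair is replaced, whereas the paper replaces one sentence of the pair:

```python
import random

def make_negative_pairs(gold_pairs, window=3, seed=0):
    """Generate 'bad' alignment examples from a list of gold (src, tgt)
    sentence pairs by replacing the target sentence of each correct pair
    with a target sentence drawn at random from the +/-window vicinity.
    (Hypothetical helper; window=3 mirrors the three-sentence vicinity.)"""
    rng = random.Random(seed)
    negatives = []
    for i, (src, _tgt) in enumerate(gold_pairs):
        # Offsets into the preceding/following sentences, staying in range.
        offsets = [d for d in range(-window, window + 1)
                   if d != 0 and 0 <= i + d < len(gold_pairs)]
        if not offsets:
            continue
        j = i + rng.choice(offsets)
        negatives.append((src, gold_pairs[j][1]))
    return negatives
```

Each resulting pair keeps a real source sentence but a mismatched, nearby target sentence, so the negatives are hard cases (similar length and vocabulary) rather than random noise, which is what makes them useful for training the classifier.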