CROSS SENTENCE ALIGNMENT BASED ON SINGULAR VALUE DECOMPOSITION

ANNA HO¹, FAI WONG¹, FRANCISCO DE OLIVEIRA¹, YIPING LI¹

¹Faculty of Science and Technology of University of Macau, PO Box 3001, Macao SAR
E-MAIL: ma36560@umac.mo, derekfw@umac.mo, olifran@umac.mo, ypli@umac.mo

Abstract:
This paper describes a method for aligning Portuguese-Chinese bilingual sentence pairs from two given documents. Extracting the aligned pairs is a critical step in building a Portuguese-Chinese bilingual corpus for Example Based Machine Translation (EBMT) systems. In short, the proposed alignment system performs four steps: it breaks the documents down to the sentence level, scores each pair of sentences using different features, applies Singular Value Decomposition to the results, and extracts the aligned pairs based on a similarity function.

Keywords:
Bilingual document alignment; singular value decomposition; segmentation; cross sentence alignment

1. Introduction

Most recent machine translation research focuses only on algorithms for English-Chinese or for structurally similar languages. By comparison, Portuguese-Chinese and other structurally dissimilar language pairs have fewer related resources. Since there is still room for improvement in Portuguese-Chinese machine translation algorithms, this paper focuses on new issues in this field. Because alignment of a bilingual corpus is a significant process in developing an Example Based Machine Translation system, the following sections of this paper present the corresponding alignment process step by step. The corpus chosen for testing the algorithm comes from different categories, such as laws, magazines and the government official gazette.
As the corpus comes from different areas, the proposed alignment algorithm employs the Singular Value Decomposition (SVD) technique so that it can handle not only one-to-one parallel mapping cases, but also cross level sentence alignment cases, as shown in Figure 1.

Figure 1. Possible Cross Level Sentence Alignment

2. Review

The existing literature already provides many different approaches to aligning bilingual documents. These techniques can generally be classified into three types: lexicon based [1-2], statistical based [3-4], and a combination of lexicon and statistical based [5]. Lexicon based techniques mainly make use of a dictionary to perform the alignment procedure. Statistical based approaches, on the other hand, usually rely on statistical properties such as the length ratio of the sentences in the two languages. The combination approach mixes the two methods in order to gain the advantages of both and thus maximize the accuracy of the resulting output.

However, the techniques mentioned above mainly focus on the alignment of structurally similar languages, for example, the research works of Melamed [6] and Fung, et al. [7]. Applying those techniques directly to Portuguese and Chinese documents cannot effectively align comparable or non-parallel bilingual corpora with dissimilar language structures. One of our difficulties is the dissimilarity of sentence structure, which means that our approach cannot rely on the appearance of the bilingual sentences. Another difficulty is that different Chinese words may actually refer to the same Portuguese word. For example, the word “correio” (post) has the translation “郵包” in the dictionary, but the corresponding bilingual sentence contains the word “郵