Towards a Multilingual Aligned Parallel Corpus

Imad Zeroual and Abdelhak Lakhouaja
Computer Sciences Laboratory, Faculty of Sciences, Mohammed First University, Oujda, Morocco
{mr.imadine, abdel.lakh}@gmail.com

Abstract: A large body of work now exists on monolingual corpora, and the amount of available monolingual data has grown significantly in recent years. Unfortunately, not all types of corpora have benefited equally from this growth. One example is the multilingual aligned parallel corpus, of which only a few exist in the cross-language research area. The goal of this work is therefore to produce a new multilingual aligned parallel corpus and to add to the work being carried out on building such corpora. In this paper, we report ongoing work on creating a multilingual aligned parallel corpus of subtitles from TEDx Talks events. The corpus currently contains roughly 6,000 aligned multilingual subtitles covering 200 video talks in different languages (Arabic, English, French, Spanish, Italian, etc.) and a variety of topics such as Business, Education, and Environment. The corpus is divided into two sub-corpora. The first contains about 200 files for each of 15 languages; the second is available in 30 languages, with an average of roughly 100 files per language.

Keywords: Natural Language Processing; Multilingual; Parallel corpora; Sentence alignment.

I. INTRODUCTION

Progress in most Natural Language Processing (NLP) research fields is driven by the availability of data. Multilingual parallel corpora have proven valuable in various NLP applications and research disciplines [1]. One such application is word sense disambiguation, where word senses are derived from word alignments on a parallel corpus instead of a predefined monolingual sense inventory such as WordNet [2].
Parallel corpora have also been used for the evaluation of multilingual multi-document summarization [3], for part-of-speech tagging [4], and for syntactic annotation, as applied to the parallel corpus Prague Czech-English Dependency Treebank [5, 6]. Parallel corpora also improve named entity translation, which plays a vital role in applications such as cross-lingual information retrieval and machine translation [7]. In addition, statistical machine translation systems based on probabilistic translation models are generally trained on sentence-aligned parallel corpora [8], [9].

In contrast to monolingual corpora, only a few parallel corpora exist, and most of them are bilingual rather than multilingual. Furthermore, their sources often cover a restricted range of text types such as legislation, administration, and technical documentation. The coverage of the available multilingual parallel corpora is also limited: they are restricted to high-density languages such as English and other European languages. Despite the importance of the Arabic language, it is not included in most of the relevant multilingual parallel corpora. To our knowledge, only some small bilingual/multilingual corpora include Arabic.

In this paper, we highlight the value of multilingual aligned parallel corpora in various NLP applications and contribute to the enrichment of multilingual resources, especially those that involve the Arabic language. We decided to build a new multilingual aligned parallel corpus, taking advantage of the growth of online databases of video subtitles from TEDx Talks events. As a result, we are currently building a corpus that contains approximately 6,000 aligned multilingual subtitles covering 200 talks on a variety of topics, with more than 10 million words altogether. This corpus is divided into two sub-corpora.
The first contains about 200 files for each of 15 languages, and the second is available in 30 languages, with an average of roughly 100 files per language.

Besides the introduction, the paper consists of four sections. Section II reviews a number of relevant parallel corpora. Section III describes our methodology and the criteria adopted to filter the collected database, and mentions a couple of approaches to sentence alignment. Section IV presents the results achieved so far and their advantages. Finally, we present our conclusion and our perspectives for the next stages of this work.

II. STATE OF THE ART

In this section, we present some popular bilingual and multilingual parallel corpora. As mentioned, many parallel corpora are in fact bilingual. Examples include French-English [10], Japanese-English [11], Czech-English [12], Persian-English [13], and Portuguese-English and Portuguese-Spanish [14] parallel corpora. Furthermore, the sources of these parallel corpora often cover restricted topics such as legislation (e.g., debates of the European Parliament), administrative documents, and technical documentation such as OS software manuals. To our knowledge, OPUS is probably the largest collection of freely available parallel corpora, covering different languages with considerable size and variety [15]. For instance, it contains