Building The Sense-Tagged Multilingual Parallel Corpus

Shan Wang, Francis Bond
Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore
wangshanstar@gmail.com, bond@ieee.org

Abstract

Sense-annotated parallel corpora play a crucial role in natural language processing. This paper reports our progress in creating such a corpus for Asian languages, using English as a pivot; it is the first such corpus for these languages (Chinese, Japanese and Indonesian). Two sets of tools have been developed, one for sequential and one for targeted tagging, and both are easy to set up for new languages. The paper also briefly presents the general guidelines for the project. We illustrate the current results of the monolingual sense-tagging and multilingual linking, which show differences among genres and language pairs. All the tools, guidelines and the manually annotated corpus will be freely available at http://compling.ntu.edu.sg/ntumc .

Keywords: sense-tagging, multilingual corpus, parallel corpus

1. Introduction

Semantically annotated corpora are of significant value in natural language processing. In particular, sense-annotated corpora based on Princeton Wordnet (Fellbaum 1998) have been widely developed (Petrolito & Bond 2014). One such corpus is English SemCor (Landes et al. 1998), which is among the earliest sense-tagged corpora. After it was created, Italian, Romanian and Japanese translations of it were made and sense-tagged (Bentivogli & Pianta 2005; Lupu et al. 2005; Tan & Bond 2012). These SemCors have been used in a large number of tasks (Kilgarriff 1998; Gonzalo et al. 2000; Navigli et al. 2003; Gutiérrez et al. 2011). However, there is no such resource for Asian languages.
Instead of translating the English SemCor into Asian languages, we made use of the Nanyang Technological University Multilingual Corpus (NTU-MC), which contains 595,000 words (26,000 sentences) in seven languages (Arabic, Chinese, English, Indonesian, Japanese, Korean and Vietnamese) from seven language families (Afro-Asiatic, Sino-Tibetan, Indo-European, Austronesian, Japonic, Korean as a language isolate, and Austro-Asiatic) (Tan & Bond 2012; Bond et al. 2013). We selected four of these languages for further annotation: English, Chinese, Japanese and Indonesian. The corpus of each language was first manually sense-tagged with Princeton Wordnet (Fellbaum 1998), Chinese Open Wordnet (Wang & Bond 2013a, 2013b), Japanese Wordnet (Isahara et al. 2008) and Wordnet Bahasa (Nurril Hirfana et al. 2011), respectively. The Chinese, Japanese and Indonesian corpora were then linked to the English corpus at the concept level (Bond et al. 2013; Bond & Wang 2014). To the best of our knowledge, this is the first such multilingual corpus for these Asian languages. All the tools, guidelines and the annotated corpus will be freely available at http://compling.ntu.edu.sg/ntumc . With this project, we aim to provide a useful resource for the community.

The remainder of the paper is organized as follows. Section 2 introduces the tools, guidelines and quality control of the corpus. Section 3 illustrates the current results of the annotated corpus. Section 4 summarizes the paper and gives directions for future work.

2. Building Sense-tagged Multilingual Parallel Corpora

Though there are some parallel corpora (Koehn 2005; Cyrus 2006; Čulo et al. 2008; Volk et al. 2010) and sense-tagged corpora (Ng & Lee 1996; Mingqin et al. 2003), multilingual sense-tagged corpora are rare. The only one we know of is English SemCor and its translations into Italian, Romanian and Japanese. This project aims to create a sense-tagged parallel corpus for Asian languages by utilizing the texts of NTU-MC.
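The two-step design described in the introduction (monolingual sense-tagging, then concept-level linking to the English pivot) can be pictured with a minimal Python sketch. This is an illustration only, not the project's tooling: the `Concept` record, the `link_by_synset` helper, the language codes and the synset identifiers are all hypothetical, and the automatic matching on shared synset identifiers is a simplifying assumption, whereas the linking in this project is done manually by annotators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One sense-tagged concept (hypothetical record layout)."""
    lang: str    # language code, e.g. "eng" or "cmn" (assumed convention)
    sid: int     # sentence identifier within that language's text
    lemma: str   # lemma of the content word or multiword expression
    synset: str  # synset identifier; the values below are placeholders


def link_by_synset(pivot, other):
    """Pair concepts from two aligned sentences that share a synset id.

    Assumption: both wordnets use the same synset identifiers, so an
    identical id signals the same concept in both languages.
    """
    by_synset = {}
    for concept in pivot:
        by_synset.setdefault(concept.synset, []).append(concept)
    return [(p, o) for o in other for p in by_synset.get(o.synset, [])]


# Toy aligned sentence pair with placeholder synset ids.
eng = [Concept("eng", 1, "dog", "00000001-n"),
       Concept("eng", 1, "bark", "00000002-v")]
cmn = [Concept("cmn", 1, "狗", "00000001-n"),
       Concept("cmn", 1, "叫", "00000003-v")]

links = link_by_synset(eng, cmn)
for p, o in links:
    print(p.lemma, "<->", o.lemma)
```

In this toy pair only the two nouns share an identifier, so only they are linked; the two verbs carry different identifiers and remain unlinked, which is the kind of case a human annotator would have to judge.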
The current size of the corpus we are tagging is shown in Table 1. There are 7,093 sentences in the English texts, which are translated into Chinese, Japanese and Indonesian, making a total of 22,762 sentences. Words are all the tokens, while concepts refer to content words and multiword expressions (MWE). The actual numbers are changing as the project progresses. In the course of the project, we have identified three factors that speed up the development of such tasks: (i) convenient annotation tools, (ii) clear and detailed guidelines, and (iii) follow-up checking to guarantee quality control. All data are manually annotated by trained linguistics students.

2.1 Annotation Tools

We developed two sets of annotation tools: one for sequential/textual tagging (sentence by sentence) and one for targeted/lexical tagging (word by word) (Langone et al. 2004). The former is illustrated in Figure 1, which embeds