Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 86–89, Dublin, Ireland, August 23-29 2014.

NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus

Liling Tan
Universität des Saarland
Campus, 66123 Saarbrücken, Germany
alvations@gmail.com

Francis Bond
Nanyang Technological University
14 Nanyang Drive, Singapore 637332
bond@ieee.org

Abstract

The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University - Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpus of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Besides adding parallel data from diverse language pairs, annotating the corpus with various layers of information allows corpus linguists to discover linguistic phenomena and provides computational linguists with pre-annotated features for various NLP tasks. In addition to the agglomeration of existing tools into a single Python wrapper library, we have implemented three tools (Mini-segmenter, GaChalign and Indotag) that (i) provide users with varying analyses of the corpus, (ii) improve on state-of-the-art performance and (iii) reimplement a previously unavailable annotation tool as a free and open tool. This paper briefly describes the wrapper classes available in the toolkit and introduces and demonstrates the usage of the Mini-segmenter, GaChalign and Indotag.

1 Introduction

The NTU-MC Toolkit was developed in conjunction with the compilation of the Nanyang Technological University - Multilingual Corpus (NTU-MC) (Tan and Bond, 2012). It is an agglomeration of existing state-of-the-art tools into a single Python wrapper library.
The NTU-MC Toolkit provides Python wrapper classes for tokenizers and Part-of-Speech (POS) taggers for the respective languages:

• Stanford Segmenter and POS taggers (Arabic and Chinese)
• POSTECH POSTAG/K tagger (Korean)
• tinysegmenter and MeCab (Japanese)
• JVnTextPro (Vietnamese)

Additionally, we implemented three tools to provide complementary or better annotations, viz.:

• Mini-segmenter (Chinese): Dictionary-based Chinese segmenter
• GaChalign (Crosslingual): Gale-Church sentence-level aligner with variable parameters
• Indotag (Indonesian): Conditional Random Field (CRF) POS tagger

The following sections of the paper briefly describe the wrapper classes available in the toolkit (Section 2) and introduce and demonstrate the usage of the Mini-segmenter (Section 3), GaChalign (Section 4) and Indotag (Section 5).

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
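To illustrate what a single wrapper library over several language-specific tools might look like, the following is a minimal sketch and not the toolkit's actual interface. All names here (`register`, `tokenize`, the language codes, and the two stand-in tokenizers) are hypothetical and chosen only for illustration; a real wrapper would call out to the external tools listed in the introduction instead of the trivial functions used here.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a language code to a tokenizer callable.
# None of these names are taken from the NTU-MC Toolkit itself.
Tokenizer = Callable[[str], List[str]]
_TOKENIZERS: Dict[str, Tokenizer] = {}


def register(lang: str) -> Callable[[Tokenizer], Tokenizer]:
    """Decorator that files a tokenizer under a language code."""
    def decorator(func: Tokenizer) -> Tokenizer:
        _TOKENIZERS[lang] = func
        return func
    return decorator


@register("eng")
def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in for an external tool: split on whitespace.
    return text.split()


@register("cmn")
def character_tokenize(text: str) -> List[str]:
    # Stand-in for a real segmenter: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def tokenize(text: str, lang: str) -> List[str]:
    """Single entry point that dispatches on the language code."""
    try:
        tokenizer = _TOKENIZERS[lang]
    except KeyError:
        raise ValueError(f"no tokenizer registered for {lang!r}") from None
    return tokenizer(text)
```

Under these assumptions, a caller only needs `tokenize(text, lang)`; supporting another language amounts to registering one more function, which is the main appeal of gathering heterogeneous tools behind one interface.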