Linguistic studies using large annotated corpora: Introduction

Hiroki NOMOTO and David MOELJADI
Tokyo University of Foreign Studies and Palacký University Olomouc
nomoto@tufs.ac.jp, david.moeljadi@upol.cz

1. Background [1]

Corpora have been used widely in modern linguistic research. Two notable features of corpus development in recent years are a significant increase in size and the addition of various kinds of annotations. Billion-word corpora are no longer uncommon. Efforts have been made to enrich raw texts with linguistic information, such as morphology, parts of speech (POS), constituent structure, semantic dependency, information-structural and discourse-structural status, and so on. However, these developments, which took place primarily in the field of natural language processing, have not been fully utilized in linguistic research on the languages of Nusantara.

This NUSA special issue was planned to encourage researchers to explore the available resources and share ways of using them to investigate old and new empirical and theoretical topics. We solicited submissions openly by means of an official call for papers. In the call for papers, we provided the list of available resources in (1) and requested that all manuscripts explicitly state which resource(s) they used and how they utilized the annotations. Besides the suggested annotated corpora, authors were also allowed to build their own corpus by annotating a raw corpus using a morphological dictionary (e.g. MALINDO Morph [2]), a POS tagger (e.g. MorphInd [3], Rule-Based POS Tagger Bahasa Indonesia [4]), an HPSG grammar (e.g. INDRA [5]) and so on.

(1) Examples of large annotated corpora
    a. MALINDO Conc (Nomoto, Akasegawa & Shiohara 2018a)
       (https://malindo.aa-ken.jp/conc/)
       Reclassified version of the Leipzig Corpora Collection (Goldhahn, Eckart & Quasthoff 2012; Nomoto, Akasegawa & Shiohara 2018b) [6]
       morphological annotation; Malay, Indonesian; 3 million words
    b. Korpus Indonesia (KOIN) (Kwary 2018)
       (https://korpusindonesia.kemdikbud.go.id/)

[1] Acknowledgements: This work was supported in part by JSPS KAKENHI Grant Number JP18K00568. We would like to express our sincere gratitude to all those involved in the publication process, especially the eight peer reviewers who devoted their precious time and expertise in their respective fields to improve the quality of this volume. All remaining errors are ours.
[2] Nomoto et al. (2018), https://github.com/matbahasa/MALINDO_Morph
[3] Larasati, Kuboň & Zeman (2011), http://septinalarasati.com/morphind/
[4] Rashel et al. (2014), https://github.com/andryluthfi/indonesian-postag
[5] Moeljadi, Bond & Song (2015), http://moin.delph-in.net/IndraTop
[6] http://wortschatz.uni-leipzig.de/en/

NOMOTO, Hiroki and David MOELJADI, 2019. ‘Linguistic studies using large annotated corpora: Introduction’. In Hiroki Nomoto and David Moeljadi, eds. Linguistic studies using large annotated corpora. NUSA 67: 1–6. [Permanent URL: http://repository.tufs.ac.jp/handle/10108/94450] [doi: 10.15026/94450]