SPIRAL CONSTRUCTION OF SYNTACTICALLY ANNOTATED SPOKEN LANGUAGE CORPUS

Tomohiro Ohno †, Shigeki Matsubara ‡, Nobuo Kawaguchi ‡ and Yasuyoshi Inagaki §

† Graduate School of Information Science, Nagoya University
‡ Information Technology Center/CIAIR, Nagoya University
§ Faculty of Information Science and Technology, Aichi Prefectural University
Furo-cho, Chikusa-ku, Nagoya, 464-8603 Japan
ohno@inagaki.nuie.nagoya-u.ac.jp

ABSTRACT

Spontaneous speech includes a broad range of linguistic phenomena characteristic of spoken language, and therefore a statistical approach would be effective for robust parsing of spoken language. Although stochastic parsing requires a large-scale syntactically annotated corpus, constructing such a corpus requires considerable human resources. This paper proposes a method of efficiently constructing a spoken language corpus annotated with dependency analyses. The method uses an existing spoken language corpus. A stochastic dependency parser is employed to tag spoken language sentences with dependency structures, and the results are corrected manually. The tagged corpus is constructed in a spiral fashion, wherein the corrected data is utilized as statistical information for the automatic parsing of other data. Taking this spiral approach reduces the parsing errors, which in turn reduces the correction cost. An experiment using 10,995 Japanese utterances shows the spiral approach to be effective for efficient corpus construction.

Keywords: Stochastic parsing, Dependency parsing, Language database, Spoken dialogue corpus

1. INTRODUCTION

A large-scale text corpus for which syntactic bracketing information is provided plays an important role in natural language processing.
In fact, parse-tree data for written language (such as that used in newspapers and magazines) in various languages, for instance, the Penn Treebank [6], NEGRA Treebank [13], TIGER Treebank [8], Prague Dependency Treebank [3], Kyoto corpus [9], and EDR corpus [2], have been widely utilized not only for language parsing, but also for information retrieval, automatic summarization, machine translation, and so on. Among these corpora, the EDR corpus and the Kyoto corpus are syntactically annotated corpora for Japanese, and were built with careful consideration of the syntactic features peculiar to Japanese. On the other hand, turning our attention to spoken language, although we can enumerate the Switchboard corpus [5], Verbmobil Treebanks [1], Spoken Dutch Corpus [11], etc., very few attempts have been made for Japanese spoken language so far.

Constructing a large-scale syntactically annotated corpus of spontaneously spoken language and utilizing it as statistical information would be effective for developing robust spoken language parsing. Since manually annotating a Japanese text corpus calls for several difficult tasks such as morphological analysis, bunsetsu segmentation, and dependency analysis¹, it requires considerable human resources.

This paper describes the spiral construction of a spoken language corpus in which a dependency structure is given to each utterance. A stochastic dependency parser is utilized for automatic annotation to construct the corpus at a lower cost; that is, our approach to corpus construction is to alternate between providing the dependency analyses automatically and repairing them manually. The key to this approach is parsing: the parser is based on statistical information, so the more learning data there is, the more precise the parsing.
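The alternation between automatic annotation and manual repair described above can be sketched as a simple loop. The following is only a minimal illustration under stated assumptions, not the procedure actually used in this paper: the names `spiral_construct`, `train`, `parse`, and `correct`, the batch-wise processing, and the representation of a dependency structure as a list of head indices are all hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical types for illustration: an utterance is a sequence of
# bunsetsus, and a dependency structure lists the head index of each one.
Utterance = Sequence[str]
Heads = List[int]
Annotated = Tuple[Utterance, Heads]


def spiral_construct(
    batches: Iterable[Sequence[Utterance]],
    seed: List[Annotated],
    train: Callable[[List[Annotated]], object],
    parse: Callable[[object, Utterance], Heads],
    correct: Callable[[Utterance, Heads], Heads],
) -> Tuple[List[Annotated], List[int]]:
    """Annotate batches in turn, retraining on all corrected data so far.

    Returns the annotated corpus and, for each round, the number of
    dependencies that the (assumed) manual correction step had to change.
    """
    corpus = list(seed)
    corrections_per_round: List[int] = []
    for batch in batches:
        # Re-estimate the statistical information from the corrected corpus.
        model = train(corpus)
        changed = 0
        for utterance in batch:
            auto = parse(model, utterance)    # automatic annotation
            gold = correct(utterance, auto)   # manual repair (a callback here)
            changed += sum(a != g for a, g in zip(auto, gold))
            corpus.append((utterance, gold))
        corrections_per_round.append(changed)
    return corpus, corrections_per_round
```

In this sketch, `corrections_per_round` is a rough stand-in for the correction cost: if parsing becomes more precise as the learning data grows, later rounds should need fewer changes per utterance.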
It can be expected that fewer corrections would be needed in the spiral construction than in the non-spiral one.

¹ A bunsetsu is one of the linguistic units in Japanese, and roughly corresponds to a basic phrase in English. A bunsetsu consists of one independent word and zero or more ancillary words. A dependency is a modification relation between two bunsetsus.

Stochastic dependency parsing was developed for