Development of a Large Spontaneous Speech Database of Agglutinative Hungarian Language

Tilda Neuberger, Dorottya Gyarmathy, Tekla Etelka Gráczi, Viktória Horváth, Mária Gósy, and András Beke

Research Institute for Linguistics of the Hungarian Academy of Sciences, Department of Phonetics, Benczúr 33, 1068 Budapest, Hungary
{neuberger.tilda,gyarmathy.dorottya,graczi.tekla,horvath.viktoria,gosy.maria,beke.andras}@nytud.mta.hu

Abstract. In this paper, a large Hungarian spoken language database is introduced. This phonetically-based multi-purpose database contains various types of spontaneous and read speech from 333 monolingual speakers (about 50 minutes of speech per speaker). This study presents the background and motivation of the development of the BEA Hungarian database, describes its protocol and the transcription procedure, and also presents existing and proposed research using this database. Owing to its recording protocol and transcription, it provides challenging material for various comparisons of segmental structures of speech, including comparisons across languages.

Keywords: database, spontaneous speech, multi-level annotation

1 Introduction

Nowadays the application of corpus-based and statistical approaches in various fields of speech research is a challenging task. Linguistic analyses have become increasingly data-driven, creating a need for large and reliable spoken language databases. In our study, we aim to introduce the Hungarian database named BEA, which provides useful material for various segmental-level comparisons of speech, including comparisons across languages. Hungarian, unlike English and other Germanic languages, is an agglutinating language with diverse inflectional characteristics and a very rich morphology. It is also characterized by a relatively free word order. There are a few spoken language databases for highly agglutinating languages, for example for Turkish [1] and Finnish [2].
Language modeling of agglutinating languages needs to differ from the modeling of languages such as English [3]. There are corpora of various sizes, with different numbers of speakers and diverse levels of transcription. The TIMIT Acoustic-Phonetic Continuous Speech Corpus was created for training speaker-independent speech recognizers. This database consists of read sentences from 630 American English speakers and includes time-aligned orthographic, phonetic and word transcriptions [4]. The Verbmobil database (885 speakers) was also developed in the 1990s for speech technology purposes [5]. The spoken part of the British National Corpus (100 million words) [6] consists of informal dialogues that were collected in different contexts, ranging from formal business or government meetings to radio shows. The London–Lund Corpus contains 100 texts of spoken British

P. Sojka et al. (Eds.): TSD 2014, LNAI 8655, pp. 424–431, 2014. This is a preprint prepared by the Proceedings editor for Springer International Publishing Switzerland.