Partial parsing of spontaneous spoken French

Olivier Blanc¹, Matthieu Constant², Anne Dister¹,³, Patrick Watrin¹
¹ Université de Louvain, Belgium
² Université Paris-Est, LIGM & CNRS, France
³ Facultés universitaires Saint-Louis, Belgium
olivier.blanc@uclouvain.be, mconstan@univ-mlv.fr, anne.dister@uclouvain.be, patrick.watrin@uclouvain.be

Abstract

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing relies on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the text, such as disfluencies. Spoken specificities were formalized through a systematic linguistic study of a 40-hour corpus of speech transcriptions. The chunker uses large-coverage, fine-grained language resources for general written language, augmented with resources specific to spoken French. It iteratively applies finite-state lexical and syntactic resources and outputs a finite automaton representing all possible chunk analyses. The best path is then selected by a hybrid disambiguation stage. We show that our system reaches scores comparable with state-of-the-art results in the field.

1. Introduction

Large annotated corpora of transcribed spontaneous speech are of great interest for many fields of Natural Language Processing. Nevertheless, building them by hand is laborious, which calls for automatic tools. This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units.
This partial parsing relies on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the text, such as disfluencies. The chunker uses large-coverage and fine-grained lexical resources for general written language, augmented with resources specific to spoken French. In section 2., we describe the corpus used and its specificities. In section 3., we show how we dealt with them during the preprocessing stage. We then describe the architecture of our chunker (section 4.) as well as the language resources used (section 5.). The last section is dedicated to the evaluation of the whole process. We show that our chunker reaches scores comparable with the state of the art for French.

2. Spoken Corpus

The corpus we used is a sub-corpus extracted from the Valibel spoken textual data bank. It includes 60 texts transcribed from spoken conversations, totalling 443,047 graphical words. They correspond to approximately 40 hours of spontaneous speech, all recorded in the French-speaking part of Belgium. The talks are mainly semi-directed interviews and conversations between friends, and are characteristically unplanned (as opposed to texts written to be spoken). More details about this corpus (speakers, sociolinguistic information, context of the talks) can be found in (Dister, 2007).

The transcriptions follow the guidelines developed at the Valibel research center (Dister et al., 2006). The main principles are, for the most part, similar to those used in other laboratories¹ working on textual transcriptions of recorded speech. First, words are transcribed using their standard spelling. Transcriptions do not contain any punctuation marks, because the notion of sentence is not relevant for spoken language (Blanche-Benveniste and Jeanjean, 1987).
The sound continuum, which becomes linear in the transcription, is divided into speaking turns, delimited by changes of speaker. Silent pauses are annotated subjectively by the transcriber at three levels: short pause (/), long pause (//) and silence (///). The texts include phenomena that are specific to spoken language, such as disfluencies and overlapping speech segments, as illustrated in the example below:

(1) blaAD1  avec une / une ba/ une barre qui bah tu es / tu es en l'air et puis tu te laisses glisser |- le long <blaNB1> ouais -| d'une barre

    blaAD1  with a / a ba/ a bar which bah you are / you are in the air and then you let yourself slide |- along <blaNB1> yeah -| a bar

This transcription indicates that blaAD1 is speaking. The tag |- (resp. -|) opens (resp. closes) an overlapping segment, where blaNB1 says ouais while blaAD1 says le long. ba/ indicates a word (starting with ba) that has not been completed.

3. Preprocessing of the spoken data

In their present state, the transcriptions cannot be used as is in a chunker without significant modifications of the latter, because of the transcription format and the spoken specificities of the data. The goal of the preprocessing module is to detect any phenomena that are specific to spoken language and normalize them so that they can be processed

¹ Cf., for instance, the DELIC corpus (DELIC, 2004) or data from the Rhapsodie project, http://rhapsodie.risc.cnrs.fr/fr/index.html.
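The marker inventory described in section 2. (pause levels /, // and ///, overlap delimiters |- and -|, and the trailing slash on an uncompleted word such as ba/) lends itself to a simple surface tokenization before any deeper preprocessing. The sketch below is illustrative only: it assumes the simplified conventions presented above, and the function name tag_tokens is ours, not part of the Valibel tools.

```python
import re

# Marker conventions from section 2., matched longest-first:
#   /// silence, // long pause, / short pause,
#   |- ... -| overlapping speech, trailing "/" on an uncompleted word.
PATTERN = re.compile(r"(///|//|\|-|-\||/|[^\s/|]+/?)")

def tag_tokens(utterance):
    """Tag each surface token of a transcribed utterance with a marker type."""
    tags = []
    for tok in PATTERN.findall(utterance):
        if tok == "///":
            tags.append((tok, "SILENCE"))
        elif tok == "//":
            tags.append((tok, "LONG_PAUSE"))
        elif tok == "/":
            tags.append((tok, "SHORT_PAUSE"))
        elif tok == "|-":
            tags.append((tok, "OVERLAP_START"))
        elif tok == "-|":
            tags.append((tok, "OVERLAP_END"))
        elif tok.endswith("/"):
            tags.append((tok, "TRUNCATED_WORD"))
        else:
            tags.append((tok, "WORD"))
    return tags
```

On example (1), this would tag ba/ as a truncated word, the isolated slashes as short pauses, and |- / -| as overlap boundaries, leaving the remaining material as plain words for the chunker.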