ICPhS XVII Regular Session Hong Kong, 17-21 August 2011

CHARACTERIZATION OF HESITATIONS USING ACOUSTIC MODELS

Arlindo Veiga a,c; Sara Candeias a; Carla Lopes a,b,c & Fernando Perdigão a,c

a Polo de Coimbra, Instituto de Telecomunicações, Coimbra, Portugal; b ESTG, Campus 2, Instituto Politécnico de Leiria, Leiria, Portugal; c DEEC, Polo II, Universidade de Coimbra, Coimbra, Portugal

aveiga@co.it.pt; saracandeias@co.it.pt; calopes@co.it.pt; fp@co.it.pt

ABSTRACT

Spontaneous speech is full of hesitations, such as fillers, word cut-offs, repetitions and segmental extensions. Automatic identification of such hesitations has several applications; however, it is a challenging research problem. In this paper, acoustic-phonetic properties of hesitation phenomena are explored in order to identify and annotate some of these events in a spontaneous speech corpus of Portuguese broadcast television news. Based on pitch, energy, spectral and durational characteristics of the filled pauses and segmental extensions during their production, we intend to characterize the acoustic-phonetic regularity of the phenomena. A speech recognition system was used to help locate the filled pauses and extensions. The events detected were then manually validated. Our preliminary results suggest that there are regular trends in the production of these hesitation events, which could distinguish them from other events within the structure of Portuguese. Our purpose with this work is to improve acoustic modeling for spontaneous speech recognition systems. Some insights into the process of human speech communication for Portuguese are gained as well.

Keywords: filled pauses, extensions, acoustic-phonetic features, Portuguese spontaneous speech

1. INTRODUCTION

Spontaneous and read speech have diverse structures both acoustically and syntactically.
The presence of hesitations such as filled pauses, extensions, repetitions and word cut-offs is very common in spontaneous speech and plays an important role in the structuring of speech [14, 24]. Hesitation events can be used to identify the idiosyncrasies of speakers and also to improve the performance of automatic speech recognition systems. In this study we concentrate on both the filled pauses (FPs) and extensions (EXs) present in spontaneous Portuguese speech. FPs comprise all sounds that phonetically belong to the Portuguese language but do not occur in the context of a complete word (e.g., uum, aaa, eee). By EXs we mean the phonetic prolongation of a segment in both functional and lexical words (e.g. the [ɐ] in <para> or the [u] in <do>). FPs and EXs in English and other languages have been widely addressed by the scientific community (e.g. in [4, 5, 6, 8, 12, 24]). For Portuguese, however, only a few studies focus on this problem. Although the main topic of the work by Freitas [9] and Delgado-Martins [7] is not FP phenomena, they show that duration features and syntactic information are responsible for the distinction between spontaneous speech, oral and reading presentations. In Mata [15], following Moniz and her colleagues [18], FP characteristics are presented to demonstrate the contribution of the fundamental frequency trend for on-line planning efforts both in spontaneous speech and in oral reading. An important report about the distinction between fluency and disfluencies in the communication process within a teaching context is presented by Moniz in [19]. Despite interesting conclusions, the study was based on a sample that is scarcely representative of the phenomenon. Updates made in [18] and [21], exploring prosodic cues in an attempt to classify (dis)fluency, seem to confirm the limited representativeness of the phenomenon.
The acoustic-phonetic characterization of FPs and EXs will certainly lead to improved speech recognition systems. In fact, the presence of hesitations in speech signals negatively affects the performance of automatic speech recognition (ASR) systems. Dealing with this problem becomes a challenge for the recognizers, and various techniques have been proposed to find hesitations in the speech signal, including FPs. Some studies deal with identifying the strict location in time of the hesitation event (e.g. [2, 16, 28]), while others