Proceedings of Disfluency in Spontaneous Speech, DiSS 2013

Prediction of F0 height of filled pauses in spontaneous Japanese: a preliminary report

Kikuo Maekawa
National Institute for Japanese Language and Linguistics, Japan

Abstract

F0 values of filled pauses (FP) in the Corpus of Spontaneous Japanese were analyzed to examine the mechanism by which the F0 height of FP is determined. Statistical analyses of the F0 values of FP occurring between two full-fledged accentual phrases (AP) revealed a correspondence between the timing of FP occurrence and F0 height. Based on this finding, five models of F0 prediction were proposed. Comparison of the mean prediction errors revealed that the best prediction was obtained with a model that linearly interpolates between the phrase-final L% tone of the immediately preceding AP and the phrase-initial %L tone of the immediately following AP. This finding suggests that the F0 of FP is specified at the level of phonetic realization rather than in the phonological prosodic representation.

1. Introduction

Frequent occurrence of filled pauses (FP hereafter) is one of the most salient characteristics of spontaneous speech. There is wide consensus among researchers that FP play positive roles in the processing of spontaneous speech. The supposed cognitive roles of FP include prognosis of the perplexity of the upcoming word [1] or the complexity of the upcoming clause [2], marking of discourse structure [3], discourse management [4], and indication of the degree of factuality of university lectures [5], among others. There are also speech-analytic studies on the phonetic characteristics of FP ([6] among others) and application-oriented studies, including synthesis of dialogue speech [7] and recognition of spontaneous speech [8]. Despite their cognitive importance, the mechanisms of FP production have been left mostly untouched in the study of speech production.
In the study of speech prosody, for example, existing theories of prosodic structure pay no attention to the intonational or other prosodic characteristics of FP [9]. The lack of scientific knowledge in this field accordingly poses serious limitations on the design of prosodic annotation schemes for spontaneous speech. In the X-JToBI annotation scheme, which was proposed for the prosodic annotation of spontaneous speech [10], FP are treated as a special kind of accentual phrase (AP hereafter) whose pitch height is specified tonally as either FH (‘filler-high’) or FL (‘filler-low’). This binary labeling, however, was not proposed on a firm theoretical basis. It is rather a simple extrapolation of the established knowledge about Japanese prosody that L and H are required for the specification of linguistic contrast and pragmatic information. There is no a priori reason to believe that FP are specified in terms of a binary, or any other, tonal opposition.

In the rest of this paper, corpus-based analyses of FP are conducted in terms of their location in the utterance, their timing with respect to adjacent AP, and their F0 height, to determine whether the F0 height of FP can be predicted from the environment in which they occur.

2. The data

The ‘Core’ part of the Corpus of Spontaneous Japanese (CSJ hereafter), which is X-JToBI annotated, was used for the analyses [11]. The CSJ-Core contains 44 hours of speech comprising about half a million words. FP in the CSJ-Core are marked not only in the X-JToBI annotation but also in the speech transcriptions. Since the criteria for FP recognition are not identical in the prosodic annotation and the speech transcription, the total numbers of FP in the two do not coincide.
The main difference stems from the treatment of an FP (/de/; see Table 1) occurring at the beginning of an utterance, which is treated as an FP in the prosodic annotation but as an ordinary conjunction in the speech transcription. In the present study, FP were recognized according to the criteria of the X-JToBI scheme. The total number of FP analyzed in this study was 35,164.

As for textual properties, 160 different textual shapes were recognized in the speech transcription of the FP in the CSJ-Core. Since this classification is too detailed for the present analyses, FP were reclassified into 23 classes based on the similarity of their segmental shapes. These classes were further grouped into 8 classes. The results of this two-level classification are shown in Table 1 as Class1 and Class2, respectively. Note that FP whose occurrence frequencies were less than 10 were omitted from the classification.

Table 1: Textual classification of FP.