Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech

Carlos Toshinori Ishi, Parham Mokhtari & Nick Campbell
JST/CREST at ATR/Human Information Science Labs
[carlos,parham,nick]@atr.co.jp

Abstract

With the aim of automatically categorizing phrase final tones, investigations are conducted on the relationship between acoustic-prosodic parameters and perceptual tone categories. Three types of acoustic parameters are proposed: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one related to the length of the phrase final. A classification tree is used to evaluate automatic categorization of phrase final tone types, resulting in 76% correct classification for the best combination among the proposed acoustic parameters. Experiments are also conducted to verify the perceived degree of pitch change within a phrase final, and the perceived degree of pitch reset. While a good relationship is found between the perceptual scores and some of the acoustic parameters, our results also advocate a continuous rather than a categorical relationship between some of the phrase final tone-types considered.

1. Introduction

Phrase finals in Japanese utterances convey both linguistic and paralinguistic information. For example, they convey grammatical information such as modality (declarative vs. interrogative), focus, punctuation of phrase boundaries, and continuity of the sentence. They also convey important paralinguistic information such as the manner and attitude of the speaker.

In linguistics and phonetics research, there have been many proposals for categorizing sentence final intonation [1,2,3]. However, such methods are usually based on auditory perception and rarely extend to an automatic categorization of intonation types. Phrase finals usually have greater prosodic variability in spontaneous speech than in read speech.
While the X-JToBI [4] labeling method was proposed in order to more adequately describe such variability, automatic labeling is still not possible. A goal of the present research is therefore automatic prosodic labeling of phrase finals in a large database of spontaneous, expressive speech collected in the JST/CREST ESP Project [5]. In this paper we focus on the description and automatic classification of phrase final prosody, specifically by analyzing the relationship between tone categories perceived by humans and acoustic-prosodic features extracted from the speech signal.

2. Analysis unit and definition of phrase finals

Our speech database consists of natural daily conversations recorded as part of the JST/CREST ESP Project. We used the prosodic phrase as the utterance unit for analysis. The prosodic phrases were segmented semi-automatically, boundaries being placed at evident pauses or pitch resets. For analysis, we used 404 phrases taken from three natural conversations with family members and with business people.

In this paper, phrase final is defined as the V (vowel) portion, or the VN (vowel + syllable-final nasal) portion of the last syllable of the phrase, i.e., the last syllable excluding the initial consonant. This definition is compatible with the perceptual rhythmic beat position (Perceptual Center, or P-Center) which is considered to be close to the vowel onset [6]. These rhythm properties have also been reported for Japanese speech [7,8]. The segmentation of the phrase finals was realized semi-automatically, using power and periodicity properties of the speech signal.

3. Categorization and labeling of phrase finals

Table 1 shows relations between the categorizations of sentence final particle tone types proposed by [3], and boundary pitch movement (BPM) labels proposed in the X-JToBI framework [4]. In the examples, indicates a pitch fall, indicates a pitch reset, and indicates a rising pitch movement.

Table 1.
Categorization of phrase final tones and corresponding phrase boundary pitch movement (BPM) labels.

Tone Type [3]   Perceptual Properties     Example      X-JToBI BPM [4]
1a              Low                       na i ne      L%
1b              Low + Falling tone        na i ne e    L%
2a              High                      na i ne      L%+H%
2b              High + Lengthened         na i nee     L%+H%>
2c              Low + Rising tone         na i ne      L%+LH%
3               High + Falling tone       na i ne e    L%+HL%
5               High + Fall-Rise tone     na i ne e    L%+HLH%

The categories shown in Table 1 are perceived as distinct tone-types that convey distinct paralinguistic functions in Japanese [1,2], but in this paper we focus on the problem of tone-type categorization, disregarding their functional properties. In particular, we adopt the tone category labels proposed by [3] (cf. Table 1).

In addition to this basic set of labels, we found it necessary to include a tag “E” (Extended) after the tone category label to mark situations when the phrase final is very lengthened; the functional properties of such phrase final lengthening are also pointed out in [1]. Furthermore, a tag “S” (Short) was added for the 2c category, when the rising curvilinearity perceived in the pitch movement within the phrase final is particularly short and fast; the short versus long distinction within the 2c category is also expressed with