Effects of Focus on Prosody of Cantonese Speech – A Comparison of Surface Feature Analysis and Model-Based Analysis

Wentao Gu and Tan Lee
Department of Electronic Engineering, the Chinese University of Hong Kong, Hong Kong
{wtgu, tanlee}@ee.cuhk.edu.hk

Abstract

The effects of focus on F0 contours and syllable durations of spoken Cantonese are investigated through a controlled experiment involving all the disyllabic tone pairs. Regardless of the tones, syllable durations in the vicinity of the on-focus syllable are lengthened at systematically varying rates. For the F0 pattern, two approaches are compared: analysis of surface features from time-normalized F0 contours, and analysis-by-synthesis of time-intact F0 contours based on the command-response model. The effects of focus on surface F0 features, if any, include raising of F0 values and expansion of F0 ranges, which appear similarly in the pre-focus, on-focus, and post-focus domains. These surface phenomena can be explained better by tone and phrase commands in the model. Finally, a general framework is proposed by integrating these analyses.

Index Terms: Cantonese, focus, F0 contour, syllable duration, tone, command-response model

1. Introduction

An accurate and linguistically meaningful representation of speech prosody is necessary for synthesizing high-quality speech in various languages. This problem is difficult, however, because the surface prosodic features of speech (such as F0 contour, syllable duration, and source intensity) show great variations, which are not random but play an important role in conveying not only linguistic but also para- and non-linguistic information. This is especially the case for tone languages, in which the F0 contours of continuous speech are largely constrained by lexical tones but at the same time deviate from the canonical tone forms in each syllable, owing to contextual factors that convey a variety of information.
Since paralinguistic factors are consciously controlled by speakers, they are crucial for expressive speech synthesis. Among paralinguistic factors, emphatic focus (or 'focus' for short) is especially important in speech communication. Many studies have shown that syllables under focus tend to be lengthened, but most of them did not investigate the durational effect over a wider range. For F0, Xu's study on Mandarin [1] showed that focus has different effects on F0 ranges in three distinct domains, viz., on-focus expansion, post-focus suppression (i.e. lowering and compression), and pre-focus intactness; hence, focus brings about substantial F0 downtrends. Recently, Jia et al. [2] further clarified that such expansion and suppression of the F0 range result mostly from raising and lowering high pitch targets, while low pitch targets are hardly affected by focus.

In the present study we investigate the effects of focus on F0 and syllable duration in Cantonese, a tone language with more lexical tones than Mandarin. Since source intensity is believed to be less important for speech synthesis, it is not discussed here, though the source intensity of a syllable under focus apparently tends to increase.

As for the target representation for analysis, there is little controversy about syllable duration. For F0, however, there are two possible approaches. Most previous works inspect surface F0 features such as onset/offset/peak/valley/mean F0 values, the slope of the F0 curve, or the range of F0 values in a target syllable. This kind of analysis has three drawbacks. First, it does not explicitly separate global phrase intonation from local tone patterns, and hence gives only a confounded result. Second, the surface measurements are phenomenological and cannot efficiently capture the essential characteristics of F0 movements. Third, they are vulnerable to microprosody and noise in F0.
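To make the surface measurements listed above concrete, the following is a minimal Python sketch of how they might be computed for one syllable. It is an illustration under stated assumptions, not the procedure used in this study: the function names `time_normalize` and `surface_features`, the choice of 10 normalized points, linear interpolation, and the least-squares slope are all assumptions introduced here.

```python
def time_normalize(f0, n_points=10):
    """Resample an F0 track to n_points equally spaced samples.

    Linear interpolation and the default of 10 points are assumptions
    made for this sketch; the paper does not specify them here.
    """
    if len(f0) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(f0) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional index into f0
        i = min(int(pos), last - 1)       # left neighbour
        frac = pos - i
        out.append(f0[i] * (1.0 - frac) + f0[i + 1] * frac)
    return out


def surface_features(f0, duration_s):
    """Return common surface F0 features for one syllable.

    f0 is a list of voiced F0 samples (same unit in, same unit out),
    assumed to be equally spaced over duration_s seconds.
    """
    n = len(f0)
    if n < 2 or duration_s <= 0:
        raise ValueError("need at least two samples and a positive duration")
    times = [duration_s * k / (n - 1) for k in range(n)]
    t_mean = sum(times) / n
    f_mean = sum(f0) / n
    # Least-squares slope of F0 against time (unit of f0 per second).
    num = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, f0))
    den = sum((t - t_mean) ** 2 for t in times)
    return {
        "onset": f0[0],
        "offset": f0[-1],
        "peak": max(f0),
        "valley": min(f0),
        "mean": f_mean,
        "range": max(f0) - min(f0),
        "slope": num / den,
    }
```

Note that `peak`, `valley`, and `range` each depend on a single sample, so one spurious F0 value changes them directly; this is the sensitivity to microprosody and noise named as the third drawback.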
Hence, the other approach is to give a parametric representation of the F0 contour by employing a mathematical model. We shall try both approaches and compare them.

2. Cantonese tone and intonation

Cantonese has 9 lexical tones, comprising 6 non-entering tones and 3 entering tones. Syllables with entering tones have an unreleased stop coda /p/, /t/, or /k/, and hence are shorter than those with non-entering tones. Since the three entering tones show F0 patterns similar to those of the respective non-entering level tones and can be distinguished from the latter by syllable structure, some schemes, such as Jyutping in Hong Kong, define only six tones by merging each entering tone with its non-entering counterpart.

Table 1 gives a few sets of descriptions of the six tones of Cantonese. The third column shows the conventional 5-level tone code system for phonetic notation of tones. The levels, from 1 to 5, indicate relative pitch targets from low to high, respectively. The rightmost column gives the tone command patterns proposed recently [3] on the basis of the command-response model [4], as will be discussed in Section 4.

There have been a few studies on the effects of focus in Cantonese. Man [5] claimed that focus does not affect tone identities but lengthens syllable duration and expands the F0 range, whereas Gu et al.'s study [6] based on the command-response model showed that the major effect on F0 is an increase of the phrase command before the on-focus syllable (hence a wide-range increase of F0 thereafter), while the basic patterns of tone commands are preserved. In a recent study [7], we investigated the joint effects of tonal context and focus on surface F0 features through a deliberate control of the domain of focus (as narrow as a single syllable). In this work, we extend the study to syllable duration, and also investigate the effects on F0 by a model-based approach.

Table 1. Descriptions of the Cantonese tone system.
Tone type   Pitch feature   Tone code   Command pattern
T1          high level      55          + +
T2          high rising     35 or 25    - +
T3          mid level       33          0 0
T4          low falling     21 or 11    = =
T5          low rising      13 or 23    - 0
T6          low level       22          - -
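For readers who want to work with Table 1 programmatically, the table can be encoded as a small lookup structure. The sketch below is only one possible encoding: the names `Tone`, `TONES`, `pitch_targets`, and `direction` are hypothetical, and the command-pattern symbols are copied from the table as opaque strings without interpreting them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tone:
    label: str
    pitch_feature: str
    tone_codes: tuple   # alternative 5-level codes, e.g. ("35", "25")
    commands: tuple     # command-pattern symbols copied verbatim from Table 1


# Rows of Table 1; field names are illustrative, not from the paper.
TONES = {
    "T1": Tone("T1", "high level", ("55",), ("+", "+")),
    "T2": Tone("T2", "high rising", ("35", "25"), ("-", "+")),
    "T3": Tone("T3", "mid level", ("33",), ("0", "0")),
    "T4": Tone("T4", "low falling", ("21", "11"), ("=", "=")),
    "T5": Tone("T5", "low rising", ("13", "23"), ("-", "0")),
    "T6": Tone("T6", "low level", ("22",), ("-", "-")),
}


def pitch_targets(code):
    """Split a two-digit 5-level tone code into (onset, offset) levels."""
    if len(code) != 2 or not code.isdigit():
        raise ValueError("tone code must be two digits, e.g. '35'")
    onset, offset = int(code[0]), int(code[1])
    if not (1 <= onset <= 5 and 1 <= offset <= 5):
        raise ValueError("levels must lie between 1 (low) and 5 (high)")
    return onset, offset


def direction(code):
    """Classify a tone code as 'rising', 'falling', or 'level'."""
    onset, offset = pitch_targets(code)
    if offset > onset:
        return "rising"
    if offset < onset:
        return "falling"
    return "level"
```

For example, `direction("35")` gives `"rising"` and `direction("21")` gives `"falling"`. The alternative code 11 for T4 is classified as `"level"` by this rule, which shows that the two-digit code alone does not always reproduce the pitch-feature label in the second column.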