Fluent speech prosody: Framework and modeling

Chiu-yu Tseng a,*, Shao-huang Pin a, Yehlin Lee a, Hsin-min Wang b, Yong-cheng Chen b

a Phonetics Lab, Institute of Linguistics, Academia Sinica, Taipei, Taiwan
b Institute of Information Science, Academia Sinica, Taipei, Taiwan

Received 16 September 2004; received in revised form 10 March 2005; accepted 28 March 2005

Abstract

The prosody of fluent connected speech is much more complicated than concatenating individual sentence intonations into strings. We analyzed speech corpora of read Mandarin Chinese discourses from a top–down perspective on perceived units and boundaries, and consistently identified speech paragraphs of multiple phrases that reflected discourse rather than sentence effects in fluent speech. Subsequent cross-speaker and cross-speaking-rate acoustic analyses of the identified speech paragraphs revealed systematic cross-phrase prosodic patterns in every acoustic parameter, namely F0 contours, duration adjustment, and intensity patterns, as well as in boundary breaks. We therefore argue for a higher prosodic node that governs, constrains, and groups phrases to derive speech paragraphs. A hierarchical multi-phrase framework is constructed to account for the governing effect, with complementary production and perceptual evidence. We show how cross-phrase F0 and syllable duration pattern templates are derived to account for the tune and rhythm characteristic of fluent speech prosody, and argue for a prosody framework that specifies phrasal intonations as subjacent sister constituents subject to higher terms. Output fluent speech prosody is thus the cumulative result of contributions from every prosodic layer. To test our framework, we further construct a modular prosody model of multiple-phrase grouping with four corresponding acoustic modules and begin testing the model with speech synthesis.
To conclude, we argue that any prosody framework of fluent speech should include prosodic contributions above individual sentences in production, with consideration of their perceptual effects on on-line processing; and that the development of unlimited TTS could benefit most appreciably from capturing and including cross-phrase relationships in prosody modeling.

© 2005 Published by Elsevier B.V.

Keywords: Prosodic phrase grouping; Top–down; PG; Prosodic hierarchy; Multi-phrase; Cross-phrase; Constraints; Templates; Speech planning; Look-ahead; Global F0 templates; Temporal allocations; Syllable duration patterns; Intensity distribution; Boundary breaks

Speech Communication 46 (2005) 284–309, www.elsevier.com/locate/specom
0167-6393/$ - see front matter. doi:10.1016/j.specom.2005.03.015

* Corresponding author. Tel.: +886 2 27863300x222; fax: +886 2 2652 3133.
E-mail addresses: cytling@sinica.edu.tw (C. Tseng), whm@iis.sinica.edu.tw (H. Wang).