Computer Speech and Language (2000) 14, 177–210
Article No. 10.1006/csla.2000.0141
Available online at http://www.idealibrary.com

ProSynth: an integrated prosodic approach to device-independent, natural-sounding speech synthesis

Richard Ogden, Sarah Hawkins, Jill House,§ Mark Huckvale,§ John Local, Paul Carter, Jana Dankovičová§ and Sebastian Heid

University of York, University of Cambridge, §University College, London

Abstract

This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the acoustic richness of the speech signal reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by paying attention to systematic phonetic detail in the spectral, temporal and intonational domains produces a perceptually robust signal that is intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic–phonetic detail of real speech. We present examples of our approach to modelling systematic segmental, temporal and intonational detail and show how all are integrated in the prosodic structure. Preliminary tests to evaluate the effects of modelling systematic fine spectral detail, timing, and intonation suggest that the approach increases intelligibility and naturalness.

© 2000 Academic Press

1. Introduction

Speech synthesized by rule has yet to make a significant impact as an output channel for information systems, despite continued engineering advances in text-to-speech (TTS) systems. A recurrent complaint is the perceived “unnatural” quality of the synthetic speech: that the speech does not sound as if it could have been produced by a human speaker. Such problems persist despite improvements in textual analysis, pronunciation and signal generation.
For example, although the use of a large corpus of recorded speech for polyphone concatenation has produced signals in which some sections have a highly natural voice quality, utterances still exhibit disfluencies, broken rhythm and lack of coherence. Contemporary synthetic speech still suffers from unexpressive and often inappropriate prosody, and from poorly modelled coarticulation. These failings arise from the poverty of the linguistic representation underlying the utterance to be produced, as well as a fundamental lack of attention to the systematic fine detail in human production, detail that listeners expect and also utilize when listening in noise.

0885–2308/00/030177 + 34 $35.00/0 © 2000 Academic Press