ON THE RELATIVE IMPORTANCE OF DIFFERENT PROSODIC FACTORS FOR IMPROVING SPEECH SYNTHESIS

I. Bulyko, M. Ostendorf, and P. Price

Boston University, Boston, MA
SRI International, Menlo Park, CA

ABSTRACT

We present results of perceptual experiments designed to assess the relative importance of several prosodic factors in synthetic speech. The results show that naturalness, relative to a target speaking style, can be significantly improved through both symbolic label prediction and better F0 and duration generation. Our experiments used a novel perceptual experiment paradigm in which each test subject is supplied with two reference utterances, in order to obtain reliable absolute scores that indicate the magnitude of improvement. The approach gives ratings that are comparable across experiments. Results also show a strong interaction between detailed F0 and duration controls.

1. INTRODUCTION

A growing number of speech recognition applications are creating an increasing demand for better quality speech output. Further, the possibility of generating speech from "concept" provides the opportunity for prosody to play an even more important role, not only in improving naturalness and intelligibility, but also in contributing to the perception of a particular speaking style. Many aspects of prosody could be improved, from the placement of symbolic accent and phrase boundary markers to the control of continuously varying parameters such as phone duration and fundamental frequency. In this paper we assess the relative contributions of different aspects of prosody to achieving a target speaking style in the context of concatenative synthesis.

Due to the subjective nature of speech perception, evaluation of synthetic speech is a difficult problem [8] and has been the subject of ongoing debate. A widely used approach to measuring speech naturalness is to ask subjects to indicate whether one utterance was better than, equal to, or worse than another [2, 4].
Even though this relative scale reliably indicates differences, it fails to show the magnitude of improvement. In this work we implement a perceptual experiment paradigm that supplies each test subject with two reference utterances, thereby making the scoring scale more quantitative and comparable across experiments.

2. PERCEPTUAL EXPERIMENTS

2.1. General Method

Experiments were conducted in which twelve naïve subjects, all native speakers of American English, listened to several versions of each of eight synthetic utterances and scored them on a scale of 1 to 10. The target speech was from a corpus of radio news stories, and all of the utterances were generated by the Entropic TrueTalk speech synthesizer [9] in a 16-bit, 16 kHz format. For each sentence, listeners were provided with two reference versions of the sentence and four versions to score. The references were the synthesizer's default text-to-speech output (score 1 = least natural) and a version synthesized with natural phone durations and F0 contour, as measured from a version of the sentence spoken in an actual radio news broadcast (score 10 = most natural). The use of reference versions was intended to help the subjects focus attention on prosodic differences rather than on segmental quality, which would be the same in all six versions. Any pronunciation errors made by the synthesizer were corrected in all versions, again to keep the focus on prosodic factors. The use of references was also intended to reduce subject variability in scoring and to present the target speaking style to the listeners as a more concrete definition of “natural”.

Because it was too tedious for subjects to rate several versions of an utterance at once, two experiments were run. There was some overlap in the stimuli to test whether the scores would be similar across experiments. Four subjects participated in both experiments, but the experiments were separated in time by a few months.
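
As an illustration of how ratings from two such experiments might be compared on their overlapping stimuli, the following is a minimal Python sketch. It is not the analysis code used in this study: the function names, the tuple layout, and the version labels are assumptions made for the example, and the ratings are invented placeholders rather than data from the experiments.

```python
from collections import defaultdict
from statistics import mean

# Scale endpoints; in the paradigm above they are anchored by the two references.
SCALE_MIN, SCALE_MAX = 1, 10


def mean_scores(ratings):
    """Average the 1-10 ratings per stimulus version.

    `ratings` is an iterable of (subject, sentence, version, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_version = defaultdict(list)
    for subject, sentence, version, score in ratings:
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score {score} is outside the 1-10 scale")
        by_version[version].append(score)
    return {version: mean(scores) for version, scores in by_version.items()}


def overlap_differences(ratings_a, ratings_b):
    """Mean-score difference (A minus B) for versions rated in both experiments."""
    means_a = mean_scores(ratings_a)
    means_b = mean_scores(ratings_b)
    shared = sorted(set(means_a) & set(means_b))
    return {version: means_a[version] - means_b[version] for version in shared}


# Placeholder ratings, invented for illustration only (not data from the paper).
experiment_1 = [
    ("s1", 1, "version_A", 4), ("s2", 1, "version_A", 6),
    ("s1", 1, "version_B", 7), ("s2", 1, "version_B", 8),
]
experiment_2 = [
    ("s3", 1, "version_A", 5), ("s4", 1, "version_A", 5),
    ("s3", 1, "version_C", 9), ("s4", 1, "version_C", 8),
]
differences = overlap_differences(experiment_1, experiment_2)
print(differences)
```

A difference near zero for a shared version would suggest that the reference-anchored scale is being used consistently across experiments; a real analysis would also need to account for subject variability.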
Subjects were asked to listen to the reference utterances first and to assume that those utterances had been given scores of 1 (least natural) and 10 (most natural). They could then listen to the test utterances and score each for naturalness on a scale of 1 to 10. The four versions of each test utterance were arranged in random order to account for learning bias; however, the order of presentation of the eight sentences preserved the flow of the discourse in the original news story, so that the target prosodic style would indeed seem appropriate or “natural”. Subjects were allowed to play the test and reference utterances as many times as they wanted. Listening was performed via loudspeakers in an isolated room.

2.2. Prosodic Control Variables

The prosodic parameters that we allowed to vary in our experiments include symbolic labels (phrase breaks, pitch accents and tones) and acoustic parameters (phone duration, pitch range and F0 contour). For the stimuli where “natural” symbolic labels were used, the labels were taken from a hand-labeled prosodic transcription of the target utterances based on the ToBI labeling system [5]. Phrase breaks included the locations of minor and major phrase boundaries (ToBI break levels 3 and 4, respectively). For the cases where breaks and accents are used but no tones, the synthesizer's default tone assignment is used. For the case where the ToBI tones