Comparing a Data-Driven and a Rule-Based Approach to Predicting Prosodic Features of German

Hansjörg Mixdorff 1,2, Oliver Jokisch 2

1 Faculty of Computer Sciences, Berlin University of Applied Sciences
2 Laboratory of Acoustics and Speech Communication, Dresden University of Technology
Mixdorff@tfh-berlin.de, Oliver.Jokisch@ias.et.tu-dresden.de

Abstract

The perceived quality of synthetic speech strongly depends on its prosodic naturalness. Building on earlier work by Mixdorff on a linguistically motivated model of German intonation based on the Fujisaki model, the current paper presents a perceptual comparison between a rule-based sequential model and a data-driven integrated model of prosody. The experiment comprised resynthesis and diphone synthesis stimuli produced with prosodic features predicted by the two models. In addition, a number of reference stimuli from an earlier experiment, produced by means of controlled prosodic degradation, were included. Isolated sentences from two different corpora served as speech material. Results show that the integrated model outperforms the rule-based model in terms of the accuracy of the predicted phone durations. In terms of F0 contour generation, however, the integrated model is not rated better. Furthermore, these findings are only significant for the resynthesis stimuli, whereas the diphone synthesis stimuli are generally judged to be of poor quality, irrespective of the prosodic model. This clearly speaks against testing diphone synthesis samples against resynthesis samples in the same evaluation, since segmental quality overrides prosodic quality.

1. Introduction

It is widely acknowledged that the intelligibility and perceived naturalness of synthetic speech depend strongly on its prosodic quality.
Recent systems that concatenate larger chunks of speech from a database achieve considerably high quality (see, for instance, [1]), as they preserve the natural prosodic structure at least within the chosen chunks and aim to minimize the distortion incurred at their edges. These systems, however, are often domain-specific, and the question of optimal unit selection still calls for the development of improved prosodic models.

Earlier work by Mixdorff focussed on a model of German intonation which uses Fujisaki's quantitative formulation of the F0 production process [2] to parameterize F0 contours. The contour is described as a sequence of linguistically motivated tone switches (major rises and falls), which are modeled by onsets and offsets of accent commands connected to accented syllables or boundary tones. Prosodic phrases correspond to the portion of the F0 contour between consecutive phrase commands [3]. The model was integrated into the TU Dresden TTS system DRESS and was shown to produce a high degree of naturalness compared with other approaches [4]. Perception experiments, however, indicated shortcomings in the duration component of the synthesis system and raised the question of how the intonation and duration models should interact in order to achieve the highest possible prosodic naturalness. Most conventional TTS systems for German, such as DRESS, calculate prosodic parameters sequentially, generating syllable durations first and then aligning the F0 contour appropriately.
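
As an illustration of the Fujisaki formulation [2] referred to above, the following minimal Python sketch computes an F0 contour as the superposition, in the log domain, of a base frequency, phrase components (responses to impulse-like phrase commands) and accent components (responses to step-like accent commands with an onset and an offset). The function names, the time constants alpha, beta and gamma, and all command values are illustrative assumptions and are not taken from this paper.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response Gp(t) of the phrase control mechanism (zero for t < 0)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response Ga(t) of the accent control mechanism, limited by the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    fb          -- base frequency in Hz
    phrase_cmds -- list of (onset_time, magnitude) tuples
    accent_cmds -- list of (onset_time, offset_time, amplitude) tuples
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        # An accent command is a step that switches on at t1 and off at t2.
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


if __name__ == "__main__":
    # Hypothetical utterance: one phrase command and one accent command.
    phrase_cmds = [(0.0, 0.5)]
    accent_cmds = [(0.3, 0.6, 0.4)]
    for step in range(0, 11):
        t = step * 0.1
        f0 = fujisaki_f0(t, 80.0, phrase_cmds, accent_cmds)
        print(f"t = {t:.1f} s  F0 = {f0:.1f} Hz")
```

In this sketch, the onset and offset of each accent command play the role of the rising and falling tone switches described above, and the span between two consecutive phrase command onsets corresponds to one prosodic phrase.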