Perceptual evaluation of text-to-speech implementation of enclitic stress in Greek

S.-E. Fotinea, A. Protopapas, D. Dimitriadis and G. Carayannis
Institute for Language and Speech Processing
Epidavrou & Artemidos 6, 151 25 Maroussi, Greece
evita@ilsp.gr

Abstract

This paper presents a perceptual evaluation of a text-to-speech (TTS) synthesizer in Greek with respect to acoustic registration of enclitic stress and related naturalness and intelligibility. Based on acoustical measurements and observations of naturally recorded utterances, the corresponding output of a commercially available formant-based speech synthesizer was altered and the results were subjected to perceptual evaluation. The pitch curve, intensity, and duration of the syllable bearing enclitic stress were acoustically manipulated, while a phonetically identical phrase contrasting only in stress served as a control stimulus. Ten listeners judged the perceived naturalness and preference (in pairs) and the stress pattern of each variant of a base phrase. Intensity modification adversely affected perceived naturalness while increasing perceived stress prominence. Duration modification had no appreciable effect. Pitch curve modification tended to improve perceived naturalness and preference, but the results failed to reach statistical significance. The results indicate that the current prosodic module of the speech synthesizer reflects a good balance between prominence of stress assignment, intelligibility, and naturalness.

Introduction

In Greek, only one of the last three syllables of a word may be stressed (Antepenultimate Rule). As a result, stress enclisis is a very common and interesting phenomenon, arising in certain cases when a clitic is attached to its preceding word. Special treatment is required when intonation words formed with clitics trigger stress enclisis, which places two stressed syllables in one intonation word.
As Holton, Mackridge & Philippaki-Warburton (1997) note, “The phonological word is a domain in which under certain conditions, more than one stress may appear: a basic and a derived one.” With the attachment of the enclitic grammatical word, the Antepenultimate Rule may be violated. To recover, a second stress appears in the word. For example, /tok'alima/ (the cover) becomes /tok'alim'amu/ (my cover). The violation of the Antepenultimate Rule is repaired by the appearance of the second stress, which is corrective and thus stronger (Holton, Mackridge & Philippaki-Warburton, 1997). Stress enclisis occurs in the following situations:

1. When a noun, adjective, adverb, or verb is stressed on the antepenultimate and is followed by a weak personal pronoun belonging to the same phrase, a secondary stress must be placed on the last syllable of the first word. For instance, /x'aris'emuto/ (give it to me).
2. When a verb in the imperative is stressed on the penultimate and is followed by two weak pronouns, a secondary stress must be placed on the pronoun nearer to the verb. For instance, /r'ikset'uto/ (throw it to him).
3. If a gerund stressed on the antepenultimate is followed by one or two weak pronouns, a secondary stress is placed on the last vowel of the gerund. For instance, /r'ixnond'asmu/ (throwing to me).

Earlier work on pitch curve evolution (Fotinea, Vlahakis & Carayannis, 1997; Fotinea, 1999) has revealed that double-stress intonation words that are not sentence final and are uttered in an affirmative, neutral-stress manner can be piece-wise linearly approximated by an F0 pattern called the Double-Stress Introductory (Db-I) pattern, which displays two points of pitch rise.
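As a rough illustration of what a piece-wise linear approximation with two pitch rises looks like, the following minimal Python sketch interpolates an F0 contour between a handful of breakpoints. The breakpoint positions and all F0 values in Hz are placeholders chosen for illustration, not measurements from the cited work, and the names `DB_I_BREAKPOINTS`, `f0_at`, and `rising_segments` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical breakpoints: (position in syllable units, F0 in Hz).
# All numbers are illustrative placeholders, NOT values from the cited studies.
DB_I_BREAKPOINTS = [
    (0.0, 120.0),  # start of the intonation word
    (1.0, 120.0),  # flat stretch up to the first stressed syllable
    (2.0, 150.0),  # first pitch rise
    (3.0, 140.0),  # slight declination over the unstressed syllable
    (4.0, 180.0),  # second, more evident rise lasting to the end of the word
]


def f0_at(position, breakpoints=DB_I_BREAKPOINTS):
    """Linearly interpolate F0 at a position between the given breakpoints."""
    xs = [x for x, _ in breakpoints]
    if position <= xs[0]:
        return breakpoints[0][1]
    if position >= xs[-1]:
        return breakpoints[-1][1]
    i = bisect_right(xs, position) - 1
    (x0, y0), (x1, y1) = breakpoints[i], breakpoints[i + 1]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


def rising_segments(breakpoints=DB_I_BREAKPOINTS):
    """Return the linear segments over which F0 increases."""
    return [(a, b) for a, b in zip(breakpoints, breakpoints[1:]) if b[1] > a[1]]
```

With these placeholder breakpoints, `rising_segments()` yields exactly two segments, one per stressed syllable, and the second rise is the larger of the two.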
The pitch curve remains almost flat until the first stressed syllable, where a pitch rise is observed. A slight declination follows in the next (unstressed) syllable, allowing for a new and more evident pitch increase at the next stressed syllable that lasts to the end of the intonation word. This pattern has also been observed by other researchers (Arvaniti, 1992).

In this work we investigated the interdependency of all three prosodic parameters, namely intensity, duration, and pitch curve evolution, for the special case of stress enclisis, in the context of automatic speech synthesis. A formant text-to-speech synthesizer for Greek that has been developed at ILSP and is commercially available (“Ekfonitis”) was used to evaluate the implementation of stress enclisis. In this system, intensity, duration, and pitch can be controlled independently. A prosodic module is also available that assigns stress to appropriately selected syllables according to rules, by manipulating all three parameters in unison.

The objective of this study was to evaluate the existing rules and to further explore the parametric space for the acoustic realization of enclitic stress. To this end, a set of short utterances was created, specifically contrasting stress among lexical items with and without attached clitics. As detailed below, listeners judged the stress pattern intelligibility and the naturalness of the synthesized speech as a function of combinations of the three prosodic parameters. An important objective of this study was to employ a number of different assessment measures in order to separate (subjective) perceived preference from accurate perception of an intended stress pattern.

Method

Acoustical measurements on natural recordings

In order to study the acoustic marking of enclitic stress assignment, a set of sentences was constructed that contrasted regular stress, enclitic stress, and no stress in phonetically balanced phrases.
The basic phrase was /k'anod'as tu to to pr'a ma/ (“doing the thing to him”), perfectly contrasting only in stress with /k'anodas t'uto to pr'a ma/ (“doing this thing”). Additional variations were recorded and studied (/k'anodas to pr'a ma/ – “doing the thing”; /k'anod'as tu t'uto to pr'a ma/ – “doing this thing to