Volume 49, Number 3, 2008 ACTA TECHNICA NAPOCENSIS Electronics and Telecommunications

Manuscript received September 1, 2008; revised October 5, 2008

EXPLORING FINE PHONETIC DETAIL FOR ROMANIAN TEXT TO SPEECH SYNTHESIS

Mircea GIURGIU
Technical University of Cluj-Napoca, Telecommunications Department, 26 Baritiu Str., 400027 Cluj-Napoca, Romania
E-mail: Mircea.Giurgiu@com.utcluj.ro

Abstract: This paper presents several experimental results on the study of Fine Phonetic Detail (FPD), with the aim of creating the basis of a computational model for prosody description in a Text To Speech (TTS) synthesis system. Different manifestation levels of FPD are presented, together with their importance for TTS. The experiments reported here mainly consider the linguistic aspect of prosody, both qualitatively and quantitatively: the influence of the accent at word and sentence level, the intonation, the rhythm and the speech rate.

Key words: speech processing, fine phonetic details, prosody control, text to speech synthesis

I. INTRODUCTION

One of the most important problems in concatenative speech synthesis is the selection of the acoustic units. Some existing telecommunications applications use phrases or words, but this is applicable only if the system requires a high synthesis quality and the set of synthesised messages is very small. For unrestricted Text To Speech (TTS) synthesis of a language it is practically impossible to store all the words, which is why current TTS systems use smaller acoustic units such as phonemes, diphones, triphones or subphonemic segments. The phoneme appears to be an attractive linguistic unit for speech synthesis because of the limited number of phonemes in any language. Still, one major reason it is not used in practice is that the boundaries between phonemes usually correspond to areas that are acoustically volatile [1].
In the case of diphone concatenation, the acoustic segment captures all the transitional information that is usually present between the phonemes [4]. The goal of the reported experiments is to find out which factors have to be considered in the analysis of Fine Phonetic Details (FPD) in Romanian diphones, and to examine how a prosodic model could generate the acoustic parameters for a TTS system.

II. FINE PHONETIC DETAILS OF SPEECH

The classic definition of prosody refers to the speech features whose domain is not a single phonetic segment, but larger units of more than one segment, possibly a whole sentence. Consequently, prosodic phenomena are often called supra-segmental speech features. They appear to be used to structure the speech flow and are perceived as stress or accentuation, or as other modifications of intonation, rhythm and loudness. There are four principal manifestation levels of prosodic phenomena: a) linguistic level, b) articulatory level, c) acoustic level, d) perceptual level.

a) The linguistic intention level: the speaker can be assumed to employ prosodic coding with a certain intention. This intention can influence both linguistic and paralinguistic expression. Linguistic expression means any oral expression that uses language signs. Paralinguistic phenomena include non-verbal vocalisations that make an utterance sound angry, urgent or ironic. Examples of linguistic distinctions that tend to be communicated by prosodic means are the question-statement distinction or the semantic emphasis of an element. Systematic knowledge of how these phenomena are used in human speech can be expected to play a significant role in improving the naturalness of synthetic speech. From a linguistic point of view, prosody is generally thought of as relating different linguistic elements to each other, above all by accentuating certain elements of a text, by marking boundaries and by defining transitions between words or phrases.
b) The articulatory manifestation level: prosodic phenomena are physically manifested as a series of modifications of articulatory movement. Such phenomena do not result in separate, identifiable articulations. For example, the stressed syllable /ve/ in “vesel” (happy, jubilant) does not involve an articulatory movement distinct from the more neutral, unstressed articulation of the same syllable in “vesel” (serving dishes). Pertinent physical observations of prosodic manifestations thus typically include variations in amplitude or air pressure.

c) The acoustic realisation level: prosody may be observed and quantified using acoustic signal analysis. The main acoustic parameters bearing on prosody are fundamental frequency, intensity and duration.

d) The perceptual level: it refers to the perceptual reactions to prosodic phenomena and may be quantified in terms of pauses, length, pitch/melody and loudness.

In the following experiments priority is given to relating linguistic distinctions to acoustic aspects of prosody.
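
As an illustration of the three acoustic parameters named under c), the following minimal Python sketch measures fundamental frequency, intensity and duration on a short signal. It is not the analysis procedure used in this paper: the function names, the 75-400 Hz search range, the 8 kHz sampling rate and the synthetic 200 Hz tone standing in for a voiced segment are all assumptions made for the example, and the autocorrelation peak-picking shown here is only one generic way to estimate fundamental frequency.

```python
import math


def rms_intensity_db(frame, ref=1.0):
    """Intensity of a frame as its RMS level in dB relative to `ref` (assumed reference)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate F0 in Hz from the strongest autocorrelation peak.

    The raw (unnormalised) autocorrelation sum is used, which slightly
    favours shorter lags and so reduces octave-down errors on clean signals.
    The search range f0_min..f0_max is an assumed, typical speech range.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def duration_seconds(signal, sample_rate):
    """Duration of a segment, given its sample count and sampling rate."""
    return len(signal) / sample_rate


# Synthetic stand-in for a voiced segment: a 40 ms, 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(320)]

print("F0 (Hz):", estimate_f0(tone, sample_rate))
print("Intensity (dB):", round(rms_intensity_db(tone), 2))
print("Duration (s):", duration_seconds(tone, sample_rate))
```

On real speech these measurements would normally be taken frame by frame over a labelled segment such as a diphone, and a voicing decision would be needed before the fundamental frequency value is trusted.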