CONSTRUCTION OF THE ACOUSTIC INVENTORY FOR A GREEK TEXT-TO-SPEECH CONCATENATIVE SYNTHESIS SYSTEM

Costas Christogiannis, Theodora Varvarigou, Agatha Zappa, Yiannis Vamvakoulas
Telecommunications Laboratory
Department of Electrical and Computer Engineering
National Technical University of Athens
9 Iroon Polytechniou, 15773, Athens, GREECE

Chilin Shih
Speech Synthesis Research Department
Bell Laboratories, Lucent Technologies
700 Mountain Avenue, Murray Hill, NJ, USA, 07974

Amalia Arvaniti
Department of Foreign Languages and Literatures
University of Cyprus, P.O. Box 20537, Nicosia 1678, CYPRUS

ABSTRACT

The development of the Greek Text-To-Speech (TTS) system by NTUA is based on the method of concatenative synthesis and follows the Bell Labs approach to this technique. Concatenative synthesis is one of the simplest methods for speech synthesis and at the same time bypasses most of the problems encountered by articulatory and formant synthesis techniques. The method relies on designing and creating the acoustic inventory of the language by taking real recorded speech, cutting it into segments and concatenating these segments back together during synthesis. The design and implementation of the acoustic database is a key factor for the performance of the synthesizer, since all the possible phone-to-phone transitions must be considered in order to minimize abrupt discontinuities and thus maximize the naturalness of the synthesized utterances.

1. INTRODUCTION

A Text-To-Speech (TTS) synthesizer is a computer-based system able to read any text and convert it into speech that resembles as closely as possible a native speaker of the language reading that text. Thus Text-To-Speech can be defined as the automatic production of speech, through a grapheme-to-phoneme transcription of the sentences to utter.
In general, every TTS synthesizer has two basic structural modules: (i) a Natural Language Processing (NLP) module, capable of producing a phonetic transcription of the text to be read, together with the desired intonation and rhythm (often referred to as prosody); and (ii) a Digital Signal Processing (DSP) module, which transforms the symbolic information it receives into speech.

Intuitively, the operations involved in the DSP module are the computer analogue of dynamically controlling the articulatory muscles and the vibratory frequency of the vocal folds, so that the output signal matches the input requirements. To do this properly, the DSP module should take articulatory constraints into account, since it has long been known that phonetic transitions are crucial for the understanding of speech [1]. This can be achieved in two ways:

• Explicitly, in the form of a series of rules which formally describe the influence of phonemes on one another.
• Implicitly, by storing examples of phonetic transitions and co-articulations in a speech segment database, and using them just as they are, as ultimate acoustic units.

Two main classes of TTS systems have emerged from these alternatives, which quickly turned into synthesis philosophies, given the divergences they present in their means and objectives: synthesis-by-rule and synthesis-by-concatenation.

Rule-based synthesizers [2] constitute a cognitive, generative approach to the phonation mechanism. They appear in the form of articulatory and formant synthesizers, which describe speech as the dynamic evolution of parameters [3], mostly related to formant and anti-formant frequencies and bandwidths together with glottal waveforms. Unfortunately, the large number of (coupled) parameters complicates the analysis stage and tends to produce analysis errors. Moreover, formant frequencies and bandwidths are inherently difficult to estimate from speech data.
Concatenative synthesizers, in contrast to rule-based ones, possess very limited knowledge of the data they handle: most of it is embedded in the segments to be chained together. This feature renders concatenative synthesizers simple yet effective in terms of the quality of the synthetically produced speech.

This paper describes the design and preparation of the acoustic database to be incorporated in a concatenative TTS system for Modern Greek, along with the sequence of tasks that had to be completed before the synthesizer could produce its first utterance.
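To make the notion of covering every phone-to-phone transition concrete, the following minimal Python sketch enumerates the candidate transitions for a phone set and maps a phone sequence onto the chain of stored units needed to synthesize it. It is an illustration only, under stated assumptions: the four-symbol phone set, the boundary symbol "#" and the function names are hypothetical and are not the Greek inventory or the software of the system described here, and the units are treated as simple ordered phone pairs.

```python
from itertools import product

# Hypothetical toy phone set; NOT the Greek inventory used in the system.
PHONES = ["a", "i", "p", "s"]
SILENCE = "#"  # assumed symbol for an utterance boundary


def all_transitions(phones, boundary=SILENCE):
    """Enumerate every ordered phone-to-phone pair, including boundary pairs."""
    symbols = [boundary] + list(phones)
    return [
        (left, right)
        for left, right in product(symbols, repeat=2)
        if not (left == boundary and right == boundary)
    ]


def to_units(phone_sequence, boundary=SILENCE):
    """Map a phone sequence to the chain of transition units that covers it."""
    padded = [boundary] + list(phone_sequence) + [boundary]
    return list(zip(padded, padded[1:]))


def missing_units(phone_sequence, inventory):
    """Return the units required by the sequence that the inventory lacks."""
    return [unit for unit in to_units(phone_sequence) if unit not in inventory]


if __name__ == "__main__":
    inventory = set(all_transitions(PHONES))
    print(len(inventory), "candidate transitions")
    print(to_units(["p", "a", "s"]))
    print(missing_units(["p", "a", "s"], inventory))
```

Under these assumptions, a set of N phones yields (N + 1)^2 - 1 candidate transitions once the boundary symbol is included, which suggests why the inventory has to be designed systematically rather than collected ad hoc. A transition missing from the inventory would surface in `missing_units` before synthesis is attempted.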