A Rule-Based Approach to Build a Text-to-Speech System for Romanian

Ovidiu Buza, Gavril Toderean, Jozsef Domokos
Faculty of Electronics and Telecommunications
Technical University of Cluj-Napoca
Cluj-Napoca, Romania
Ovidiu.Buza@com.utcluj.ro, toderean@pro3soft.ro, jdomokos@com.utcluj.ro

Abstract—We present in this article our approach to building a text-to-speech system for Romanian. The main stages of this work were: voice signal analysis, region segmentation, construction of the acoustic database, text analysis, unit and prosody detection, unit matching, concatenation and speech synthesis. In our approach, we take word syllables as basic units and stress as the indicator of intrasegmental prosody. A distinctive characteristic of the current approach is the rule-based processing used in both the speech signal analysis and the text analysis stages.

Keywords- text-to-speech; rule-based approach; syllable detection

I. INTRODUCTION

This article presents our experience in building a voice synthesis system that meets the quality parameters of human speech. Our research led to the design of a voice synthesis method specifically adapted to the Romanian language, and to a working approach for constructing an automated speech synthesis system. Because it uses syllables as basic units, the proposed method belongs to the category of high-quality, concatenation-based methods.

The novelty of this work is that our approach is based on rules that apply in the most important stages of building the speech synthesis system (Fig. 1):
a) the signal processing stage, in which acoustic units are extracted from speech using association rules;
b) the text processing stage, in which linguistic units are extracted using phonetic and lexical rules.

Figure 1. Stages of system realisation in which specific rules apply

In the signal processing stage, we extracted the main parameters of pre-recorded speech, which are then used in speech segmentation.
We designed an algorithm that decomposes the signal into classes of regions associated with the phonetic categories of the Romanian language. We then used a semi-automated algorithm to separate syllables from the speech signal and store them in the vocal database.

In the text processing stage, special phonetic rules were developed for text processing, detection of linguistic units (syllables), and retrieval of prosody data (such as stress). Next, unit matching was performed by selecting acoustic units from the vocal database according to the linguistic units detected in the input text. Finally, the acoustic units are concatenated to form the output speech signal, which is synthesized by means of a digital audio card.

II. SIGNAL PROCESSING

Voice signal processing starts with the detection of signal parameters from recorded speech. This can be done in the time domain or the frequency domain. Time-domain processing, which we used, detects signal parameters directly from the waveform samples. We extracted the following parameters: maximum and median amplitude, signal energy, number of zero crossings, and fundamental frequency.

Signal amplitude [7] gives information about the presence or absence of speech and about the voiced or unvoiced character of the signal in the analyzed segment. In a voiced segment of speech, such as a vowel utterance, the amplitude is higher, whereas in an unvoiced segment it is lower.

Signal energy [8] characterizes the power carried by the speech signal. Voiced segments (such as vowels) have a higher mean energy, while unvoiced segments (such as fricative consonants) have a lower mean energy. For most speech segments, energy is concentrated in the 300-3000 Hz band.

The number of zero crossings is used to determine the frequency characteristics and the voiced/unvoiced features of speech segments [7].
Inside voiced segments the number of zero crossings is lower, while inside unvoiced segments this parameter takes much higher values.

[Figure 1 block labels: Signal, Acoustic Units, Association Rules, Text, Linguistic Units, Lexical Rules, Prosodical Data, Phonetic Rules, Signal Regions]

978-1-4244-6363-3/10/$26.00 © 2010 IEEE
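The time-domain parameters described in Section II (maximum and median amplitude, signal energy, and number of zero crossings) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `frame_parameters`, the 8 kHz sampling rate, the 400-sample frame length, and the two synthetic test frames (a sinusoid standing in for a vowel-like segment, weak noise standing in for a fricative-like segment) are all assumptions made for illustration. Fundamental frequency is omitted to keep the example short.

```python
import math
import random


def frame_parameters(frame):
    """Compute time-domain parameters for one frame of waveform samples.

    Hypothetical helper for illustration; names are not taken from the paper.
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    magnitudes = sorted(abs(s) for s in frame)
    n = len(magnitudes)
    mid = n // 2
    # Median of the absolute sample values.
    if n % 2:
        median_amplitude = magnitudes[mid]
    else:
        median_amplitude = 0.5 * (magnitudes[mid - 1] + magnitudes[mid])
    return {
        "max_amplitude": magnitudes[-1],
        "median_amplitude": median_amplitude,
        # Short-time energy: sum of squared samples in the frame.
        "energy": sum(s * s for s in frame),
        # Zero crossings: sign changes between consecutive samples.
        "zero_crossings": sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ),
    }


# Assumed synthetic test signals (not real speech): 50 ms frames at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 400

# Vowel-like frame: strong 100 Hz sinusoid.
voiced_like = [
    0.8 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(FRAME_LEN)
]

# Fricative-like frame: weak uniform noise with a fixed seed.
rng = random.Random(0)
unvoiced_like = [rng.uniform(-0.05, 0.05) for _ in range(FRAME_LEN)]

voiced_params = frame_parameters(voiced_like)
unvoiced_params = frame_parameters(unvoiced_like)

for name, params in (("voiced-like", voiced_params),
                     ("unvoiced-like", unvoiced_params)):
    print(name, params)
```

On these synthetic frames the sinusoid yields higher amplitude and energy and far fewer zero crossings than the noise, which matches the qualitative voiced/unvoiced behaviour described in the text. Real speech would need framing, and possibly windowing, before such parameters are compared.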