www.ccsenet.org/mas Modern Applied Science Vol. 4, No. 10; October 2010
Published by Canadian Center of Science and Education

Concatenative Synthesis of Persian Language Based on Word, Diphone and Triphone Databases

Reza Javidan (Corresponding author)
Computer Engineering Department, Islamic Azad University – Beyza Branch, Fars, Iran
Tel: 98-917-315-9656 E-mail: reza.javidan@gmail.com

Iman Rasekh
Computer Engineering Department, Islamic Azad University – Arak Branch, Arak, Iran
Tel: 98-917-3072-2481 E-mail: iman.rasekh@gmail.com

Abstract
In this paper, a new Persian text-to-speech (TTS) system based on the concatenative speech synthesis approach is proposed. Concatenative synthesis is used in most modern TTS systems to produce artificial speech, and selecting appropriate units for the speech database is a challenging part of the method. In the proposed approach, the database is built from speech units of different sizes, namely words, diphones and triphones, and is used to produce speech utterances. In the synthesis process, the smaller units (diphones and triphones) are used to achieve an unlimited vocabulary, while the word units are used to synthesize a limited set of sentences. Moreover, a dictionary of 600 common Persian words is built. Simulation results on prototype data show the effectiveness of the proposed method.

Keywords: Persian text-to-speech synthesis, Artificial neural networks, Concatenative synthesis, Word, Diphone, Triphone

1. Introduction
A text-to-speech (TTS) synthesizer is a computer-based program that processes text and reads it aloud. For most applications, the technology is expected to deliver speech of good, acceptable quality. High-quality speech synthesis from electronic text has been a focus of research activity over the past two decades and has led to a widening range of applications (T. Dutoit, 1997).
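The unit-selection idea in the abstract (prefer whole-word units, fall back to smaller units for unlimited vocabulary) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy dictionary, the unit names, and the treatment of a word as a plain phoneme string are all assumptions made for the example.

```python
# Backoff unit selection: prefer a whole-word unit from the word
# dictionary; for out-of-vocabulary words, back off to diphone units.
# WORD_DB and DIPHONE_DB below are hypothetical toy databases, not the
# paper's actual 600-word dictionary.

WORD_DB = {"salam", "ketab"}                # hypothetical word units
DIPHONE_DB = {"s-a", "a-l", "l-a", "a-m"}   # hypothetical diphone units

def to_diphones(word):
    """Decompose a word (treated as a phoneme string) into diphone names."""
    return [f"{a}-{b}" for a, b in zip(word, word[1:])]

def select_units(word):
    """Return the list of database units used to synthesize `word`."""
    if word in WORD_DB:
        # Limited-vocabulary path: a single prerecorded word unit.
        return [word]
    # Unlimited-vocabulary path: back off to smaller diphone units.
    return [d for d in to_diphones(word) if d in DIPHONE_DB]

print(select_units("salam"))  # whole-word unit
print(select_units("sala"))   # diphone backoff
```

A real system would extend the backoff chain with triphones and handle units missing from every database; this sketch only shows the word-versus-subword decision the abstract describes.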
To mention a few: commercial telephone response systems, natural-language computer interfaces, reading machines for the blind and other aids for the handicapped, language-learning systems, multimedia applications, and talking books and toys are among the many examples.

A speech synthesizer consists of two main components: a text processing component and a digital signal processing (DSP) module. The text processing component performs two major steps:
Step 1: Convert raw text containing symbols such as numbers and abbreviations into the equivalent written-out words; this process is often called text normalization.
Step 2: Convert the normalized text into an intermediate symbolic representation and pass it to the DSP module, or synthesizer, which transforms the symbolic information it receives into speech (A. Kain and J. P. H. van Santen, 2003).

The primary technologies for generating synthetic speech waveforms are formant synthesis and concatenative synthesis (R. J. Deller et al., 2000). Each technology has its own strengths and weaknesses, and the intended use of a synthesis system typically determines which approach is chosen. Formant synthesizers, which are usually controlled by rules, have the advantage of a small footprint at the expense of the quality and naturalness of the synthesized speech (Z. Namnabat and M. M. Homayunpoor, 2004).

The speech synthesizer described in this article is based on the concatenative synthesis approach. In concatenative synthesis, the output waveform is created by concatenating segments of natural speech recorded from a human speaker. Concatenating prerecorded utterances is the easiest way to produce intelligible and natural-sounding synthetic speech. However, the method is tied to one speaker and one voice, and the recorded utterances require a larger storage capacity than other methods of speech synthesis (O. Karaali, G. Corrigan and I. Gerson, 1996).
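The core concatenation step described above can be sketched as follows. This is a deliberately minimal illustration under stated assumptions: the unit names and the few-sample "recordings" are invented placeholders, and real systems additionally smooth the joins between units, which this sketch omits.

```python
# Concatenative waveform generation: each database unit maps to a
# prerecorded sample sequence, and the output utterance is simply their
# end-to-end concatenation. The tiny waveforms below are hypothetical
# stand-ins for real recordings.

UNIT_WAVEFORMS = {
    "salam": [0.0, 0.2, 0.1, -0.1],  # hypothetical whole-word recording
    "s-a":   [0.0, 0.3],             # hypothetical diphone recordings
    "a-l":   [0.3, 0.1],
}

def concatenate(units):
    """Join the prerecorded waveforms of the selected units end to end."""
    samples = []
    for unit in units:
        samples.extend(UNIT_WAVEFORMS[unit])
    return samples

speech = concatenate(["s-a", "a-l"])  # a two-diphone utterance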