THE PUBLISHING HOUSE OF THE ROMANIAN ACADEMY
PROCEEDINGS OF THE ROMANIAN ACADEMY, Series A, Volume 11, Number 1/2010, pp. 92–99

RECENT ADVANCES IN ROMANIAN LANGUAGE TEXT-TO-SPEECH SYNTHESIS

Dragoş BURILEANU *, **, Cristian NEGRESCU *, Mihai SURMEI *

* “Politehnica” University of Bucharest, Faculty of Electronics, Telecommunications and Information Technology
** Romanian Academy Center for Artificial Intelligence, Bucharest
Corresponding author: Dragoş BURILEANU, E-mail: bdragos@mESsnet.pub.ro

Spoken language interfaces play an increasing role in human-machine interaction and are becoming a necessity for most of the new practical applications and services demanded by our modern and “mobile” world. This is mainly because communication networks offer simple and inexpensive access to a large amount of diversified information and services, contributing to the development of many economic and social domains of the Information Society. In particular, machine voice output is already widely required to expand the potential of commercial applications, adding flexibility, speed and naturalness to existing interfaces. This paper describes our recent work on developing a high-quality text-to-speech synthesis system for the Romanian language and presents a telecommunication platform, based on a client-server architecture and standardized signaling protocols, for accessing text information through communication channels.

Key words: Text-to-speech synthesis; Non-uniform unit selection; Harmonic plus Noise Model; Partial processing; Client-server architecture; RSS reader; Open protocols; MRCP; SIP/RTP.

1. INTRODUCTION

Two important characteristics of our Information Society are easily noticed today: mobility, expressed by the widespread use of portable electronic devices, and users’ steady demand for simpler, more natural and more effective interaction with their portable systems, so as to get easier access to information at any time and from anywhere. Speech technology, in turn, can offer the simplest and most natural interface to a computing environment, allowing hands-free and eyes-free operation and wider access to information and services. As a result, and owing to the significant advances in spoken language processing over the last two decades, many speech-enabled applications, together with new families of intelligent and interactive services based on voice input/output, have become available [5, 8].

In particular, machine voice output is widely required today to access a variety of services in network-based applications. Speech synthesis technology can be a viable option for fast, easy and efficient access to text messages over communication networks. A service that gives access to a web feed or to the messages stored on an e-mail server (for example) through an ordinary phone line and a portable device certainly gains added value from this new dimension of mobility.

However, reading aloud written messages such as news feeds or e-mail/SMS messages encounters two major problems. First, in this kind of application one cannot predict the message to be spoken, so the system must generate speech from arbitrary text (database records, e-mail/SMS messages, etc.). This task can be accomplished only by a complete text-to-speech (TTS) synthesis system, which must provide at least very good intelligibility for the resulting speech to be helpful and accepted by the user [4].
Second, one must choose between two possible approaches: placing the whole TTS application on the mobile device, or using standard communication channels and a client-server architecture [1, 10]. Because of the computing and memory resource constraints and the cost and power consumption limitations of a complete in-device solution, the second approach is currently preferred more often. Despite the difficulty of developing highly natural synthesis systems, industry speech application developers are making notable efforts to design and propose good-quality TTS solutions; several companies already use TTS synthesis to provide diverse information to users over the telephone line.