In Proceedings of the 4th ISCA Tutorial and Workshop on Speech Synthesis (SSW4), Perthshire, Scotland, September 2001, pp. 167-172

The DEMOSTHeNES Speech Composer

Gerasimos Xydas and Georgios Kouroupetroglou
University of Athens, Department of Informatics and Telecommunications
Division of Communication and Signal Processing
Panepistimiopolis, Ilisia, GR-15784 Athens, Greece
{gxydas, koupe}@di.uoa.gr

Abstract

In this paper we present the design and development of a modular and scalable speech composer named DEMOSTHeNES. It has been designed for converting plain or formatted text (e.g. HTML) to a combination of speech and audio signals. DEMOSTHeNES’ architecture extends the structure of current Text-to-Speech systems so that an open set of module-defined functions can interact with the text being processed at any stage of the text-to-speech conversion. Details of its implementation are given here. Furthermore, we present some techniques for text handling and prosody generation using DEMOSTHeNES.

1. Introduction

A number of modular Text-to-Speech (TtS) systems have been developed in recent years, such as CHATR [1], FESTIVAL [2] and EULER [3]. The two major issues for such architectures are how to accommodate the plethora of different linguistic representations and how to make efficient use of the information that these representations carry. Both issues have been addressed well by the FESTIVAL system through the introduction of the Heterogeneous Relation Graph (HRG) [4]. The DEMOSTHeNES Speech Composer [5] has been carefully designed to be a scalable system that is flexible to modify. The core of DEMOSTHeNES is based on the HRG, which was first implemented in FESTIVAL as its basic UTTERANCE structure. However, the architecture of DEMOSTHeNES differs, both to enable the implementation of the e-TSA Composer presented in [6] and [7] and to allow a more functional communication between the various modules of the system.
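The key property of an HRG-based utterance structure is that one pool of linguistic items can participate in several named relations (words, syllables, phrases, etc.) at once. The following is a minimal illustrative sketch of that idea, not the actual API of FESTIVAL or DEMOSTHeNES; all class and relation names here are hypothetical.

```python
# Sketch of an HRG-style utterance: a single Item object may be linked
# into several named relations, so every relation sees the same features.
class Item:
    def __init__(self, features):
        self.features = features  # e.g. {"name": "John", "pos": "NNP"}

class Utterance:
    def __init__(self):
        self.relations = {}  # relation name -> ordered list of shared Items

    def add(self, relation, item):
        self.relations.setdefault(relation, []).append(item)

utt = Utterance()
word = Item({"name": "John"})
utt.add("Word", word)    # the same Item object is linked
utt.add("Phrase", word)  # into a second relation as well
# Both relations now refer to the identical item, not a copy.
assert utt.relations["Word"][0] is utt.relations["Phrase"][0]
```

Because items are shared rather than copied, a module that updates a feature through one relation makes the change visible to every other relation immediately.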
An example of the importance of this issue is the following. Some TtS applications support the insertion of a small set of tags within a document in order to change its auditory behavior. For example, in the text "His name is <slow>John</slow>.", the tag <slow> allows the user, in some applications, to slow down the speaking rate. However, these tags are very specific, and the systems that use them usually have a monolithic architecture, which makes the tags easy to interpret. Modular architectures, on the other hand, need a more flexible mechanism for embedding such tags in a text. Moreover, these tags cannot be pre-defined, but should be defined according to the available functionality of the system. Our approach does not aim to provide an open set of such tags to the user, as mark-up languages like VoiceXML [8] do. According to the DEMOSTHeNES specifications, the system has to generate them from the source text. Thus, we call these tags embedded instructions.

Another issue that raised the need for an extension to current architectures comes from the fact that modern TtS systems ([2] and [3]) manipulate information not in raw form but in more complex, linked structures (e.g. metrical trees, MLDS, etc.). Thus, there should be a mechanism for synchronizing the ordinary information being analyzed with any embedded instruction.

The rest of this paper is organized as follows: in Section 2 we present the architecture of DEMOSTHeNES; in Sections 3 and 4 we present specific implementations of text handling and prosody generation.

2. Architecture

DEMOSTHeNES is a modular and open system. Its functionality is defined by customized plug-ins, the modules. Each module can implement an arbitrary number of linguistic, phonological, acoustical, etc. functions. However, the modules need a means of communicating with each other and exchanging functionality. This is done by the kernel and its components, which store the shared data that the modules exchange.
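The synchronization requirement raised above can be illustrated with a minimal sketch: if embedded instructions generated from the source text are kept in-line with the ordinary tokens, every later processing stage encounters them at exactly the right position in the stream. The class and field names below are hypothetical, chosen only for illustration; DEMOSTHeNES' actual structures are richer.

```python
# Sketch: embedded instructions interleaved with ordinary tokens, so that
# downstream stages stay synchronized with them. (Hypothetical names.)
from dataclasses import dataclass

@dataclass
class Token:
    text: str

@dataclass
class Instruction:  # e.g. generated from a formatting element in the source
    name: str
    value: float

stream = [Token("His"), Token("name"), Token("is"),
          Instruction("rate", 0.7),  # slow down, like the <slow> tag above
          Token("John"),
          Instruction("rate", 1.0)]

# A later stage consumes the stream in order, applying each instruction
# to the tokens that follow it.
rate = 1.0
spoken = []
for unit in stream:
    if isinstance(unit, Instruction) and unit.name == "rate":
        rate = unit.value
    else:
        spoken.append((unit.text, rate))
# "John" receives the reduced rate; the other words keep the default.
```

The design point is that instructions need no separate bookkeeping: their position in the stream is their synchronization.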
Thus, there are three basic elements (classes) in the architecture of DEMOSTHeNES (the prefix V in the following terms stands for Vocal): VSERVER, which is a communication channel for the other elements; VCOM (component), which provides essential structures and services for linguistic, phonological, acoustical, etc. procedures; and VMOD (module), which inherits from VCOM, manipulates the structures of the VCOMs, implements extra functionality and defines the behavior of the system as a linked element (plug-in). We present them in detail in the next paragraphs. The basic diagram of this architecture is given in Figure 1. This scheme is very scalable in terms of functionality and performance, as it can be downscaled to meet specific hardware specifications, and it allows modules to be inserted and removed, modifying the capabilities of the system.

2.1. Vocal Server (VSERVER)

VSERVER actually implements a Directory Service where all the functionality offered by the VCOMs and VMODs is stored. The Directory Service reserves a namespace for each VCOM (and VMOD), where it keeps information (signatures) about the name and the address of the services they offer as
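A directory service of this kind can be sketched as a two-level registry: each component reserves a namespace, and within it registers the services it offers under their names, so that other modules can look functionality up at run time. This is a minimal sketch under assumed names (DirectoryService, register, lookup), not VSERVER's actual interface.

```python
# Sketch of a VSERVER-style Directory Service: one namespace per
# component, each mapping service names to the registered callables.
# (All names here are illustrative assumptions.)
class DirectoryService:
    def __init__(self):
        self.namespaces = {}  # component name -> {service name: callable}

    def register(self, component, service, func):
        """A component announces one of its services under its namespace."""
        self.namespaces.setdefault(component, {})[service] = func

    def lookup(self, component, service):
        """Another module retrieves a service by component and name."""
        return self.namespaces[component][service]

vserver = DirectoryService()
# A hypothetical phonetics component registers a letter-to-sound service.
vserver.register("Phonetics", "letter_to_sound", lambda word: list(word))
lts = vserver.lookup("Phonetics", "letter_to_sound")
```

Because modules obtain services only through the directory, a component can be replaced or removed without the callers holding direct references to it, which matches the paper's goal of inserting and removing modules freely.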