A Phase Vocoder Model of the Glottis for Expressive Voice Synthesis

Eduardo Reck Miranda
SONY Computer Science Laboratory Paris, 6 rue Amyot, 75005 Paris, France
Tel: +33 1 44 08 05 11, Fax: +33 1 45 87 87 50, E-mail: miranda@csl.sony.fr

[PATENT FILED IN EUROPE IN JUNE 2000 – 00 401 560.8]

Abstract

In this paper we explain how we are improving the source component of a source-filter vocal synthesis system. Our strategy for this improvement involves replacing the pulse generator with a phase vocoder module whose coefficients are derived from the analysis of speech signals. We first introduce the context of our research and state the problem; we then present our solution, followed by our conclusions and suggestions for further work.

1 The context of this research

In order to improve the linguistic abilities of computer and robotic systems, researchers at Sony CSL Paris are studying the fundamental mechanisms of language evolution. One of these mechanisms is the emergence of phonetic and prosodic repertoires. The study of these mechanisms requires a voice synthesiser that is able to: i) support evolutionary research paradigms, such as self-organisation and modularity; ii) support a unified form of knowledge representation for both vocal production and perception; and iii) speak and sing expressively (including emotion and paralinguistic features).

2 Introduction to the problem

There are two fundamental approaches to voice synthesis: the sampling approach and the source-filter approach [1]. In short, the former works by building an indexed database of digitally recorded spoken segments; a playback engine then assembles words by combining these segments sequentially. The latter synthesises sounds from scratch by mimicking the functioning of the human vocal tract. The sampling approach is normally preferred for manufacturing text-to-speech systems.
However, the sampling approach does not suit any of the three basic needs of our research. Conversely, the source-filter approach is compatible with requirements i) and ii) above, but the systems proposed so far must be improved in order to fulfil requirement iii). In this paper we indicate how we are making this improvement. In the following sections we present the generic architecture of a source-filter synthesiser and focus on the component that we are improving; we then present our strategy for the improvement. Finally, we present our conclusions and delineate further work.

3 Technical background

The source-filter model is based upon the insight that the production of vocal sounds can be simulated by generating a raw source signal which is subsequently moulded by a complex filter arrangement [2]. In humans, the raw sound source corresponds to the outcome of the vibrations created by the glottis, and the complex filter corresponds to the vocal tract “tube”. Of the various ways of implementing these filters, we have opted for the waveguide ladder technique [3] because of its ability to incorporate non-linear vocal tract losses into the model (e.g. the viscosity and elasticity of the tract walls). In general terms, the vocal tract is treated as a tube (with a side-branch for the nose) sub-divided into a number of cross-sections whose individual resonances are simulated by the filters. To facilitate the specification of the parameters for these filters, the system is normally furnished with an interface that converts articulatory information (e.g. the positions of the tongue, jaw and lips) into filter parameters; hence the source-filter model is sometimes referred to as the articulatory model [4]. Utterances are then produced by telling the program how to move from one set of articulatory positions to the next, much as in key-frame animation.
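The source-filter pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the waveguide ladder filter is replaced here by a pair of simple two-pole resonators, and the sample rate, pitch, formant frequencies and bandwidths are all assumed values chosen for the example.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumed for this sketch)

def pulse_source(f0, dur):
    """Naive glottal source: a unit impulse once per pitch period.
    This is the kind of simplistic pulse generator the paper proposes
    to replace."""
    src = np.zeros(int(SR * dur))
    src[::int(SR / f0)] = 1.0
    return src

def resonator(x, freq, bw):
    """Two-pole resonator standing in for one vocal-tract cross-section.
    (The paper uses a waveguide ladder filter; this far simpler stand-in
    is chosen only to illustrate the source-filter idea.)"""
    r = np.exp(-np.pi * bw / SR)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / SR)
    a2 = r * r
    y = np.zeros_like(x)
    y1 = y2 = 0.0
    for i, s in enumerate(x):
        yi = (1.0 - r) * s - a1 * y1 - a2 * y2
        y[i], y2, y1 = yi, y1, yi
    return y

# Source-filter synthesis: the raw pulse train is moulded by two
# formant filters (rough /a/-like formant values, assumed).
src = pulse_source(f0=120.0, dur=0.5)
vowel = resonator(resonator(src, 700.0, 130.0), 1220.0, 70.0)
```

In a full articulatory synthesiser, utterances would be produced by interpolating the filter parameters from one articulatory key frame to the next over the course of the sound, rather than holding them fixed as above.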
This articulatory simulation works satisfactorily for the filter part of the synthesiser. The importance of the source signal, however, has been largely overlooked: substantial improvements in the quality and flexibility of source-filter synthesis can be made by modelling the glottis more carefully. In the next section we explain how the source is normally implemented in currently available systems, and we propose an alternative that improves on this practice.
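The abstract proposes replacing the pulse generator with a phase vocoder module whose coefficients come from the analysis of recorded speech. The paper has not yet given implementation details at this point, so the following is only a generic phase-vocoder analysis/resynthesis round trip of the kind such a module would build on; the window size, hop size and the overlap-add gain constant are assumptions of this sketch.

```python
import numpy as np

def pv_analyse(signal, win=512, hop=128):
    """Phase-vocoder analysis: windowed short-time FFT of the signal,
    returning per-frame magnitudes and phases (the 'coefficients')."""
    w = np.hanning(win)
    frames = [signal[i:i + win] * w
              for i in range(0, len(signal) - win, hop)]
    spectra = np.fft.rfft(frames, axis=1)
    return np.abs(spectra), np.angle(spectra)

def pv_resynth(mags, phases, win=512, hop=128):
    """Windowed overlap-add resynthesis from the analysis coefficients.
    A source model would modify mags/phases (e.g. pitch, breathiness)
    before this step; here they are passed through unchanged."""
    w = np.hanning(win)
    out = np.zeros(hop * (len(mags) - 1) + win)
    for k, frame in enumerate(
            np.fft.irfft(mags * np.exp(1j * phases), n=win, axis=1)):
        out[k * hop:k * hop + win] += frame * w
    # Approximate Hann-squared overlap-add gain at 75% overlap.
    return out / 1.5

# Round trip on a synthetic 220 Hz tone standing in for a speech signal.
x = np.sin(2.0 * np.pi * 220.0 * np.arange(8000) / 16000.0)
mags, phases = pv_analyse(x)
y = pv_resynth(mags, phases)
```

Away from the edges (where the overlap is incomplete), the resynthesised signal closely matches the input, which is what makes the analysis coefficients a workable substitute for a hand-designed pulse generator.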