Towards a High-Quality and Well-Controlled Finnish Audio-Visual Speech Synthesizer

Mikko Sams, Janne Kulju, Riikka Möttönen, Vili Jussila, Jean-Luc Olivés, Yongjun Zhang, Kimmo Kaski
Helsinki University of Technology, Laboratory of Computational Engineering, P.O. Box 9400, FIN-02015 HUT, Finland

Päivi Majaranta, Kari-Jouko Räihä
University of Tampere, Department of Computer Science, P.O. Box 607, FIN-33101 Tampere, Finland

ABSTRACT

We have constructed an audio-visual text-to-speech synthesizer for Finnish by combining a facial model with an acoustic speech synthesizer. The quality of the visual speech synthesizer has been evaluated twice. In addition, we have started to refine the facial model by taking a more physiologically and anatomically based approach. The synthesizer will be used to produce stimuli for studies of the neurocognitive mechanisms of audio-visual speech perception, which sets requirements of flexibility and full controllability for the synthesis. We are also developing applications for the synthesizer.

Keywords: audio-visual speech, facial animation, multimodality, physically-based model, speech synthesis.

1. INTRODUCTION

The perception of speech is normally audio-visual. By using both the auditory and visual modalities we can understand a message better than by relying on audition alone. The visual component improves the intelligibility of speech especially when the speech is degraded by noise [6], bandwidth limitation [5], hearing impairment or other disturbances. The two modalities convey complementary information: while some utterances (for example /ba/ and /da/) can be difficult to distinguish on the basis of auditory information alone, they are clearly distinguishable visually. On the other hand, /pa/ and /ma/ are visually very similar but easy to discriminate on the basis of the auditory signal. Visual speech perception has its natural limits: we cannot see the whole vocal tract, but have to rely primarily on information from the lips, tongue and teeth. Visual information is also crucial in determining the identity and emotional state of the talker, in perceiving the reactions of the listener, and in conducting a fluent dialogue between two or more people.

We have constructed the first version of a Finnish text-to-audio-visual-speech synthesizer [1], which can produce real-time speech from unrestricted written text. The visual part is a descendant of Parke's facial model [3], and it is synchronized with an acoustic text-to-speech synthesizer. The intelligibility of our synthesizer has been evaluated twice [2,8]. The model without a tongue was used in the first evaluation. The second evaluation was performed after the tongue model had been added to the visual speech synthesizer and some phoneme articulations had been improved on the basis of the first evaluation.

Our objective is to further improve the quality of the synthesis and to use the synthesizer as a stimulus generator for speech perception experiments. We are also developing applications for the synthesizer; it will be used, e.g., in teaching lip-reading. Achieving these goals requires an appropriate user interface for controlling the synthesizer.

2. THE CURRENT FACIAL MODEL

Our facial model is presented in Fig. 1. The geometry of the model is defined by slightly fewer than 1000 vertices that form about 1500 polygons. These figures exclude the tongue vertices and polygons, because their number cannot be stated unambiguously. The facial geometry is controlled with 49 parameters, 12 of which are used for visual speech.
The parameters used in speech production are based on the coordinate system of the model rather than on the physiological properties of the face.
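To make the parametric, coordinate-based control scheme described above concrete, the following is a minimal sketch of a Parke-style facial model in which each parameter linearly displaces a subset of vertices along fixed directions in the model's coordinate system. The class, parameter names (e.g. "jaw_opening") and the linear deformation rule are illustrative assumptions, not the actual implementation of our synthesizer.

import numpy as np

class ParametricFace:
    """Sketch of a parametric facial model: a neutral polygon mesh deformed by named parameters."""

    def __init__(self, vertices, polygons):
        # Neutral geometry: roughly 1000 vertices forming about 1500 polygons in our model.
        self.neutral = np.asarray(vertices, dtype=float)   # shape (N, 3)
        self.polygons = polygons                           # lists of vertex indices per polygon
        # Each parameter maps to the vertex indices it affects and a per-vertex
        # displacement direction expressed in the model's coordinate system.
        self.params = {}

    def add_parameter(self, name, indices, directions):
        self.params[name] = (np.asarray(indices), np.asarray(directions, dtype=float))

    def evaluate(self, values):
        """Return deformed vertex positions for a dict of parameter values."""
        v = self.neutral.copy()
        for name, value in values.items():
            idx, dirs = self.params[name]
            v[idx] += value * dirs   # linear, coordinate-based displacement
        return v

# Hypothetical usage: two of the 12 speech parameters set for a single viseme frame.
# deformed = face.evaluate({"jaw_opening": 0.6, "lip_rounding": 0.3})

Because such parameters act directly on vertex coordinates rather than on muscles or tissue, they are convenient for real-time control but do not reflect facial physiology; this is the limitation that motivates the more physiologically and anatomically based refinement of the model.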