Workshop on friendly exchanging through the net, March 22-24, 2000

HEARING BY EYES THANKS TO THE “LABIOPHONE”: EXCHANGING SPEECH MOVEMENTS

Gérard Bailly, Lionel Revéret, Pascal Borel and Pierre Badin
Institut de la Communication Parlée (ICP) - UMR CNRS n° 5009, INPG & Université Stendhal
46, avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Tel.: ++33 04 76 57 47 11 - Fax: ++33 04 76 57 47 10
e-mail: bailly@icp.inpg.fr - http://www.icp.inpg.fr/

ABSTRACT

We present here the “labiophone”, a virtual system for audio-visual speech communication. A clone of the speaker is animated at a distance by articulatory movements extracted from the speaker’s image, captured by a video camera centered on the speaker’s face. The clone consists of a mesh driven by a few articulatory parameters and clothed with blended textures. The characteristics of the articulatory model and of the texture blending are transmitted at the initiation of the dialog. Thereafter, only articulatory parameters are transmitted, at a very low bit rate, over the telecommunication or web network. A preliminary evaluation of such a system is presented below.

Keywords: speech, facial animation, articulatory modelling, movement estimation, texture mapping.

1. INTRODUCTION

Speech communication is multi-modal: auditory and visual perception not only provide complementary information about the speaker and his or her emotional state, they also collaborate intimately to enhance the intelligibility of the message, especially in adverse conditions [17, 18, 5]. Coherence between speech and facial movements also helps segregate speech streams in a multi-speaker environment (the “cocktail-party” effect).

Coding standards such as H.261 and H.263 compress video streams at reasonable rates with a short coding delay. With new mesh- or region-oriented coders [4, 7], inter-personal audio-visual communication can be achieved over the existing telephone network. Similarly, video-conferencing plug-ins offering document sharing are available for the Web. These plug-ins work either in a “privileged speaker” mode, where only one speaker is visible on the screen, or in an “album” mode, where the different video frames are placed side by side. To create a unique virtual space gathering all participants, and to propose and control realistic viewpoints, new analysis/synthesis techniques based on implicit or explicit 3D talking-head models must be developed.

This paper introduces the “labiophone”, a virtual communication system based on the transmission of speech movements (see figure 1): movements captured on the video of each speaker control the animation of a virtual clone of that speaker (or possibly of an anonymous avatar . . . ). We describe below the broad outlines of the project, its key technical challenges, the solutions adopted, and a first evaluation of the system for capturing the movements of a 3D face model developed at ICP.
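To give a concrete sense of the bandwidth involved, the sketch below illustrates this two-phase scheme: the model characteristics and textures are sent once at dialog initiation, after which each animation frame costs only a few quantized articulatory parameters. The figures used (6 parameters, 8-bit quantization, 25 frames per second) are illustrative assumptions, not the actual labiophone specification.

    import struct

    N_PARAMS = 6        # assumed number of articulatory parameters per frame
    BITS_PER_PARAM = 8  # assumed uniform 8-bit quantization
    FRAME_RATE = 25     # assumed animation frame rate (frames/s)

    def encode_frame(params):
        """Quantize articulatory parameters (assumed to lie in [-1, 1])
        to one byte each for transmission."""
        quantized = [max(0, min(255, int(round((p + 1.0) * 127.5)))) for p in params]
        return struct.pack("%dB" % len(quantized), *quantized)

    def decode_frame(payload):
        """Recover approximate parameter values at the receiving clone."""
        return [b / 127.5 - 1.0 for b in struct.unpack("%dB" % len(payload), payload)]

    # Once mesh and textures have been transmitted, the running cost is only:
    bitrate = N_PARAMS * BITS_PER_PARAM * FRAME_RATE
    print("streaming bit rate: %d bit/s" % bitrate)   # 1200 bit/s

Under these assumptions the steady-state stream is about 1.2 kbit/s, orders of magnitude below the video bit rates targeted by coders such as H.261 or H.263.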
2. MODELLING VISIBLE SPEECH MOVEMENTS

The few tentative models of articulatory control for speech built so far have used linear articulatory models based on geometrical fitting [12, 15] or statistical analysis [11, 2] of the mid-sagittal vocal tract profile. These models thus consider the articulatory system as a passive system controlled by a set of “independent”, or quasi-orthogonal, articulatory parameters.

This approach contrasts with the biomechanical approach, in which a generic model of the musculo-skeletal system is adapted to the morphology of the speaker. Up to now, only partial biomechanical models have been proposed in the literature, covering orofacial structures [19], the tongue [21, 13], the jaw and hyoid bone [6] . . . Few attempts at coupling these heterogeneous models have been reported [16].

Despite its appealing genericity, the biomechanical approach faces major difficulties: the number of muscles (typically a few dozen for a complete musculo-skeletal system of the face) largely exceeds, by a factor of two or three, the number of articulatory parameters of geometric models. And although a better account of the biomechanical characteristics and properties of the controlled articulators has constantly been called for in speech motor control research - as early as 1970, Peter MacNeilage concluded: “It is obvious from the past few paragraphs that very little is at present known about many aspects of the dynamics of speech motor control which could provide clues as to the nature of the mechanism of target specification and attainment.” [10, p. 194] - no speech production model has so far succeeded in driving a biomechanical model from phonetic input.

We propose here a linear model of facial movements for speech, based on a statistical analysis of the motion of 64 points on a subject’s face. This model was developed in the framework of the “Tête parlante” project.
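As an illustration of how such a linear model can be derived from motion data, the sketch below applies a plain principal component analysis to the recorded positions of the 64 points: each face shape is approximated as a mean shape plus a weighted sum of a few basis deformations, the weights playing the role of articulatory parameters. Both the use of unconstrained PCA and the number of retained components are assumptions made for illustration; the statistical procedure actually used for the ICP model may differ.

    import numpy as np

    def build_linear_face_model(shapes, n_params=6):
        """shapes: (n_frames, 64, 3) array of tracked facial point positions.
        Returns a mean shape and n_params basis deformations (plain PCA)."""
        n_frames = shapes.shape[0]
        data = shapes.reshape(n_frames, -1)       # flatten to (n_frames, 192)
        mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
        return mean, vt[:n_params]                # principal deformation modes

    def synthesize_face(mean, basis, params):
        """Reconstruct a 64-point face shape from the articulatory parameters."""
        return (mean + params @ basis).reshape(64, 3)

    def analyze_face(mean, basis, shape):
        """Project an observed shape onto the model to recover its parameters."""
        return basis @ (shape.reshape(-1) - mean)

Because the basis vectors returned by the SVD are orthonormal, analysis and synthesis reduce to simple linear projections, which is what makes the parameters cheap to estimate from video and cheap to transmit.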