INTERFACE: a Matlab© Tool for Building Animated
MPEG4 Talking Heads from Motion-Captured Data
Graziano Tisato, Piero Cosi, Carlo Drioli, Fabio Tesser
Istituto di Scienze e Tecnologie della Cognizione
Sezione di Padova “Fonetica e Dialettologia”
Via G. Anghinoni, 10 – 35121 Padova (ITALY)
+39 049 8274413
[tisato, cosi, drioli, tesser]@pd.istc.cnr.it
ABSTRACT
INTERFACE is an integrated software environment, designed and
implemented in Matlab©, for building emotive/expressive talking
heads from motion-captured data. INTERFACE simplifies and
automates many of the operations needed for that purpose. A set of
processing tools, focused mainly on dynamic articulatory data
extracted by an automatic optotracking 3D movement analyzer, was
implemented in order to build up the animation engine, which is
based on the Cohen-Massaro coarticulation model, and to create the
WAV and FAP files needed for the animation. LUCIA, our animated
MPEG-4 talking face, can copy a real human by reproducing the
movements of markers positioned on the speaker's face and recorded
by an optoelectronic device, or can be driven directly by emotional
XML-tagged input text, thus realizing true audio-visual
emotive/expressive synthesis. LUCIA's voice is based on an Italian
version of the FESTIVAL-MBROLA packages, modified for expressive
synthesis by means of an appropriate APML/VSML tagged language.
Categories and Subject Descriptors
I.6.7 [SIMULATION AND MODELING]: Simulation Support
Systems
General Terms
Algorithms, Design, Human Factors, Standardization
Keywords
Talking Head, Facial Animation, Motion Capture, MPEG4.
1. INTRODUCTION
The transmission of emotions in speech communication is a topic
that has recently received considerable attention. Automatic
speech recognition (ASR) and multimodal or audio-visual (AV)
speech synthesis are examples of fields in which the processing of
emotions can have a great impact and can improve the
effectiveness and naturalness of man-machine interaction. In our
TTS (text-to-speech) framework, AV speech synthesis, that is, the
automatic generation of voice and facial animation from arbitrary
text, is based on parametric descriptions of both the acoustic and
visual speech modalities. The visual speech synthesis uses 3D
polygon models, which are parametrically articulated and deformed,
while the acoustic speech synthesis uses an Italian version of the
FESTIVAL diphone TTS synthesizer [1], now modified with
emotive/expressive capabilities.
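The overall flow of this framework can be sketched in Matlab as
follows; every function name here is a hypothetical placeholder for
illustration (only audiowrite is a standard Matlab function), not
the system's actual API:

    % Sketch of the AV synthesis pipeline described above
    % (hypothetical wrapper names; only audiowrite is standard Matlab).
    text = 'Buongiorno';                          % arbitrary input text
    [wav, fs, phones] = synthesize_speech(text);  % hypothetical FESTIVAL/MBROLA
                                                  % wrapper: audio plus phoneme timings
    faps = phones_to_faps(phones);                % hypothetical mapping from phoneme
                                                  % timings to frame-by-frame FAP values
    audiowrite('utterance.wav', wav, fs);         % audio track for the animation
    write_fap('utterance.fap', faps, 25);         % hypothetical FAP writer, 25 frames/s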
2. INTERFACE
INTERFACE, whose block diagram is given in Figure 1, is an
integrated software designed and implemented in Matlab© in
order to simplify and automates many of the operation needed for
building-up a talking head from motion-captured data.
INTERFACE was focused mainly on articulatory data collected by
ELITE, a fully automatic movement analyzer for 3D kinematics data
acquisition [2], but can easily be adapted to other motion-captured
data. ELITE provides 3D coordinate reconstruction, starting from 2D
perspective projections, by means of a stereophotogrammetric
procedure which allows free positioning of the TV cameras. The 3D
coordinates are then used to create our articulatory lip model and
to drive our talking face directly, copying human facial movements.
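To illustrate the underlying principle (a minimal sketch assuming
two calibrated cameras with known 3x4 projection matrices; ELITE's
actual stereophotogrammetric procedure is more elaborate), a
marker's 3D position can be recovered from its two 2D image
projections by linear triangulation:

    % triangulate_marker.m -- linear (DLT) triangulation of one marker.
    % P1, P2: 3x4 camera projection matrices; x1, x2: 2D image coordinates
    % of the same marker in the two views. Illustrative sketch only.
    function X = triangulate_marker(P1, x1, P2, x2)
        A = [x1(1)*P1(3,:) - P1(1,:);
             x1(2)*P1(3,:) - P1(2,:);
             x2(1)*P2(3,:) - P2(1,:);
             x2(2)*P2(3,:) - P2(2,:)];
        [~, ~, V] = svd(A);           % least-squares solution of A*X = 0
        X = V(1:3, end) / V(4, end);  % dehomogenize to get 3D coordinates
    end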
INTERFACE was created mainly to develop LUCIA [3], our MPEG-4 [4]
compatible graphic facial animation engine (FAE). In MPEG-4, FDPs
(Facial Definition Parameters) define the shape of the model, while
FAPs (Facial Animation Parameters) define the facial actions [5].
In our case, the model uses a pseudo-muscular approach, in which
muscle contractions are obtained through the deformation of the
polygonal mesh around feature points that correspond to skin muscle
attachments. A facial action sequence is generated by deforming the
face model, starting from its neutral state, according to the
specified FAP values, each indicating the magnitude of the
corresponding action at the corresponding time instant.
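A minimal sketch of this pseudo-muscular deformation is given
below; the linear distance falloff, the influence radius, and the
function name apply_fap are illustrative assumptions rather than
LUCIA's actual muscle model:

    % apply_fap.m -- displace mesh vertices around one feature point.
    % V: Nx3 neutral-face vertices; featPt: 1x3 feature point position;
    % dir: 1x3 unit displacement direction; fapVal: FAP magnitude;
    % radius: extent of the muscle's influence. Illustrative sketch only.
    function V = apply_fap(V, featPt, dir, fapVal, radius)
        d = sqrt(sum((V - featPt).^2, 2));  % distance of each vertex
                                            % from the feature point
        w = max(0, 1 - d/radius);           % linear falloff inside radius
        V = V + (fapVal * w) .* dir;        % weighted displacement along dir
    end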
For a complete description of all the features and characteristics
of INTERFACE, a fully detailed PDF manual is in preparation and is
available at the official LUCIA web site:
http://www.pd.istc.cnr.it/LUCIA/Docs/InterFace-AISV2004.pdf
INTERFACE handles four types of input data, from which the
corresponding MPEG-4 compliant FAP-stream can be created:
(A) Articulatory data, represented by the marker trajectories
captured by ELITE; these data are processed by four programs:
• “Track”, which defines the pattern used for
acquisition and implements a new 3D trajectory
reconstruction procedure;