User Interactive MPEG-4 Compatible Facial Animation System

M. Escher, T. Goto, S. Kshirsagar, C. Zanardi, N. Magnenat Thalmann
MIRALab/CUI, University of Geneva
24 rue du General Dufour, CH-1211 Geneva 4, Switzerland
Tel: +41 22 705 7769  Fax: +41 22 705 7780
[escher,goto,sumedha,thalmann,zanardi]@cui.unige.ch

ABSTRACT

This paper describes the different processes, and their interactions, needed to generate a virtual environment inhabited by clones representing real people and by virtual autonomous actors. This requires communication between a cloned face (or avatar) and a virtual face, which in turn relies on cloning and mimicking to reconstruct the 3D model and the movements of the real face. The autonomous virtual face is able to respond and interact through facial expressions and speech. Several main processes are necessary to reach this goal. The processing of the input data is crucial, since it represents the only interaction of the user with the virtual world and the autonomous actor. We have implemented the processing of the two basic media used in a dialog: speech and facial expressions. We also discuss the implementation of the emotionally autonomous actor. Finally, we describe the real-time facial animation system. The whole system is based on the MPEG-4 definition of FAPs, visemes and expressions.

1. SYSTEM OVERVIEW

Figure 1 sketches the different tasks and interactions needed to generate a real-time virtual dialog between a synthetic clone and an autonomous actor [1]. The video and the speech of the user drive the facial animation of the clone, while the autonomous actor uses the speech and the facial emotions of the user to generate an automatic behavioral response. MPEG-4 Facial Animation Parameters (FAPs) are extracted in real time from the video input of the face. These FAPs can either be used to animate the cloned face or be processed to compute high-level emotions that are transmitted to the autonomous actor. Phonemes and text are extracted from the user's speech. The phonemes are blended with the FAPs from the video to enhance the animation of the clone, and the text is sent to the autonomous actor and processed together with the emotions to generate a coherent answer to the user. All facial animation parameters are compliant with the MPEG-4 definition [2].

2. REAL-TIME ACQUISITION OF THE USER DATA

2.1 Facial Parameters

Tracking face features is a key issue in facial animation [3,4,5], but tracking them in real time raises many problems. One important problem lies in the variety of individual appearances, such as skin color, eye color, beards, glasses, and so on. The facial features are sometimes not separated by sharp edges, or edges appear in unusual places, which makes it difficult to recognize faces and track facial features. In this application, the problem is solved by setting the facial features and the associated information during an initialization step, shown in Figure 2(a). Once the user sets the feature positions, the information around the features is extracted automatically. This information makes it possible to track any face and the corresponding facial features without any marker. The tracking process is separated into two main parts: mouth tracking and eye tracking. For the tracking, edge information and gray-level information around the mouth and the eyes are used. Figure 2(b)-(c) shows the result of tracking the features.
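To illustrate the initialization and tracking steps described above, the following sketch stores a gray-level patch around each feature point set by the user and relocates it in subsequent frames by normalized template matching inside a small search window. This is a minimal sketch only: the paper does not publish code, and the function names, window sizes and the use of OpenCV are assumptions, not the authors' implementation (which also exploits edge information around the mouth and eyes).

    # Illustrative sketch of marker-less feature tracking by template matching.
    # Names and parameters are hypothetical; they are not taken from the paper.
    import cv2

    TEMPLATE_SIZE = 16   # half-size of the patch stored at initialization (pixels)
    SEARCH_RADIUS = 24   # half-size of the search window in the next frame (pixels)

    def init_templates(gray_frame, feature_points):
        """Store a gray-level patch around each feature point set by the user."""
        templates = []
        for (x, y) in feature_points:
            patch = gray_frame[y - TEMPLATE_SIZE:y + TEMPLATE_SIZE,
                               x - TEMPLATE_SIZE:x + TEMPLATE_SIZE].copy()
            templates.append(patch)
        return templates

    def track_features(gray_frame, prev_points, templates):
        """Relocate each feature in a search window around its previous position."""
        new_points = []
        for (x, y), tmpl in zip(prev_points, templates):
            x0, y0 = x - SEARCH_RADIUS, y - SEARCH_RADIUS
            window = gray_frame[y0:y + SEARCH_RADIUS, x0:x + SEARCH_RADIUS]
            # Normalized cross-correlation is robust to global lighting changes.
            score = cv2.matchTemplate(window, tmpl, cv2.TM_CCOEFF_NORMED)
            _, _, _, (bx, by) = cv2.minMaxLoc(score)
            new_points.append((x0 + bx + TEMPLATE_SIZE, y0 + by + TEMPLATE_SIZE))
        return new_points

In practice, init_templates would be called once on the frame of Figure 2(a) with the user-selected feature positions, and track_features would then run on every incoming video frame; boundary checking near the image borders is omitted for brevity.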
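The tracked 2D positions must then be converted into MPEG-4 FAP values before they can drive the clone or be interpreted as emotions. The paper does not detail this conversion, so the fragment below is only a hypothetical example for a single parameter, assuming a neutral-face calibration frame. It normalizes the vertical lower-lip displacement by the MNS FAPU (mouth-nose separation of the neutral face divided by 1024), following the MPEG-4 convention of expressing FAPs in FAPUs; FAP 3 (open_jaw) is used purely to illustrate the encoding.

    # Hypothetical conversion of a tracked feature displacement into an MPEG-4 FAP.
    # FAPs are expressed in Facial Animation Parameter Units (FAPUs) measured on
    # the neutral face; mns0 is the mouth-nose separation of the neutral face.

    def open_jaw_fap(neutral_lower_lip_y, tracked_lower_lip_y, mns0):
        """FAP 3 (open_jaw): downward lower-lip displacement in MNS units."""
        mns = mns0 / 1024.0                    # MNS FAPU as defined by MPEG-4
        displacement = tracked_lower_lip_y - neutral_lower_lip_y
        return int(round(displacement / mns))  # FAP value passed to the animation engine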