Text2Video: Text-Driven Facial Animation using MPEG-4

J. Rurainsky and P. Eisert
Fraunhofer Institute for Telecommunications - Heinrich-Hertz Institute
Image Processing Department, D-10587 Berlin, Germany
rurainsky@hhi.fraunhofer.de, eisert@hhi.fraunhofer.de

ABSTRACT

We present a complete system for the automatic creation of talking-head video sequences from text messages. Our system converts the text into MPEG-4 Facial Animation Parameters and a synthetic voice. A user-selected 3D character performs lip movements synchronized to the speech data. The 3D models, each created from a single image, range from realistic persons to cartoon characters. Voice selection for different languages and genders, together with a pitch-shift component, enables personalization of the animation. The animation can be shown on displays and devices ranging from 3GPP players on mobile phones to real-time 3D render engines. Our system can therefore be used in mobile communication to convert regular SMS messages into MMS animations.

Keywords: MPEG-4, Facial Animation, Text-Driven Animation, SMS, MMS

1. INTRODUCTION

Interhuman and human-machine communication are two of the major challenges of this century. Video communication between people is becoming increasingly desirable as available connectivity grows rapidly. The latest video compression techniques, such as H.264/AVC [1] or Windows Media 10, can substantially reduce the bit-rate of video data and enable communication over a wide variety of channels. For even lower bandwidths, MPEG-4 standardized a communication system with 3D character models that are animated according to a set of Facial Animation Parameters (FAPs) [2]. These parameters describe the motion and facial expressions of a person and can be encoded efficiently. Model-based video codecs built on MPEG-4 FAPs enable video communication at a few kbps [3, 4].

However, talking-head videos can also be created without sending any information from a camera. Since text, speech, and the lip movements of the speaker are highly correlated, an artificial video can be synthesized purely from the text. We have developed a scheme which allows the user to communicate by means of video messages created from the transmitted text. A Text-To-Speech (TTS) engine converts the message into a speech signal for different languages and provides markup information such as phonemes, phoneme durations, and stress levels for each phoneme. This side information is used to estimate MPEG-4 Facial Animation Parameters, which are applied to the 3D head model. Rendering the head model leads to a realistic facial animation synchronized with the speech signal [5-7]. A similar system extracts the MPEG-4 FAPs at the receiver [8]. Realistic voices for TTS engines require a large set of speech samples, which have to be stored locally. Using a TTS engine on devices like PDAs and cellular phones therefore requires either more memory than is usually provided or the acceptance of a lower-quality synthetic voice. Human-machine interfaces that combine a TTS engine with facial animation have been developed as well [9, 10].

In the following sections, we describe how our system requests input data and user selections in order to individualize the video clips for different applications. Further, we describe the creation of the facial animation parameters and the rendering on different devices. We also include a section on transport over LAN and WLAN connections and explain the different display interfaces.
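To make the step from TTS markup to animation parameters concrete, the following minimal Python sketch (not part of the original system) converts a hypothetical phoneme track with per-phoneme durations, as a TTS engine might report them, into one MPEG-4 viseme index (FAP 1) per video frame. The Phone structure, the phoneme symbols, and the frame-sampling strategy are illustrative assumptions; only the viseme grouping follows the MPEG-4 viseme table (e.g. viseme 1 covers the bilabials p, b, m).

    # Illustrative sketch, not the paper's implementation: map a TTS
    # phoneme track to one MPEG-4 viseme index (FAP 1) per video frame.
    from dataclasses import dataclass

    # Partial phoneme-to-viseme table following the MPEG-4 grouping;
    # a full system would cover the complete phoneme alphabet.
    PHONEME_TO_VISEME = {
        "p": 1, "b": 1, "m": 1,     # bilabials
        "f": 2, "v": 2,             # labiodentals
        "t": 4, "d": 4,             # alveolar stops
        "k": 5, "g": 5,             # velar stops
        "s": 7, "z": 7,             # alveolar fricatives
        "a": 10, "e": 11, "i": 12,  # vowels (A:, e, I)
        "sil": 0,                   # silence maps to the neutral face
    }

    @dataclass
    class Phone:
        symbol: str        # phoneme symbol from the TTS markup
        duration_ms: int   # phoneme duration in milliseconds

    def visemes_per_frame(phones, frame_ms=40):
        """Sample the phoneme track at the video frame rate (40 ms
        frames = 25 fps) and return one viseme index per frame."""
        frames, t_end = [], 0
        for ph in phones:
            t_end += ph.duration_ms  # cumulative end time of this phoneme
            viseme = PHONEME_TO_VISEME.get(ph.symbol, 0)
            # Emit every frame that ends within this phoneme's interval.
            while (len(frames) + 1) * frame_ms <= t_end:
                frames.append(viseme)
        return frames

    # Example: the word "map" followed by a short silence.
    track = [Phone("m", 80), Phone("a", 160),
             Phone("p", 100), Phone("sil", 100)]
    print(visemes_per_frame(track))
    # -> [1, 1, 10, 10, 10, 10, 1, 1, 0, 0, 0]

A complete system would go further: it would smooth the transitions between neighboring visemes to model coarticulation, and use the per-phoneme stress levels from the TTS markup to modulate the amplitude of the resulting facial animation parameters.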