TRANSLINGUAL VISUAL SPEECH SYNTHESIS

Tanveer A. Faruquie, Chalapathy Neti+, Nitendra Rajput, L. Venkata Subramaniam, Ashish Verma

IBM India Research Lab, New Delhi 110016, India
+IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA

ABSTRACT

Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language of the audio to obtain the time alignment for the phonetic sequence of the audio signal. However, building a speech recognition system is data intensive and is a tedious and time-consuming task. We present a novel scheme to implement a language-independent system for audio-driven facial animation given a speech recognition system for just one language, in our case English. The method presented here can also be used for text-to-audio-visual speech synthesis.

1. INTRODUCTION

Providing a natural and friendly interface is very important in human-computer interaction. Speech recognition and computer lip-reading have been developed as means of inputting information for interaction with the computer. It is equally important to provide a natural and friendly means for the computer to render information. Interpersonal communication, human-computer interaction, telework, teleducation, multimedia telephones, animation and various other multimedia approaches to communication motivate the design of realistic facial animators. Such systems represent a means for simplifying, enhancing and in many cases completely changing the current paradigms of interpersonal and human-computer communication.

A talking face, with lip movements in synchronization with the spoken words and sentences, greatly enhances communication. Many methods have been presented to animate the face in sync with the audio [1][2][3].
These methods rely on a viseme-based alignment generated from the incoming audio, where visemes are distinct, distinguishable lip shapes. For this, a speech recognition system is used to generate the phonetic alignment from the incoming audio. Phonetic alignment refers to the time duration of and the transition times between phonemes in an audio sequence [4]. A phoneme-to-viseme mapping then generates the visemic alignment from the phonetic alignment.

Techniques exist for synthesizing speech given text as input. These text-to-speech synthesizers work by producing a phonetic alignment of the text to be pronounced and then generating smooth transitions between adjacent phones to obtain the desired sentence [6]. Using a phoneme-to-viseme mapping and text-to-speech synthesis, a text-to-video synthesizer can be built. In the audio-driven animation case, the phonetic alignment is generated from the audio with the help of the true word string representing the spoken sentence. Thus facial animation can be driven by text or audio, depending on the needs of the application.

Audio-driven facial animation requires training of a speech recognition system, which is used to generate alignments from the input speech. Once the phonetic alignment is generated, the mapping and the animation have hardly any language dependency in them. Translingual visual speech synthesis can be achieved if the first step, alignment generation, can be made language independent. In this paper we present a method for translingual visual speech synthesis; that is, given a speech recognition system for one language, we describe a method of synthesizing video with speech of any other language.

In Section 2 we describe a general visual speech synthesis module. In Section 3 we describe the main idea of this paper, the translingual visual speech synthesis system.
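The conversion of a phonetic alignment into a visemic alignment described above can be sketched as follows. This is a minimal illustration, not the paper's actual mapping: the phoneme-to-viseme table and the (phoneme, start, end) alignment format are assumptions for the example; a real system would cover the language's full phone set.

```python
# Sketch: map a phonetic alignment (phoneme, start, end) to a visemic
# alignment, merging consecutive segments that share the same viseme.
# The table below is an illustrative assumption; real systems define
# one entry per phone in the recognizer's phone set.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "iy": "spread", "uw": "rounded",
    "sil": "neutral",
}

def visemic_alignment(phonetic_alignment):
    """Convert (phoneme, start, end) triples into (viseme, start, end)
    triples, merging adjacent segments with identical visemes."""
    out = []
    for phone, start, end in phonetic_alignment:
        viseme = PHONEME_TO_VISEME.get(phone, "neutral")
        if out and out[-1][0] == viseme:
            # Same lip shape continues: extend the previous segment.
            out[-1] = (viseme, out[-1][1], end)
        else:
            out.append((viseme, start, end))
    return out

alignment = [("sil", 0.00, 0.10), ("b", 0.10, 0.18),
             ("aa", 0.18, 0.35), ("m", 0.35, 0.45)]
print(visemic_alignment(alignment))
# [('neutral', 0.0, 0.1), ('bilabial', 0.1, 0.18), ('open', 0.18, 0.35), ('bilabial', 0.35, 0.45)]
```

Merging adjacent identical visemes matters because several phonemes share one lip shape (e.g. /p/, /b/, /m/), so the visemic alignment is typically coarser than the phonetic one.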
We describe in detail the method used to adapt the speech recognition system of one language to generate phonetic alignments in a new language, and present the specific case of adding Hindi words to an English speech recognition system. In Section 4 we present, in block diagram form, the modifications required to build a translingual visual speech synthesis system. Finally, conclusions are presented in Section 5.

2. VISUAL SPEECH SYNTHESIS

The visual speech synthesis module is shown in Figure 1. From an incoming audio stream, timing information and phoneme transitions are extracted; this constitutes the phonetic alignment. The phonemes are mapped to the corresponding visemes. This in turn gives the viseme transitions and timings called