EVALUATION OF HMM-BASED VISUAL LAUGHTER SYNTHESIS

Hüseyin Çakmak*, Jérôme Urbain, Joëlle Tilmanne, Thierry Dutoit

TCTS Lab - University of Mons, Belgium

* H. Çakmak receives a Ph.D. grant from the Fonds de la Recherche pour l'Industrie et l'Agriculture (F.R.I.A.), Belgium. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 270780.

ABSTRACT

In this paper we apply speaker-dependent training of Hidden Markov Models (HMMs) to audio and visual laughter synthesis separately. The two modalities are synthesized with a forced-duration approach and are then combined to render audio-visual laughter on a 3D avatar. This paper focuses on the visual synthesis of laughter and its perceptual evaluation when combined with synthesized audio laughter. HMM-based synthesis has previously been applied successfully to speech for both the audio and visual modalities, and its extrapolation to audio laughter synthesis has already been achieved. This paper shows that it is possible to extrapolate to visual laughter synthesis as well.

Index Terms: Audio, visual, laughter, synthesis, HMM

1. INTRODUCTION

Among the features of human interaction, laughter is one of the most significant. It is a way to express our emotions and may even serve as an answer in some interactions. In recent decades, with the development of human-machine interaction and progress in speech processing, laughter has become a signal that machines should be able to detect, analyze and produce. This work focuses on laughter production, and more specifically on visual laughter production. Acoustic laughter synthesis using Hidden Markov Models (HMMs) has already been addressed in previous work, which constitutes the state of the art and served as the basis for the acoustic synthesis presented here [1].

The goal of audio-visual laughter synthesis is to generate an audio waveform of laughter as well as its corresponding facial animation sequence. This work follows a separate modeling approach for the two modalities.

Visual laughter synthesis systems are rare. DiLorenzo et al. [2] proposed a parametric physical chest model which could be animated from laughter audio signals; face animation was not part of that work. Cosker et al. [3] studied the possible mapping between facial expressions and their related audio signals for non-speech articulations, including laughter. The authors used HMMs to model the audio-visual correlation. As with DiLorenzo et al., the animation is audio-driven. More recent studies [4, 5] include the animation of laughter-capable avatars in human-machine interaction. The proposed solutions include two different avatars animated from recorded data. One (Greta Realizer) is controlled either through high-level commands using the Facial Action Coding System (FACS) or through low-level commands using the Facial Animation Parameters (FAPs) of the MPEG-4 standard for facial animation. The other avatar (Living Actor) plays a set of manually drawn animations.

In contrast with these works, and following up on our previous work on acoustic laughter synthesis, we investigated the extrapolation of HMM-based synthesis to visual laughter. The approach followed in the present work is to model facial expressions by means of facial landmark trajectories. First, a 3D facial motion database was recorded using the OptiTrack(1) motion capture system. Then this data was modeled using an HMM-based approach, as illustrated by the sketch below.
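As an illustration of this kind of trajectory modeling, the following minimal sketch trains a Gaussian HMM on landmark trajectories and samples a new trajectory from it, using the open-source hmmlearn package. The data, feature dimensions and model topology are assumptions made for the example; the actual system described in this paper relies on forced state durations and speaker-dependent training, which are not reproduced here.

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Hypothetical training set: one array per laughter utterance,
    # shaped (frames, features), where the features would be the
    # stacked 3D marker coordinates captured at the mocap frame rate.
    trajectories = [rng.standard_normal((120, 30)),
                    rng.standard_normal((90, 30))]

    X = np.concatenate(trajectories)           # all frames, stacked
    lengths = [len(t) for t in trajectories]   # per-utterance frame counts

    # A 5-state Gaussian HMM with diagonal covariances; the state count
    # and covariance structure are illustrative choices only.
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag",
                            n_iter=50, random_state=0)
    model.fit(X, lengths)

    # "Synthesis": sample a new 100-frame landmark trajectory.
    synth_traj, states = model.sample(100)
    print(synth_traj.shape)                    # (100, 30)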
The synthesized trajectories were then retargeted to a 3D model in the MotionBuilder software, where the animation was rendered. Results were evaluated through an online Mean Opinion Score (MOS) test in which users were asked to rate the overall quality, human-likeness and spontaneity of each of the 27 videos presented in the evaluation (a sketch of how such ratings are typically aggregated appears at the end of this section).

The paper is organized as follows: Section 2 gives an overview of the database built for the purpose of this work, Section 3 explains the laughter synthesis method, Section 4 describes the evaluation and its results, and Section 5 concludes and gives an overview of future work.

2. THE AV-LASYN DATABASE

The AV-LASYN database is a synchronous audio-visual laughter database designed for laughter synthesis. The corpus contains data from one male subject and consists of 251 laughter utterances. Professional audio equipment and a marker-based motion capture system (OptiTrack) were used for the audio and facial expression recordings, respectively. Figure 1 gives an overview of the recording pipeline.

The database contains laughter-segmented WAV audio

(1) http://www.naturalpoint.com/optitrack
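As referenced in the evaluation paragraph above, ratings from a MOS test of this kind are commonly aggregated into a per-stimulus mean with a confidence interval. The sketch below shows one way to do this with SciPy; the rating values are hypothetical placeholders, not the paper's results, which are reported in Section 4.

    import numpy as np
    from scipy import stats

    # Hypothetical ratings: video id -> 1-5 scores from participants
    # (illustrative values only).
    ratings = {
        "video_01": [4, 3, 5, 4, 4, 3],
        "video_02": [2, 3, 3, 2, 4, 3],
    }

    for video, scores in ratings.items():
        scores = np.asarray(scores, dtype=float)
        mos = scores.mean()
        # 95% confidence interval on the mean (Student's t)
        lo, hi = stats.t.interval(0.95, df=len(scores) - 1,
                                  loc=mos, scale=stats.sem(scores))
        print(f"{video}: MOS = {mos:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")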