A virtual-agent head driven by musical performance

Maurizio Mancini, Roberto Bresin and Catherine Pelachaud

Abstract— In this paper we present a system in which visual feedback of an acoustic source is given to the user through a graphical representation of an expressive virtual head. The system also incorporates the notion of expressivity of human behavior. We provide several mappings. On the input side, we have elaborated a mapping between values of acoustic cues and emotion as well as expressivity parameters. On the output side, we propose a mapping between these parameters and the behaviors of the virtual head. These mappings ensure coherence between the acoustic source and the animation of the virtual head. After presenting some background information on human expressivity, we introduce our model of expressivity. We explain how we have elaborated the mappings between the acoustic and the behavior cues. We then describe the implementation of a working system that controls the behavior of a human-like head, which varies depending on the emotional and acoustic characteristics of the musical performance. Finally, we present the tests we conducted to validate our model of mapping between music performance emotions and expressivity parameters.

Index Terms— acoustic cues, music, emotion, virtual agent, expressivity

I. INTRODUCTION

What happens when it is a computer that listens to the music? In HCI applications, affective communication plays an increasingly important role. It would be helpful if systems could express what they perceive and communicate it to the human user through visual and acoustic feedback.

Listening to music is an everyday experience. But why do we do it? For example, one might listen to music to tune one's own mood. Research results show that we are not only able to recognize different emotional intentions used by musicians or speakers [1], but that we also feel these emotions. It has been found that when listening to music, people experience a change in bio-physical cues (such as blood pressure). This change may correspond either to the feeling of the emotion arising from listening to the music or to the recognition of the emotion evoked by the music [2].

Virtual agents with a human-like appearance and communication capabilities are being used in an increasing number of applications for their ability to convey complex information through verbal and nonverbal behaviors like voice, intonation, gaze, gesture, facial expressions, etc. Their capabilities are useful when acting as a presenter on the web [3], a pedagogical agent in tutoring systems [4], a talking head helping hearing-impaired people to "listen" to a telephone call by lipreading [5], a companion in interactive settings in public places such as museums [6], [7], or even a character in virtual storytelling systems [8]. The expressivity of behaviors, that is, the way behaviors are executed, is also an integral part of the communication process, as it can provide information on the state of an agent, such as current emotional state, mood, and personality [9].

In our work we implemented a system that gives the user visual feedback by moving and modifying the expression of a virtual agent's head. The agent's behavior (movement plus expression) explicitly visualizes the emotional intentions of the musical performance. It is meant to show the direct connection between body motion and expressivity in music performance [10].

In the next section we present the state of the art.
Then we give some background information on expressivity in human behavior, voice and music performance. Perceptual tests on the expressivity of body and vocal cues are also reported. Then, in Section V, we introduce our real-time application for visual feedback of musical performance and provide information on the mapping between acoustic cues and animation parameters. In Section VI we describe the tests we conducted to validate our model of mapping between music performance emotions and expressivity parameters. Finally, we conclude the paper.

II. STATE OF THE ART

Some previous works [11]–[13] have addressed the generation of synthetic human behavior depending on music (or sound) input. The works by Lee et al. [14] and by Cardle et al. [12] mainly focus on adapting pre-computed animations, such as walking or dancing, to a given music input. These systems analyze the music and extract parameters such as tempo; based on the values of the extracted parameters, the rhythm of the animation is changed.

In the works by Cornwell et al. [15] and by Downie and Lefford [13], interactions between agents are modulated by music and sound. The emotive content of the acoustic source is positively correlated with the quality of the interaction between agents: for example, a group of agents will tend to collaborate more when listening to a happy and positive piece of music [15]. Downie and Lefford [13] also underline that music can help give life to inanimate objects, increasing their credibility. Taylor et al. [16] developed a system that allows a user to adapt the way she plays a musical instrument to the reactions of a virtual character: the user has to vary her performance to make the virtual character react in some desired way.

Our work is most similar to that of DiPaola et al. [11]. The authors emphasize that affective information can be delivered through several media (music, facial expression, body movement, etc.) by translating the original message into the language used by each medium. So if music is the input medium and facial expression is the output medium, the system processes the information coming from the music and translates it into facial expressions and head movements. As in [11], we consider the translation, that is the mapping, of cues from one medium to another to be of central relevance.
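To make the idea of such a cross-medium mapping concrete, the following minimal sketch shows one possible two-stage structure: acoustic cues are first mapped to expressivity parameters, which are then mapped to head-animation parameters. The cue names, expressivity dimensions, weights and animation parameters below are hypothetical placeholders chosen purely for illustration; they are not the mappings actually used in our system, which are described in Section V.

```python
# Illustrative two-stage mapping: acoustic cues -> expressivity parameters
# -> head-animation parameters. All names, ranges and coefficients are
# hypothetical and serve only to show the overall structure.

from dataclasses import dataclass


@dataclass
class AcousticCues:
    tempo: float         # beats per minute
    sound_level: float    # normalized loudness, 0..1
    articulation: float   # 0 = legato, 1 = staccato


@dataclass
class Expressivity:
    spatial_extent: float    # amplitude of movement, 0..1
    temporal_extent: float   # speed of movement, 0..1
    fluidity: float          # smoothness of movement, 0..1


def cues_to_expressivity(cues: AcousticCues) -> Expressivity:
    """First stage: map acoustic cues to expressivity parameters (toy weights)."""
    return Expressivity(
        spatial_extent=min(1.0, cues.sound_level * 1.2),
        temporal_extent=min(1.0, cues.tempo / 200.0),
        fluidity=1.0 - cues.articulation,
    )


def expressivity_to_head(expr: Expressivity) -> dict:
    """Second stage: map expressivity parameters to head-animation parameters."""
    return {
        "head_rotation_amplitude": expr.spatial_extent,
        "movement_speed": expr.temporal_extent,
        "motion_smoothing": expr.fluidity,
    }


if __name__ == "__main__":
    # Example: a loud, fast, staccato performance yields large, quick, jerky motion.
    cues = AcousticCues(tempo=160.0, sound_level=0.8, articulation=0.7)
    print(expressivity_to_head(cues_to_expressivity(cues)))
```

The point of the sketch is only the factorization into two successive mappings, which allows the acoustic analysis and the animation control to evolve independently as long as they agree on the intermediate expressivity representation.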