EXPRESSIVE HUMANOID ROBOT FOR AUTOMATIC ACCOMPANIMENT

Guangyu Xia 1, Mao Kawai 2, Kei Matsuki 2, Mutian Fu 1, Sarah Cosentino 2, Gabriele Trovato 2, Roger Dannenberg 1, Salvatore Sessa 2, Atsuo Takanishi 2

1 Carnegie Mellon University, {gxia, mutianf, rbd}@andrew.cmu.edu
2 Waseda University, contact@takanishi.mech.waseda.ac.jp

ABSTRACT

We present a music-robotic system capable of performing an accompaniment for a musician and reacting to the human performance with gestural and facial expression in real time. This work can be seen as a marriage of social robotics and computer accompaniment systems, aimed at creating more musical, interactive, and engaging performances between humans and machines. We also conduct subjective evaluations with audiences to validate the joint effects of robot expression and automatic accompaniment. Our results show that robot embodiment and expression significantly improve subjective ratings of automatic accompaniment. Counterintuitively, no such improvement appears when the machine performs a fixed sequence and the human musician simply follows it. To our knowledge, this is the first interactive music performance between a human musician and a humanoid music robot with a systematic subjective evaluation.

1. INTRODUCTION

In order to create more musical, interactive, and engaging performances between humans and machines, we contribute the first automatic accompaniment system that reacts to a human performance with humanoid robot expression (as shown in Figure 1). This study bridges two existing fields: social robotics and automatic accompaniment.

Figure 1. The robotic automatic accompaniment system.

On one hand, score following and automatic accompaniment systems (often referred to simply as automatic accompaniment) have been developed over the past 30 years to serve as virtual musicians capable of performing music with humans. Given a performance reference (usually a score representation), these systems take a human performance as input, match the input to the reference, and output the accompaniment, adjusting its tempo in real time (a minimal sketch of this loop appears below). The first systems, invented in 1984 [1][2], used simple models to anticipate the tempo of a monophonic input. Since then, many studies have extended these models to achieve more expressive music interaction, including polyphonic [3] and embellished [4] input recognition, smooth tempo adjustment [5][6], and even expressive reaction to musical nuance [7]. While most efforts have focused on the systems' auditory aspects, two major issues of automatic accompaniment remain unexplored. First, no model has considered the virtual musician's gestural and facial expression, despite the fact that visual cues are also an important part of music interaction [8][9]. Second, no subjective evaluation has been conducted to validate that automatic accompaniment is a better solution than fixed media for human-computer music performance.

On the other hand, social robots have been developed to interact with humans or other agents following certain rules of social behavior. Many studies have shown that robot expression, especially humanoid expression, significantly increases engagement and interaction between humans and computer programs in settings such as telecommunication [10] and dialogue systems [11]. However, music interaction, a form of high-level social communication, has received little attention in this context. Though several music robots have been developed, none are yet able to react to other musicians with human-like expression.
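To make this matching-and-scheduling loop concrete, the following minimal Python sketch tracks a soloist's tempo from a stream of matched notes. It assumes a symbolic (MIDI-like) input, a greedy pitch matcher, and linear tempo smoothing; the class name TempoFollower and all parameters are illustrative assumptions, and the cited systems [1]-[7] use considerably more sophisticated matching and scheduling models.

```python
class TempoFollower:
    """Sketch of the accompaniment loop described above: match each
    incoming note to a score position, re-estimate the soloist's tempo,
    and let the accompaniment reschedule itself accordingly."""

    def __init__(self, score, base_tempo=120.0, smoothing=0.8):
        self.score = score          # list of (pitch, beat) pairs, in order
        self.pos = 0                # index of the next expected score note
        self.tempo = base_tempo     # current tempo estimate (BPM)
        self.smoothing = smoothing  # 0..1; higher = smoother tempo changes
        self.last = None            # (beat, wall-clock time) of last match

    def on_note(self, pitch, now):
        """Handle one performed note; return the updated tempo in BPM."""
        # Greedy score following: look a few notes ahead so that a missed
        # or embellished note does not derail the match.
        for i in range(self.pos, min(self.pos + 3, len(self.score))):
            if self.score[i][0] == pitch:
                beat = self.score[i][1]
                if self.last and beat > self.last[0]:
                    observed = 60.0 * (beat - self.last[0]) / (now - self.last[1])
                    # Smooth the estimate so the accompaniment does not jitter.
                    self.tempo = (self.smoothing * self.tempo
                                  + (1.0 - self.smoothing) * observed)
                self.last = (beat, now)
                self.pos = i + 1
                break
        return self.tempo


# The soloist plays one beat in 0.4 s (150 BPM); the smoothed estimate
# moves from 120 toward 150 BPM.
follower = TempoFollower(score=[(60, 0.0), (62, 1.0), (64, 2.0)])
follower.on_note(60, now=10.0)
print(round(follower.on_note(62, now=10.4), 1))  # -> 126.0
```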
It is clear that automatic accompaniment and social robotics can complement each other. We therefore integrated the saxophonist robot developed at Waseda University into an existing automatic accompaniment framework. Specifically, the system takes a human musician's MIDI flute performance as input and outputs acoustic accompaniment together with gestural and facial expression. The (larger scale) gestural expression reacts to musical phrases, while the (smaller scale) facial expression reacts to local tempo changes. Of course, this first integration does not cover every aspect of gestural and facial expression; the current solution considers body and eyebrow movements, and we believe that other aspects of expression can be handled in a similar way.
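As a rough sketch of this two-scale mapping, the following Python function emits a body-gesture command at phrase boundaries and an eyebrow command on local tempo changes. The command vocabulary, the 5 BPM threshold, and the function name expression_commands are hypothetical assumptions, not the robot's actual control interface.

```python
def expression_commands(beat, tempo_now, tempo_prev,
                        phrase_starts, tempo_threshold=5.0):
    """Map the tracked performance onto the two expression scales
    described above. The command names and the 5 BPM threshold are
    illustrative assumptions, not the robot's real interface."""
    commands = []
    # Larger scale: trigger a body gesture at the start of a phrase.
    if beat in phrase_starts:
        commands.append(("body", "phrase_gesture"))
    # Smaller scale: move the eyebrows when the local tempo shifts.
    delta = tempo_now - tempo_prev
    if abs(delta) > tempo_threshold:
        commands.append(("eyebrows", "raise" if delta > 0 else "lower"))
    return commands


# A new phrase begins while the soloist speeds up by 6 BPM:
print(expression_commands(beat=16.0, tempo_now=126.0, tempo_prev=120.0,
                          phrase_starts={0.0, 16.0, 32.0}))
# -> [('body', 'phrase_gesture'), ('eyebrows', 'raise')]
```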