EXPRESSIVE HUMANOID ROBOT FOR
AUTOMATIC ACCOMPANIMENT
Guangyu Xia¹, Mao Kawai², Kei Matsuki², Mutian Fu¹, Sarah Cosentino²,
Gabriele Trovato², Roger Dannenberg¹, Salvatore Sessa², Atsuo Takanishi²
¹Carnegie Mellon University, ²Waseda University
¹{gxia, mutianf, rbd}@andrew.cmu.edu
²contact@takanishi.mech.waseda.ac.jp
ABSTRACT
We present a music-robotic system capable of performing an accompaniment for a musician and reacting to the human performance with gestural and facial expression in real time. This work can be seen as a marriage between social robotics and computer accompaniment systems, aimed at creating more musical, interactive, and engaging performances between humans and machines. We also conducted subjective evaluations with audiences to validate the joint effects of robot expression and automatic accompaniment. Our results show that robot embodiment and expression significantly improve subjective ratings of automatic accompaniment. Counterintuitively, no such improvement appears when the machine performs a fixed sequence and the human musician simply follows it. To the best of our knowledge, this is the first interactive music performance between a human musician and a humanoid music robot to receive a systematic subjective evaluation.
1. INTRODUCTION
In order to create more musical, interactive, and engaging performances between humans and machines, we contribute the first automatic accompaniment system that reacts to human performance with humanoid robot expression (as shown in Figure 1). This study bridges two existing fields: social robotics and automatic accompaniment.
Figure 1. The robotic automatic accompaniment system.
On one hand, score following and automatic accompaniment systems (often referred to simply as automatic accompaniment) have been developed over the past 30 years to serve as virtual musicians capable of performing music with humans. Given a performance reference (usually a score representation), these systems take a human performance as input, match the input to the reference, and output the accompaniment while adjusting its tempo in real time. The first systems, introduced in 1984 [1][2], used simple models to anticipate the tempo of a monophonic input. Since then, many studies have extended the model to achieve more expressive music interactions. These extensions include polyphonic [3] and embellished [4] input recognition, smooth tempo adjustment [5][6], and even expressive reaction to musical nuance [7]. While most efforts have focused on the systems' auditory aspects, two major issues of automatic accompaniment remain unexplored. First, no model has considered the virtual musician's gestural and facial expressions, despite the fact that visual cues also serve as an important part of music interaction [8][9]. Second, no subjective evaluation has been conducted to validate that automatic accompaniment is a better solution than fixed media for human-computer music performance.
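The accompaniment loop described above (match input to a score reference, then adjust tempo) can be sketched minimally as follows. This is an illustrative sketch, not the implementation in any of the cited systems: the greedy window matcher and the least-squares tempo estimate are simplifying assumptions, and real systems use more robust matching (e.g. dynamic programming over polyphonic input).

```python
from typing import List, Optional, Tuple

def match_note(score: List[int], pos: int, pitch: int,
               window: int = 3) -> Optional[int]:
    """Greedy score following: look for the performed pitch within a
    small window ahead of the current score position; return the
    matched score index, or None if the note looks like an error."""
    for i in range(pos, min(pos + window, len(score))):
        if score[i] == pitch:
            return i
    return None

def estimate_tempo(pairs: List[Tuple[float, float]]) -> float:
    """Least-squares slope of performance time (s) vs. score beat over
    recently matched (beat, time) pairs, i.e. seconds per beat."""
    n = len(pairs)
    mb = sum(b for b, _ in pairs) / n
    mt = sum(t for _, t in pairs) / n
    num = sum((b - mb) * (t - mt) for b, t in pairs)
    den = sum((b - mb) ** 2 for b, _ in pairs)
    return num / den if den else 0.5  # fall back to 120 BPM

# Toy usage: the performer plays beats 0..3 at a steady 0.6 s/beat.
pairs = [(0, 10.0), (1, 10.6), (2, 11.2), (3, 11.8)]
spb = estimate_tempo(pairs)                 # estimated seconds per beat
next_time = pairs[-1][1] + spb * 1.0        # schedule the next beat of
                                            # accompaniment one beat ahead
```

Anticipating the next event time from the estimated tempo, rather than reacting only after the human plays, is what lets the accompaniment stay synchronized despite sensing and actuation latency.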
On the other hand, social robots have been developed to interact with humans or other agents following certain rules of social behavior. Many studies have shown that robot expression, especially humanoid expression, significantly increases engagement and interaction between humans and computer programs in many settings, such as telecommunication [10] and dialog systems [11]. However, music interaction, as a form of high-level social communication, has received little attention in this context. Though several music robots have been developed, none is yet able to react to other musicians with human-like expression.
Automatic accompaniment and social robotics can clearly complement each other. Therefore, we integrated the saxophonist robot developed at Waseda University into an existing automatic accompaniment framework. Specifically, the system currently takes a human musician's MIDI flute performance as input and outputs acoustic accompaniment with gestural and facial expression. The (larger-scale) gestural expression reacts to music phrases, while the (smaller-scale) facial expression reacts to local tempo changes. Of course, this first integration does not cover all aspects of gestural and facial expression; the current solution considers body and eyebrow movements, and we believe that other aspects of expression can be handled in a similar way.
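The two-scale mapping described above (body gestures at phrase boundaries, eyebrow motion for local tempo changes) could be sketched as follows. The function names, the linear deviation-to-angle mapping, and the ±10° range are hypothetical assumptions for illustration only, not the robot's actual control interface.

```python
def eyebrow_command(local_spb: float, base_spb: float,
                    max_deg: float = 10.0) -> float:
    """Map local tempo deviation to an eyebrow angle in degrees:
    playing faster than the reference (smaller seconds-per-beat)
    raises the brows, slower lowers them. Clamped to a hypothetical
    servo range of +/- max_deg."""
    deviation = (base_spb - local_spb) / base_spb  # > 0 when faster
    return max(-max_deg, min(max_deg, deviation * max_deg))

def gesture_for_beat(beat: float, phrase_boundaries: set) -> str:
    """Trigger a large body gesture at phrase boundaries; otherwise
    keep a small swaying motion between them."""
    return "phrase_gesture" if beat in phrase_boundaries else "sway"
```

Separating the two time scales keeps the control simple: the slow gesture channel only needs phrase-boundary events from the score, while the fast facial channel reuses the tempo estimate the accompaniment already maintains.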
Copyright: © 2016 Guangyu Xia et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.