Is Putting a Face on a Robot Worthwhile?

Enas Altarawneh, Michael Jenkin and I. Scott MacKenzie¹

¹ Authors are with the Department of Electrical Engineering and Computer Science, York University, Toronto, Canada. {enas, jenkin, mack}@eecs.yorku.ca

Abstract— Putting an animated face on an interactive robot is great fun, but does it actually make the interaction more effective or more useful? To answer these questions, human-robot interactions using text, audio, a realistic avatar, and a simplistic cartoon avatar were compared in a user study with 24 participants. Participants expressed a high level of satisfaction with the accuracy and speed of all the interfaces used. Although the response time was longer for both the cartoon and realistic avatar interfaces (due to their increased computational cost), this had no effect on participant satisfaction. Participants found the avatar interfaces more fun to use than the traditional text- and audio-based interfaces, but there was no significant difference between the two avatar-based interfaces. Putting a face on a robot may make a robot more fun to interact with, and the face may not have to be that realistic.

I. INTRODUCTION

From their experience with robots on TV and in the movies, naïve users expect robots to present human-like appearances and expressions, and to respond in a natural and appropriate manner. But are such approaches really what the user wants, and are they really effective? Is it better to put an animated face on the robot, or is a traditional text or audio interface more effective? To consider these questions, we used an approach similar to Liang et al. [1] to evaluate the relative performance, in responding to queries, of a text-only response (T), an audio-only response (A), a cartoon 3D avatar response (CA) (Fig. 1a), and a realistic avatar response (RA) (Fig. 1b).

All interfaces used a common underlying speech recognition and knowledge engine to obtain text responses to participant queries; a simplified code sketch of this shared pipeline appears at the end of this section. The text interface displayed the response as text on a screen and then displayed a text prompt indicating that the interface was ready for the next question. The audio interface generated an audio response, played it, and then displayed a text prompt indicating that the interface was ready for the next question. The cartoon avatar provided an audio response loosely synchronized with a cartoon avatar: it synchronized its lip motion with the audio using two visual states, mouth closed and mouth open, providing simple and computationally inexpensive lip synchronization with the audio responses. The realistic avatar played the audio response synchronized with the animated character. The design of the realistic avatar is sketched below; for a more complete description of the realistic avatar interface see [2].

A questionnaire was administered to each participant before the experiment and after their interaction with all of the interfaces. The responses captured participants' demographic data and their perceptions of the interfaces. After each interaction with a given interface, information related to that interface was gathered. The results presented here are part of a larger study [3]¹. The empirical evaluation and analysis follow methods detailed by MacKenzie [4]. Ethics approval for this study was granted by the Office of Research Ethics of an anonymous university.
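To make the interface architecture concrete, the listing below gives a minimal sketch of how such a pipeline could be organized. It is illustrative only, not the implementation evaluated here: the callables asr, knowledge_engine, and tts are placeholders for the shared back end, and the energy-threshold heuristic driving the cartoon avatar's two-state lip synchronization is an assumption, as are the frame rate and threshold values.

import numpy as np

def mouth_states(audio, sample_rate, fps=15, threshold=0.02):
    # Two-state lip sync: label a video frame "open" when the RMS energy
    # of its audio slice exceeds the threshold, and "closed" otherwise.
    n = sample_rate // fps
    return ["open" if np.sqrt(np.mean(audio[i:i + n] ** 2)) > threshold
            else "closed"
            for i in range(0, len(audio), n)]

def answer_query(query_audio, condition, asr, knowledge_engine, tts):
    # Dispatch one query through the shared back end, then render the
    # response according to the interface condition (T, A, CA, or RA).
    text = knowledge_engine(asr(query_audio))  # common to all conditions
    if condition == "T":                       # text response on screen
        return {"text": text, "prompt": "Ready for the next question."}
    audio, sr = tts(text)                      # spoken response
    if condition == "A":                       # audio-only response
        return {"audio": audio, "prompt": "Ready for the next question."}
    if condition == "CA":                      # cartoon avatar, two states
        return {"audio": audio, "mouth": mouth_states(audio, sr)}
    return {"audio": audio, "animation": "realistic avatar"}  # RA; see [2]

The point of the dispatch is that all four conditions share the same recognition and knowledge back end; only the rendering step differs, which is what isolates presentation modality as the variable under study.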
II. PRIOR WORK

An artificially intelligent agent is an autonomous entity that observes the environment through sensors and acts upon it using actuators, directing its activity towards achieving a specific set of goals [5]. Intelligent agents have applications in almost every field. A common theme in intelligent agents is the use of anthropomorphic features as a mechanism to structure interactions with the user. Putting a "head" on the intelligent agent gives the user something to talk to. This concept of an interactive avatar can be found in interactive displays more generally. Interactive avatars and virtual agents have been used as the basis of the interface for a range of applications including home care monitoring and companionship (see [6]), and interactive avatars are commonplace in online shopping (see [1], [7], [8]). Interactive avatars are inherently multi-modal in nature and can enable a more intimate relation between the user and the avatar than is the case for more traditional user interface technologies [5]. But what is the appropriate set of interactive modalities to use in an intelligent agent, and what is the necessary fidelity of these modalities?

An interactive avatar typically relies on text-to-speech and speech understanding technologies to provide voice interaction and couples this with a synchronized visual display. Applications that use natural language as an interface engage in conversations as humans naturally do. There are many examples of this type of interaction, including commercial systems such as Siri [9], Alexa [10] and Cortana [11]. But what are the advantages and disadvantages of the various interaction approaches? For example, Medicherla and Sekmen [12] report results of a user study indicating that voice control and spatial reasoning ability were reliable indicators of efficiency in robot teleoperation. In that study, 75% of the subjects who demonstrated a high ability to apply spatial reasoning favored voice control over manual control. But are voice-based interfaces preferred? Voice-based approaches can produce realistic audio, but human

¹ The financial support from anonymous projects is gratefully acknowledged.