TALKING HEADS AND SYNTHETIC SPEECH: AN ARCHITECTURE FOR SUPPORTING ELECTRONIC COMMERCE

Jörn Ostermann, David Millen
AT&T Labs - Research
100 Schultz Dr., Red Bank, NJ 07701, USA
email: {osterman,drm}@research.att.com

ABSTRACT

Facial animation has been combined with text-to-speech synthesis to create innovative multimodal interfaces. In this paper, we present an architecture for such a multimodal interface. A face model is downloaded from a server into a client. The client uses an MPEG-4 compliant speech synthesizer that animates the head. The server sends text and animation data to the client in addition to the regular content to be displayed in a web browser. We believe that this architecture can support electronic commerce by providing a friendlier, more helpful, and more intuitive user interface than a regular web browser.

In order to substantiate these claims, we conducted experiments to understand user reactions to interactive services designed with synthetic characters. In one experiment, participants played the 'Social Dilemma' game with the computer as a partner. The results indicate that users cooperate more with a computer when an animated face represents it during the game. A simulated commercial application was also evaluated, comparing facial animation, text-to-speech, and text-only conditions. According to the results, the use of facial animation in the design of interactive services was rated favorably for most of the attributes in these experiments. Further, the results show that facial animation can effectively fill application waiting times and make delays more acceptable to users.

1. INTRODUCTION

Computer simulation of human faces has been an active research area for some time, resulting in the development of a variety of facial models and several animation systems [3][4][7][13][15][21][24].
The advances in animation systems, such as those mentioned above, have prompted interest in the use of animation to enrich the human-computer interface, and prompted ISO to support the animation of talking faces and bodies in the MPEG-4 standard [9][10][11][17][19][20]. One important application of animated characters has been to make the human-computer interface more compelling and easier to use. For example, animated characters have been used in presentation systems to attract the user's focus of attention, to guide the user through the steps of a presentation, and to add expressive power by presenting nonverbal conversational and emotional signals [1][22]. Animated guides or assistants have also been used with some success in user help systems [2][6][8] and for user assistance in web navigation [16].

Character animation has also been used in the interface design of communication and collaboration systems. Several multi-user systems currently use avatars, which are animated representations of individual users [21][24]. In many cases, the avatar authoring tools and online controls remain cumbersome. The social cues needed to mediate social interaction in these new virtual worlds have been slow to develop, resulting in frequent communication misunderstandings [9]. Nevertheless, the enormous popularity of Internet chat applications suggests considerable future use of avatars in social communication applications.

In Section 2, we present the client and server architecture for supporting web-based applications such as electronic commerce with text-to-speech (TTS) synthesis and facial animation (FA). In Section 3, we show the use of facial animation in an information kiosk and present the results of subjective tests using this kiosk. In Section 4, we present the 'Social Dilemma' experiment, which shows how FA and TTS can influence users in their interaction with a computer.

2. ARCHITECTURE FOR TTS AND FACIAL ANIMATION FOR WEB-BASED APPLICATIONS

In order to enable web-based FA on a client, the client requires a web browser, a TTS synthesizer, and an FA renderer (Figure 1). Settings such as the speech rate of the TTS are usually determined by the preferences of the user and are therefore not known to the server. In order to enable synchronized speech and facial animation, the TTS must provide phonemes and related timing information to the FA renderer. Using a coarticulation model [4], the renderer can then animate a model downloaded from a server and move its mouth synchronously with the speech of the TTS.

Driving the face model using only the text of the TTS does not allow for animating non-speech actions such as smiles or head nods. Therefore, the TTS has to handle bookmarks that contain these facial animation parameters (FAPs). The bookmarks are placed in the text, and their timing is determined by the TTS from the start time of the word following the bookmark [11][18][23].

The server for this client comprises a web server, a TTS/FA server, and a database of face models (Figure 1). They are controlled by the application, which could be implemented as a CGI script. When the application is started by a client request, it downloads a face model from the model library to the client. We use VRML [12] as the file format for the face models. In
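The bookmark timing rule described above (a bookmark fires at the start time of the word that follows it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<FAP:...>` bookmark syntax, the function and class names, and the fixed per-word duration are all hypothetical stand-ins for the phoneme-level timing a real TTS would report.

```python
import re
from dataclasses import dataclass


@dataclass
class FapEvent:
    """A facial animation parameter (FAP) event scheduled at a start time."""
    fap: str
    start_ms: int


def schedule_bookmarks(text: str, ms_per_word: int = 400):
    """Scan text for <FAP:...> bookmarks interleaved with words.

    Each bookmark fires at the start time of the word following it,
    matching the timing rule in the text. The fixed ms_per_word is a
    placeholder for real per-phoneme durations from the synthesizer.
    """
    tokens = re.findall(r"<FAP:([^>]+)>|(\S+)", text)
    words, events, pending, t = [], [], [], 0
    for fap, word in tokens:
        if fap:
            # Bookmark: defer until the next word's start time is known.
            pending.append(fap)
        else:
            # Word: flush pending bookmarks at this word's start time.
            events.extend(FapEvent(f, t) for f in pending)
            pending.clear()
            words.append((word, t))
            t += ms_per_word
    return words, events
```

For example, `schedule_bookmarks("Welcome <FAP:smile> back")` schedules the smile FAP at 400 ms, the start time of the word "back". Deferring each bookmark until the next word is seen is what lets the renderer align non-speech actions with the speech stream without the server knowing the client's speech rate.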