Simplifying the Integration of Virtual Humans into Dialog-like VR Systems

Yvonne Jung*   Johannes Behr†
Fraunhofer Institut für Graphische Datenverarbeitung
Technische Universität Darmstadt
Darmstadt, Germany

ABSTRACT

In this paper we describe an X3D-based framework that simplifies the integration of virtual characters into dialog-based VR systems by introducing another level of abstraction on top of X3D: a higher-level language that can be used for module communication and for coordinating the conversational behavior of virtual humans. We propose a self-contained, integrated system with matching techniques and building blocks that not only provides flexible control of the character, but also accounts for the resulting dependencies that must be simulated at runtime. Furthermore, our system takes physiological processes into account, which is essential for the correct perception of some emotions in the context of nonverbal communication. Our approach thus gains efficiency by integrating into more abstract system architectures, into well-established visualization techniques such as the scene graph, and into existing open standards.

Index Terms: H.5.1 [Information Interfaces and Presentation (e.g., HCI)]: Multimedia Information Systems—Artificial, augmented, and virtual realities; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation

1 INTRODUCTION

During the past few years there has been an increasing interest in virtual characters – not only in Virtual Reality (VR), computer games, or online communities such as Second Life, but also in dialog-based systems such as tutoring systems and edutainment and infotainment applications.
This is directly associated with the major challenges of Human-Computer Interface (HCI) technologies in general and immersive Virtual Reality concepts in particular, as both aim at developing intuitive man-machine interfaces instead of the standard WIMP style of human-computer interaction, which has essentially not changed for more than two decades.

Suitable interaction metaphors for dialog-based systems are guidance (which can be achieved through narration and thus digital storytelling techniques) and natural dialog (by providing conversational user interfaces with responsive virtual humans) – the latter is an ability that people practice every day in every face-to-face conversation. By simulating communicative behavior, including verbal and nonverbal communication, natural user interfaces can thus be provided. Both concepts are primarily typical areas of research in Interaction Design and Artificial Intelligence (AI), and usually follow a goal- and communication-driven approach. This is often achieved in combination with an ontology (for knowledge management) by first defining certain goals at a very high level of abstraction (e.g. "Explain usage of device X"), which are then further refined. Mostly this is done by different modules that are responsible for dialog generation, speech synthesis, gesture control, adaptation to various emotions, etc. Typically, each module adds more information until the result is concrete enough to be visualized by a rendering engine.

*e-mail: yvonne.jung@igd.fraunhofer.de
†e-mail: johannes.behr@igd.fraunhofer.de

Human-like communication requires synchronicity and consistency between modalities (e.g. speech with lip-sync and the corresponding gesture or posture) as well as plausibility of appearance and behavior.
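The goal-refinement process described above can be pictured as a chain of modules, each enriching an abstract request until it is concrete enough for the renderer. The following is a minimal illustrative sketch, not the authors' implementation; all module and field names are hypothetical.

```python
# Hypothetical sketch of goal-driven refinement: each module adds
# detail to an abstract goal until the request is concrete enough
# for a rendering engine. All names below are illustrative only.

from dataclasses import dataclass, field


@dataclass
class Request:
    """An abstract communicative goal, progressively refined."""
    goal: str
    data: dict = field(default_factory=dict)


def dialog_module(req: Request) -> Request:
    # Turn the abstract goal into a concrete utterance (stubbed).
    req.data["utterance"] = f"Let me explain: {req.goal}"
    return req


def gesture_module(req: Request) -> Request:
    # Attach an accompanying gesture to the utterance.
    req.data["gesture"] = "point_at_device"
    return req


def emotion_module(req: Request) -> Request:
    # Adapt delivery to the character's current emotional state.
    req.data["emotion"] = "friendly"
    return req


def refine(goal: str) -> Request:
    """Run the request through every refinement module in order."""
    req = Request(goal)
    for module in (dialog_module, gesture_module, emotion_module):
        req = module(req)
    return req


if __name__ == "__main__":
    result = refine("Explain usage of device X")
    print(result.data)
```

In a real pipeline each stage would be a separate component (dialog planner, speech synthesis, gesture control), communicating through the higher-level language proposed in this paper rather than through an in-process data structure.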
This requires extensive functionality at a lower level of abstraction, including the integration of all relevant elements into the system as transparently as possible. Various demands therefore need to be met, especially in the context of multi-disciplinary collaboration between computer graphics, AI, cognitive psychology, and so on. Due to the complexity of this topic, this first of all calls for manageable, modular system architectures. But currently application development is difficult and inefficient, particularly when a suitable infrastructure for the tool chain and content creation pipeline (which still requires expensive tools, time, and manpower) has not yet been built up. A good example is the games industry, where every company has its own tools and engines.

Although the interdependence of different modalities has to be considered, there often exist only standalone systems for specific applications like chat rooms or a special type of game on the one hand, and specialized tools as well as isolated applications (e.g. for simulating complex hairstyles) on the other. Furthermore, the techniques used are usually not applicable to other types of applications, and there are no readily usable standard components, nor even common standards for low- and high-level behavior description. And last but not least, in the area of dialog systems research normally focuses on face and body animation while ignoring issues concerning rendering and simulation, although increasingly powerful computers and graphics cards allow for more realistic character and scene design.

In order to alleviate some of these problems and to provide a basis for a sustainable solution, we propose a framework that builds on the open ISO standard X3D [25], which is used as the application description language. Furthermore, we simplify character control by dividing its various aspects into different layers of complexity, as shown in Figure 1.
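To make the layered division concrete, the following sketch separates a high-level control layer, which accepts abstract behavior requests, from a low-level execution layer, which supplies the primitives that actually drive the character. This is an illustrative sketch under assumed names ("greet", `play_animation`, `speak`), not the framework's actual API.

```python
# Illustrative sketch (not the authors' implementation) of splitting
# character control into a control layer and an execution layer.

from typing import Callable


class ExecutionLayer:
    """Low-level building blocks that actually drive the character."""

    def __init__(self) -> None:
        self.log: list[str] = []  # records issued primitives, for inspection

    def play_animation(self, name: str) -> None:
        self.log.append(f"anim:{name}")

    def speak(self, text: str) -> None:
        self.log.append(f"tts:{text}")


class ControlLayer:
    """Maps abstract behavior requests onto execution-layer primitives."""

    def __init__(self, executor: ExecutionLayer) -> None:
        self.executor = executor
        # Hypothetical mapping from abstract behaviors to primitives.
        self.behaviors: dict[str, Callable[[], None]] = {
            "greet": lambda: (executor.play_animation("wave"),
                              executor.speak("Hello!")),
        }

    def perform(self, behavior: str) -> None:
        action = self.behaviors.get(behavior)
        if action is None:
            raise KeyError(f"unknown behavior: {behavior}")
        action()


if __name__ == "__main__":
    exe = ExecutionLayer()
    ControlLayer(exe).perform("greet")
    print(exe.log)
```

The point of the split is that scripting and behavior description only ever talk to the control layer, while rendering, animation, and simulation details stay encapsulated in the execution layer.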
This hierarchy can be roughly divided into a control layer for behavior description and scripting on the one hand, and an execution layer providing all the building blocks necessary to fulfill the requests of the control layer on the other. Finally, we further distinguish between consciously controlled actions like gestures or facial expressions and unconsciously occurring phenomena. The latter can either be psycho-physiological processes like crying and blushing, which are more or less ignored in current research, or adjoint effects that directly follow from the laws of physics and cannot be animated in advance (e.g. shadows or hair blowing in the wind).

The rest of the paper is organized as follows. In Section 2, related work is discussed. Section 3 gives a brief overview of the proposed framework architecture and describes the high-level control interface. Section 4 then outlines the corresponding low-level building blocks, and in Section 5 we conclude the paper.

2 RELATED WORK

As mentioned, dealing with embodied conversational agents (ECA) requires multi-disciplinary collaboration between different fields of