PARIS: FUSING VISION-BASED LOCATION TRACKING WITH STANDARDS-BASED 3D VISUALIZATION AND SPEECH INTERACTION ON A PDA

Stuart Goose, Sinem Güven†, Xiang Zhang, Sandra Sudarsky, Nassir Navab
Multimedia Technology Department, Siemens Corporate Research, Inc.
755 College Road East, Princeton, NJ 08540, USA
†Columbia University, 500 West 120th Street, New York, NY, 10027, USA

ABSTRACT

Industrial service and maintenance is by necessity a mobile activity, and the technology reported here aims to improve automated support for the technician in this endeavor. To this end, we developed a framework called PARIS (PDA-based Augmented Reality Integrating Speech) that executes entirely on a commercially available PDA equipped with a small camera and wireless support. Real-time computer vision-based techniques are employed to automatically localize the technician within the plant. Once localized, PARIS offers the technician a seamless multimodal user interface, juxtaposing a VRML augmented reality view of the industrial equipment in the immediate vicinity with a context-sensitive VoiceXML speech dialog concerning that equipment. Integration with the plant management software enables PARIS to access equipment status wirelessly in real time and present it to the technician accordingly.

1. INTRODUCTION AND MOTIVATION

Siemens is the world's largest supplier of products, systems, solutions and services in the industrial and building technology sectors. Service and maintenance is by necessity a peripatetic activity, and one continuing aspect of our research therefore focuses on improving automated support for this task. Another trend we have been pursuing is the application of 3D interaction and visualization techniques to the industrial automation domain. In recent years we have witnessed the remarkable commercial success of small-screen devices such as cellular phones and Personal Digital Assistants (PDAs).
Keyboards remain the most popular input device for desktop computers, but performing input efficiently on a small mobile device is more challenging. Speech interaction on mobile devices has gained currency in recent years, to the point where a significant proportion of mobile devices now support or include some form of speech recognition.

The ability to model real-world environments and augment them with animations and interactivity has benefits over conventional interfaces. However, navigation and manipulation in 3D graphical environments can be difficult and disorienting, especially when using a conventional mouse. Small sensors can report various data about the surrounding environment, relative movement, and so on; one such sensor is a small camera.

The hypothesis that motivated this research is that a camera, in conjunction with computer vision algorithms, could be exploited to provide location information which, in turn, could seamlessly and automatically drive navigation through a 3D graphical world representing selected elements of the real world. In addition to partially eliminating the complexity of 3D navigation, integrating context-sensitive speech interaction could further simplify and enrich the mobile interaction experience. Hence, the PARIS framework was developed for experimenting with the provision of mobile, context-sensitive, multimodal user interfaces for mobile maintenance.

Figure 1: A mobile maintenance technician using PARIS.

To the knowledge of the authors, this is the first reported VRML-based AR framework that executes entirely on a commercially available PDA. PARIS employs real-time vision algorithms to localize the technician and offers a multimodal user interface that synchronizes an augmented reality graphical view based on VRML [22] with a VoiceXML [21] speech-driven interface.
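To make the hypothesized coupling concrete, the sketch below shows one way a vision-derived location fix could drive both modalities at once: a detected marker ID is mapped to a VRML viewpoint name (for the graphical view) and a VoiceXML dialog URL (for the speech interface). This is a minimal illustration under assumed names, not the authors' implementation; the marker IDs, viewpoint names, and dialog URLs are all hypothetical.

```python
# Hypothetical mapping from vision-detected marker IDs to the interface
# context for the equipment at that location: a named VRML viewpoint to
# bind the 3D view, and a VoiceXML dialog URL to launch in parallel.
CONTEXTS = {
    7: {"viewpoint": "Pump_A_View",
        "dialog": "http://plant.example/dialogs/pump_a.vxml"},
    12: {"viewpoint": "Valve_B_View",
         "dialog": "http://plant.example/dialogs/valve_b.vxml"},
}

def on_marker_detected(marker_id):
    """Resolve a detected marker to a (viewpoint, dialog URL) pair,
    or None if the marker is not associated with any equipment."""
    ctx = CONTEXTS.get(marker_id)
    if ctx is None:
        return None
    return ctx["viewpoint"], ctx["dialog"]
```

The point of the dispatch table is that localization alone selects the context: the technician neither navigates the 3D scene by hand nor chooses a dialog, mirroring the seamless behavior the hypothesis describes.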
After automatically detecting when the technician enters the vicinity of a specific plant component, PARIS can engage him or her in a context-