Mobile Interaction with Remote Worlds: The Acoustic Periscope

Justinian Rosca    Sandra Sudarsky    Radu Balan    Dorin Comaniciu
Siemens Corporate Research, Inc.
755 College Road East, Princeton, NJ 08540 USA
+1 609 734 6500
{rosca, sudarsky, rvbalan, comanici}@scr.siemens.com

ABSTRACT
Strictly speaking, a periscope is an optical device that allows one to view and navigate the external environment. The acoustic periscope is a metaphor for mobile interaction that transparently exploits audio/speech to navigate and provide an unobstructed view of a scene in a real or virtual world. We aim at both true mobility – no strings or devices should be attached to the human user in order to navigate – and at a smart multimodal interface. The implementation of our concept highlights an underestimated modality, the acoustic one, for making computers transparent to the actual interaction of the user with a remote world and for advancing in the direction of ubiquitous computing. In this paper we describe the basic principles, architecture and implementation of a system for ubiquitous, multimodal and easy visual access to a remote world based on the acoustic periscope idea. In order to assemble the required functionality we resort to audio signal processing (in particular array signal processing) for location and orientation estimation, speech recognition and text-to-speech synthesis for natural language interaction, mobile computing, communication in a LAN/Bluetooth network, and streaming of data from, or control of, a remote telerobotic platform with vision capabilities.

Keywords
Virtual Reality, Multi-Modal Interaction, PDA, Ubiquitous Computing, Mobile Interaction, Smart User Interface.

1. INTRODUCTION
Imagine a mobile robot carrying a tilt-and-pan controllable camera. Our robot is actually a vehicle that would let us remotely explore, for instance, the Rodin Museum of Art in Philadelphia after hours. To appreciate and enjoy sculpture, one has to depart from an apparently rigid and columnar structure and follow the dynamic flowing lines in Rodin's sculpture. For this, one has to be mobile around the sculpture. Can our robot do this and stream images to a remote display? How would one control it and its camera? Not remote mouse or key movements, please! The latter approach, although possible, is clearly awkward for this goal. What we would really like is to make the robot and its camera smoothly "fly" around the sculpture, and stream the corresponding images to the user display. Moreover, we want the user to actively use her body to search for knowledge in this process. Can the user just naturally and transparently move in her environment with a PDA in her hand, and have the robot follow a similar trajectory in its real (or virtual) environment and stream images onto the PDA? We impose one final constraint: the user should not be tethered in the environment, and the whole process should not necessitate expensive position and orientation sensors mounted on the user's head. Localization and orientation, if possible, should be naturally based on the user's voice and the synthesized speech answers generated by the PDA. The user communicates with her PDA mostly by speech.
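To make the voice-based localization idea concrete, the sketch below estimates the time difference of arrival (TDOA) of the talker's speech at a pair of room microphones using GCC-PHAT, a standard array-signal-processing technique for this task. The paper does not disclose its actual localization algorithm, so the function name, parameters, and sign convention here are illustrative assumptions only.

import numpy as np

def gcc_phat_delay(x, y, fs, max_tau=None):
    """Relative delay (seconds) between two mic signals via GCC-PHAT.
    Under this sign convention, a negative value means y lags x."""
    n = 2 * max(len(x), len(y))        # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = X * np.conj(Y)                 # cross-power spectrum
    R = R / (np.abs(R) + 1e-12)        # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    if max_tau is not None:            # limit search to physically possible delays
        max_shift = min(int(round(fs * max_tau)), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: two mics 0.50 m apart, 16 kHz sampling; the largest
# physically possible delay is d / c = 0.50 / 343 s.
fs = 16000
src = np.random.randn(fs)              # white-noise stand-in for the talker
mic1 = src
mic2 = np.roll(src, 12)                # crude delayed copy: mic2 lags by 12 samples
tau = gcc_phat_delay(mic1, mic2, fs, max_tau=0.50 / 343.0)
print(tau * fs)                        # approx. -12.0 (mic2 lags mic1)

Pairwise delays from three or more microphones can then be combined, for example by a least-squares intersection of the resulting hyperbolic constraints, to estimate the talker's position in the room.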
Visiting the museum after hours by letting the user herself be a virtual acoustic-based periscope in the "other" world would be interesting and intriguing. Perhaps more realistic is an industrial application, such as smoothly exploring in 3-D the hidden intricacies of a barely accessible piece of machinery for diagnosis or repair, or exploring a high-risk industrial environment. Many other virtual reality or telerobotics applications are possible by means of the acoustic periscope technique.

This paper describes the architecture and an implementation of our virtual periscope approach in a natural, unobtrusive and inexpensive way. The user carries only the PDA, which represents both the virtual window into the other world (a museum, or any real or virtual environment to explore) and the mobile device for dictation and speech commands. The "Rodin museum" experiment is possible in a dedicated room where a system of microphones makes it possible to localize sources of sound (the human user, the PDA).

The structure of the paper is as follows. Section 2 defines more precisely some of the concepts used here and throughout the paper. Section 3 describes the architecture of the system for mobile interaction and telecontrol. Section 4 discusses various implementation issues. Section 5 presents a hardware realization of our system. Finally, we summarize this effort and present some challenges for present and future work.

2. SCOPE AND RELATED WORK
The acoustic periscope metaphor and the applications reviewed here can be viewed from several well-established perspectives: virtual reality, artificial reality, and augmented reality. Below we define these main terms and highlight the nuances exploited in our scenario.

Virtual reality is the process of actively stepping inside (to see, hear, act upon) a computer-generated, virtual environment. It usually assumes the use of a head-mounted audio/video display and position and orientation sensors [1],[2]. This is the general scenario we use, although the applications we mention here do not exploit a virtual world – they could as well do that. As a virtual reality does, we also simulate another place to the user by present-