PAPER
Man Machine Interaction Using A Vision System with Dual Viewing Angles

Ying-Jieh HUANG†, Member, Hiroshi DOHI††, Nonmember and Mitsuru ISHIZUKA††, Member

SUMMARY This paper describes a vision system with dual viewing angles, i.e., wide and narrow viewing angles, and a user-friendly speech dialogue environment based on this vision system. The wide viewing angle provides a wide viewing field for wide-range motion tracking, while the narrow viewing angle follows a target within the wide viewing field to capture an image of the target at sufficient resolution. For fast and robust motion tracking, modified motion energy (MME) and existence energy (EE) are defined to detect the motion of the target and extract the motion region at the same time. Instead of relying on a physical device such as the foot switch commonly used in speech dialogue systems, our system detects the beginning and end of an utterance from the movement of the user’s mouth. Rather than recognizing the movement of the lips directly, the system tracks the shape variation of the region between the lips, which gives more stable recognition of the span of a dialogue. Without any special hardware, the tracking speed is about 10 frames/sec when no recognition is performed and about 5 frames/sec when both tracking and recognition are performed.
key words: vision system, dual viewing angles, speech dialogue system, motion tracking, mouth pattern recognition

1 Introduction

During the last thirty years, a major research goal in the computer systems field has been to make computers intelligent, to work with us, and to be our helpers. According to a survey on human computer interface programming [1], an average of 48% of the code in today’s applications is devoted to the user interface. Despite so much effort, however, computers today still remain difficult to use in everyday life.
Users have to sit in front of an output device or wear a cumbersome device such as goggles, and they must type on a keyboard, move a mouse, or click buttons to express their intentions. The limitations of the interface between the user and the computer restrict the integration of computing power into various human tasks and daily lifestyles. Computer vision makes it possible for a user to use any convenient object as an input signal. Such objects include the orientation of the head [2][3], the gaze direction of the eyes [4][5][6], fingertips [7], hand gestures [8], mouth movement [9][10] and even facial expression [11]. Computer vision is therefore a key component in realizing freer and friendlier human interfaces.

Since the human eye is one of the most highly developed and best studied visual systems, many vision systems are modeled on it. The vision systems developed so far are summarized in Table 1 according to the number of cameras used. Table 1 lists only the capabilities of each system, regardless of how well they are performed. More detailed information about vision systems can be found, for example, in [12].

Table 1 The summary of vision systems

System                             Tracking range  Resolution               Vergence  3D information acquisition  Comment
Monocular vision system            Wide            single uniform/variable  no        no                          background compensation is needed
Binocular vision system            Narrow          single uniform           yes       yes                         no gaze selection
Trinocular vision system           Wide            two uniform              yes       yes                         the use of the third camera is not yet reported
Two cameras vision system          Wide            two uniform              no        no                          only one resolution is used
Dual viewing angles vision system  Wide            two uniform/variable     no        yes

When computer vision is used for human computer interaction, attentive visual search is one of the important factors. A complete human computer interaction should start automatically when a user enters the viewing field of the system and end when the user leaves it.
This means that computer vision for human computer interaction should be able to become aware of the presence of a user automatically. The image resolution required for recognizing the actions of the user is clearly not the same as that required for tracking the motion of the user. This implies that using only one resolution for human action recognition is insufficient.

Manuscript received May 16, 1996.
Manuscript revised May 14, 1997.
† The author is with the Information and Communication R&D Center of Ricoh Co. Ltd., Yokohama-shi, 222 Japan.
†† The authors are with the Dept. of Information and Communication Engineering, the University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan.

In this paper, we describe a vision system with dual viewing angles for human computer interaction. The motion tracking and feature recognition of a user in front of the system are performed at different image resolutions, and a spontaneous dialogue environment constructed with this vision system