Proceedings of ICVS 2003, pp. 257-267
April 2003, Graz, Austria

VICs: A Modular Vision-Based HCI Framework

Guangqi Ye, Jason Corso, Darius Burschka, and Gregory D. Hager
The Johns Hopkins University
Computational Interaction and Robotics Laboratory
cips@cs.jhu.edu

Abstract. Many Vision-Based Human-Computer Interaction (VB-HCI) systems are based on the tracking of user actions. Examples include gaze-tracking, head-tracking, finger-tracking, and so forth. In this paper, we present a framework that employs no user-tracking; instead, all interface components continuously observe and react to changes within a local image neighborhood. More specifically, components expect a pre-defined sequence of visual events called Visual Interface Cues (VICs). VICs include color, texture, motion, and geometric elements, arranged to maximize the veridicality of the resulting interface element. A component is executed when this stream of cues has been satisfied. We present a general architecture for an interface system operating under the VIC-Based HCI paradigm, and then focus specifically on an appearance-based system in which a Hidden Markov Model (HMM) is employed to learn the gesture dynamics. Our implementation of the system successfully recognizes a button-push with a 96% success rate. The system operates at frame rate on standard PCs.

1 Introduction

The promise of computer vision for human-computer interaction (HCI) is great: vision-based interfaces would allow unencumbered, large-scale spatial motion.
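The cue-parsing idea described in the abstract — a component that watches one local image neighborhood and fires its action only after an ordered stream of visual cues has been satisfied — can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation; the class name, the two example cue predicates, and their thresholds are all hypothetical.

```python
import numpy as np

class VICComponent:
    """Sketch of a VIC-style interface component: it observes a fixed
    local region of each frame and steps through an ordered sequence of
    cue predicates; the bound action fires only once every cue in the
    stream has been satisfied, after which the parser resets."""

    def __init__(self, region, cues, action):
        self.region = region   # (x, y, w, h) of the watched neighborhood
        self.cues = cues       # ordered list of predicates(patch, prev_patch)
        self.action = action   # callback run when the full stream is satisfied
        self.state = 0         # index of the next expected cue
        self.prev = None       # previous patch, for motion-style cues

    def observe(self, frame):
        x, y, w, h = self.region
        patch = frame[y:y + h, x:x + w]
        if self.prev is not None and self.cues[self.state](patch, self.prev):
            self.state += 1
            if self.state == len(self.cues):   # whole cue stream satisfied
                self.state = 0
                self.action()
        self.prev = patch

# Hypothetical cues: coarse motion in the patch, then a dark occluder
# (e.g., a fingertip) covering it. Thresholds are illustrative only.
motion = lambda p, q: np.mean(np.abs(p.astype(float) - q.astype(float))) > 10
dark = lambda p, q: p.mean() < 50
```

A grayscale frame stream would then be fed to `observe` once per frame; richer cues (color, texture, geometry) slot into the same predicate interface.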
They could make use of hand gestures, movements, or other similar input means; and video itself is passive, (now) cheap, and (soon) nearly universally available. In the simplest case, tracked hand motion and gesture recognition could replace the mouse in traditional applications. But computer vision offers the additional possibility of defining new forms of interaction that make use of whole-body motion, for example, interaction with a virtual character [17].

A brief survey of the literature (see Section 1.1) reveals that most reported work on vision-based HCI relies heavily on visual tracking and visual template recognition algorithms as its core technology. While tracking and recognition are, in some sense, sufficient for developing general vision-based HCI, one might ask if they are always necessary and, if so, in what form. For example, complete, constant tracking of human body motion, while difficult because of complex kinematics [21], might be a convenient abstraction for detecting that a user's hand has touched a virtual "button," but what if that contact can be detected using simple motion or color segmentation? What if the user is not in a state