CAMSHIFT Tracker Design Experiments with Intel OpenCV and SAI

Alexandre R.J. François
Institute for Robotics and Intelligent Systems
University of Southern California
afrancoi@usc.edu

July 2004

Abstract

When humans interact with computer systems, they expect the experience to meet human standards of reactiveness, robustness and, if possible, non-intrusiveness. In order for computer vision techniques to have a significant impact on human-computer interaction, the development of efficient and robust algorithms, as well as their integration and operation as part of complex (including multi-modal) systems, must be specifically addressed. This report describes design and implementation experiments for CAMSHIFT-based tracking systems using Intel's Open Computer Vision library and SAI (Software Architecture for Immersipresence), a software architecture model created specifically to address the integration of different solutions to technical challenges, developed independently in separate fields, into working systems that operate under hard performance constraints. Results show that the SAI formalism is an enabling tool for designing, describing and implementing robust systems of efficient algorithms.

Keywords: Software Architecture, Perceptual User Interface, Human-Computer Interaction.

1 Introduction

When humans interact with computer systems, they expect the experience to meet human standards of reactiveness, robustness and, if possible, non-intrusiveness. Reactiveness can be expressed in terms of perceived system latency (the delay between a user's action and the perception of that action's effect on the system). Perceived latency results from the actual latencies and throughputs of the various processes involved in the system and their relationships. Robustness refers to the system's ability to cope with unexpected situations.
Non-intrusive Human-Computer Interaction (HCI) modalities are grouped under the term Perceptual User Interfaces (PUIs) [13], a field in which computer vision should find ample application. Many image and video processing algorithms are now available that can be implemented to operate in real time. However, simplicity and robustness seem mutually exclusive, and very few vision systems fulfill both reactiveness and robustness requirements. Efforts to improve the robustness of simple and efficient techniques usually result in more complex and over-specialized algorithms that are not well suited for use in real-time systems (even with the help of Moore's law). The work reported here is driven, in part, by the belief that computer vision performance on par with human expectations and abilities will be achieved by designing and implementing robust systems of efficient (but fallible) algorithms.

In mainstream Computer Vision, the jump from algorithm to system is often taken for granted and over-simplified. Most published algorithms are tested in proof-of-concept systems whose design is not given much consideration. Intel's Open Computer Vision library [2] collects a large set of standard data structures and efficient implementations of computer vision algorithms. How these algorithms may be used to design and implement real applications, or software systems, is outside the scope of the library. Among the various models available to programmers, dataflow architectures, an example of which is Microsoft's DirectShow architecture [11], have become popular for video processing systems. Dataflow models, however, are not suitable for all types of applications, and in fact are particularly ill-suited for the design of interactive systems [12]. In order for computer vision techniques to have a significant impact in HCI in general, and PUIs in particular, their integration and operation as part of complex (including multi-modal) systems must be specifically addressed.