CAMSHIFT Tracker Design Experiments with Intel OpenCV and SAI

Alexandre R.J. François
Institute for Robotics and Intelligent Systems
University of Southern California
afrancoi@usc.edu

July 2004

Abstract

When humans interact with computer systems, they expect the experience to meet human standards of reactiveness, robustness and, if possible, non-intrusiveness. In order for computer vision techniques to have a significant impact on human-computer interaction, the development of efficient and robust algorithms, as well as their integration and operation as part of complex (including multi-modal) systems, must be specifically addressed. This report describes design and implementation experiments for CAMSHIFT-based tracking systems using Intel's Open Computer Vision library and SAI (Software Architecture for Immersipresence), a software architecture model created specifically to address the integration of different solutions to technical challenges, developed independently in separate fields, into working systems that operate under hard performance constraints. Results show that the SAI formalism is an enabling tool for designing, describing and implementing robust systems of efficient algorithms.

Keywords: Software Architecture, Perceptual User Interface, Human-Computer Interaction.

1 Introduction

When humans interact with computer systems, they expect the experience to meet human standards of reactiveness, robustness and, if possible, non-intrusiveness. Reactiveness can be expressed in terms of perceived system latency (the delay between a user's action and the perception of that action's effect on the system). Perceived latency results from the actual latencies and throughputs of the various processes involved in the system and their relationships. Robustness refers to the system's ability to cope with unexpected situations.
Non-intrusive Human-Computer Interaction (HCI) modalities are grouped under the term Perceptual User Interfaces (PUIs) [13], a field in which computer vision should find ample application. Many image and video processing algorithms are now available that can be implemented to operate in real time. However, simplicity and robustness seem mutually exclusive, and very few vision systems fulfill both reactiveness and robustness requirements. Efforts to improve the robustness of simple and efficient techniques usually result in more complex and over-specialized algorithms that are not well suited for use in real-time systems (even with the help of Moore's law). The work reported here is driven, in part, by the belief that computer vision performance on par with human expectations and abilities will be achieved by designing and implementing robust systems of efficient (but fallible) algorithms.

In mainstream Computer Vision, the jump from algorithm to system is often taken for granted and over-simplified. Most published algorithms are tested in proof-of-concept systems whose design is not given much consideration. Intel's Open Computer Vision library [2] collects a large set of standard data structures and efficient implementations of computer vision algorithms. How these algorithms may be used to design and implement real applications, or software systems, is outside the scope of the library. Among the various models available to programmers, dataflow architectures, an example of which is Microsoft's DirectShow architecture [11], have become popular for video processing systems. Dataflow models, however, are not suitable for all types of applications, and in fact are particularly ill-suited for the design of interactive systems [12]. In order for computer vision techniques to have a significant impact in HCI in general, and PUIs in particular, their integration and operation as part of complex (including multi-modal) systems must be specifically addressed.