A Cognitive Vision System for Action Recognition in Office Environments

C. Bauckhage, M. Hanheide, S. Wrede, and G. Sagerer
Bielefeld University, Faculty of Technology
P.O. Box 100131, 33501 Bielefeld, Germany
{cbauckha, mhanheid, swrede, sagerer}@techfak.uni-bielefeld.de

Abstract

The emerging cognitive vision paradigm is concerned with vision systems that evaluate, gather and integrate contextual knowledge for visual analysis. In reasoning about events and structures, cognitive vision systems should rely on multiple computations in order to perform robustly even in noisy domains. Action recognition in an unconstrained office environment thus provides an excellent testbed for research on cognitive computer vision. In this contribution, we present a system that consists of several computational modules for object and action recognition. It applies attention mechanisms, visual learning and contextual as well as probabilistic reasoning to fuse individual results and verify their consistency. Database technologies are used for information storage and an XML-based communication framework integrates all modules into a consistent architecture.

1. Motivation and Scientific Context

Although it has been part of the field from the very beginnings [1], the concept of cognitive computer vision systems has recently regained popularity. Its basic idea is to bring together and consolidate the achievements of artificial intelligence, automatic perception, machine learning and robotics. Consequently, Christensen [6] identifies the following characteristics of cognitive vision: It involves the acquisition, storage, retrieval and use of knowledge. It is not an end in itself but guides a system's perception and (re)action. Simultaneously, the capabilities to perceive and to act guide cognitive processes. Without perception and the possibility to manipulate or communicate perceived entities or events, knowledge cannot be acquired. Memory, however, is a limited resource.
Besides learning mechanisms, cognitive vision thus also implies attention control and a sense for relevance which comes along with the capability to forget irrelevant information. This requires flexible knowledge representation as well as functionalities for contextual reasoning and categorization. Together with the biologically motivated principle of multiple computations [9], categorization yields adaptability, flexibility and robustness.

In this paper, we will present first results of a joint research project on cognitive vision [27]. Its goal is to investigate and develop architectures and computational models for visual active memories (VAMs). These are systems which evaluate given facts or gather and integrate contextual knowledge for visual analysis. VAMs can learn new concepts and categories as well as new spatio-temporal relations. They can adapt to unknown situations and may be scaled to different domains.

Furthermore, the project investigates techniques for advanced interactive retrieval. The aim is to provide VAM interfaces so that memory content becomes available to a user. As an example, Fig. 1 shows experiments with a prototype of a mobile interface. Working in an everyday office environment, the user wears a head-mounted device equipped with cameras and a display. Information about recognized objects and results of user queries are visualized using augmented reality. On the other hand, by displaying status messages and prompts the system can communicate with its user. This closes the perception-action cycle; asking for manipulations of the environment in order to study their effects can accomplish interactive object and event learning.

Figure 1. Head-mounted cameras and display for augmented reality visualization of recognized objects and events in an office.

0-7695-2158-4/04 $20.00 (C) 2004 IEEE