International Journal of Computer Vision 38(1), 35–44, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

Learning to Recognize Visual Dynamic Events from Examples

MASSIMILIANO PITTORE
INFM-DISI, Università di Genova, Genova, Italy
pittore@disi.unige.it

MARCO CAMPANI
INFM-DIFI, Università di Genova, Genova, Italy
campani@fisica.unige.it

ALESSANDRO VERRI
Center for Biological and Computational Learning, MIT, Cambridge, MA, USA; INFM-DISI, Università di Genova, Genova, Italy
verri@ai.mit.edu

Abstract. This paper describes a trainable and flexible system able to recognize visual dynamic events, e.g. movements performed by different people, from a stream of images taken by a fixed camera. Each event is represented by a feature vector built from the spatio-temporal changes detected in the observed image sequence. The system neither attempts to recover the 3D structure nor assumes a prior model of the observed dynamic events. During training a supervisor identifies and labels the events of interest among those automatically detected by the system. At run time, previously unseen events are detected and classified on the basis of the available examples. Several experiments on real images are reported and the benefits of using Support Vector Machines for performing effective classification from a relatively small number of labeled examples and for building noise tolerant representations are discussed. Preliminary results indicate that the proposed system can also be applied with equally good results to the case in which the dynamic events are gestures performed by different people.

Keywords: dynamic events, pattern recognition, Support Vector Machines, computer vision systems

1. Introduction

Resting on the assumption that trainability is going to play an increasingly important role for building vision-based systems, this paper aims at showing that relatively complex visual tasks can be effectively learned without acquiring sophisticated models and reconstructing the 3D spatial structure.

We consider the problem of recognizing visual dynamic events from a stream of images taken by a fixed camera and describe a simple system which, with minor changes, can be adapted to a range of different applications, including surveillance and monitoring of people movements in indoor scenes and gesture recognition. Depending on the application, the considered dynamic event can be described in words as “somebody or something is moving from A to B” or “this gesture is being performed”. Each dynamic event is represented by a feature vector built from the sequence of spatio-temporal changes detected in the observed image stream with no explicit reconstruction of the observed 3D spatial structure. In the current version of the system, the problem of dealing with events of different time duration is solved by a global scaling of the time axis. Instead of assuming a prior model of the observed events, during training a supervisor identifies and labels