Digital Object Identifier (DOI) 10.1007/s00791-004-0143-2
Comput Visual Sci 8: 19–25 (2005)
Computing and Visualization in Science

Regular article

A study of motion recognition from video sequences

Xiang Yu, Simon X. Yang

Advanced Robotics and Intelligent Systems (ARIS) Lab, School of Engineering, University of Guelph, Guelph, N1G 2W1, Ontario, Canada (e-mail: syang@uoguelph.ca)

Received: 15 March 2003 / Accepted: 4 January 2004
Published online: 17 August 2004 – Springer-Verlag 2004

Communicated by: G. Wittum

Abstract. This paper proposes a method for recognizing human motions from video sequences, based on the cognitive hypothesis that there exists a repertoire of movement primitives in biological sensory motor systems. First, a content-based image retrieval algorithm is used to obtain statistical feature vectors from individual images. An unsupervised learning algorithm, the self-organizing map, is then employed to cluster these shape-based features. Motion primitives are recovered by searching the resulting time series based on the minimum description length principle. Experimental results of motion recognition from a 37-second video sequence show that the proposed approach can efficiently recognize the motions, in a manner similar to human perception.

1 Introduction

The analysis of human actions by a computer has attracted growing interest [3, 6, 7, 9, 12, 14, 16]. A significant part of this work is the recognition and modelling of human motions in video sequences, which provides a basis for applications such as human/machine interaction, humanoid robotics, animation, video database search, and sports medicine. For human/machine interaction, it is highly desirable that the machine understand the human operator’s actions and react accordingly. Remote control of a camera view is a good example. An operator monitors a site by watching through a remote camera.
The camera control system detects the movement of the operator’s head and eyes, estimates the operator’s point of interest, and automatically changes the camera’s view to that point, so that the operator can see what he is interested in simply by turning his head and eyes. The recognition of human motions is also important for humanoid robotics research. For example, imitation is a powerful means of skill acquisition for humanoid robots, i.e., a robot learns its motions by understanding and imitating the actions of a human model [9]. Another application is video database search. The increasing interest in the understanding of action or behaviour has led to a shift from static images to video sequences in computer vision.

A characteristic of the work on motion recognition is that it deals with a time series of video signals [12, 19]. A video sequence consists of many frames, which are the individual static images and generally the smallest unit with which we are concerned. A contiguous set of frames representing a continuous action in time and space is called a shot. Basically, video segmentation, also called video recognition, is the process of dividing a video sequence into its component shots.

A conventional solution to human motion recognition is based on a kinematics model. For example, Sidenbladh et al. [16] introduced a human body model in which the human body is represented by a collection of articulated limbs. In this case, an action is considered as a collection of time series describing the joint angles as they evolve over time. The motion recognition task can then be reduced, and probably oversimplified, to a parameter estimation problem, i.e., estimating the joint angles over the image series. One problem with this approach is how to decompose a time series into suitable temporal primitives in order to model these body angles.

Hidden Markov models (HMMs) have been widely used for the recognition of human action.
Bregler [3] proposed a probabilistic decomposition of human dynamics at multiple abstraction levels. At the low level, EM (expectation maximization) clustering is used to find coherent motion primitives. The middle-level categories are simple movements represented by dynamic systems, and the high-level complex gestures are represented by HMMs. The weakness of HMMs for modelling is that they do not capture well some of the intrinsic properties of biological motion, such as smoothness. Instead, human motions are often represented by explicit temporal curves that describe the change over time of 3D joint angles. In [3], the topology of the HMMs is obtained by learning a hybrid dynamic model on periodic motion, and it is difficult to extend to other types of motion, which could be more complex. Moreover, manually segmenting and labelling training data, as required in supervised learning approaches, is a tedious and error-prone process.

In this paper, we propose an unsupervised learning based approach to model and represent human motion in video sequences, as illustrated in Fig. 1. The basic idea comes from the evidence from psychophysics and neuroscience that motor