Real-time Human Activity Recognition using
External and Internal Spatial Features
Zaw Zaw Htike, Simon Egerton, Kuang Ye Chow
School of Information Technology, Monash University
Sunway Campus, Malaysia
{zaw, simon.egerton}@infotech.monash.edu.my, kuang.ye.chow@eng.monash.edu.my
Abstract— Human activity recognition has become very popular
in the field of computer vision. In this paper, we present a simple,
robust and computationally efficient algorithm, architecture and
implementation to recognise and classify human activities in real
time using very little training data. We employ a spatio-temporal
representation of human activities by combining trajectory
information and invariant spatial information of the subjects.
Activities are classified by a support vector machine (SVM) with
a radial basis kernel. Optimal parameters for the SVM are found
through 10-fold cross-validation. Experimental results
demonstrate that the proposed system is effective and efficient.
When tested on the Weizmann dataset, the system achieves a
recognition rate above 90% for one-shot learning, which exceeds the
benchmark scores reported in the literature. The system is also
robust against noise, deformation and variation in viewpoint, and it
operates efficiently in real time, making it deployable in
intelligent environments.
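As a hypothetical sketch of the classification stage described above, the following uses scikit-learn's `SVC` with an RBF kernel and selects its `C` and `gamma` parameters by 10-fold cross-validation. The feature vectors and activity labels here are random placeholders; the paper's actual features combine trajectory and invariant spatial information.

```python
# Sketch (not the authors' code): RBF-kernel SVM whose parameters are
# chosen by a 10-fold cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))       # stand-in spatio-temporal feature vectors
y = np.repeat(np.arange(5), 20)      # stand-in labels for 5 activity classes

# Candidate values for the regularisation and kernel-width parameters
# are assumptions for illustration only.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

clf = search.best_estimator_         # SVM with the cross-validated parameters
predictions = clf.predict(X[:3])     # classify three feature vectors
```

The grid-search object refits the SVM on all training data with the best-scoring parameter pair, so `best_estimator_` is ready for real-time prediction.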
Keywords-Human Activity Recognition; Real-time
I. INTRODUCTION
Human activity recognition is still an unsolved problem in
artificial intelligence and computer vision. It could be applied
to a vast array of domains such as video surveillance systems,
video indexing and retrieval systems, automatic sports
commentary systems, human computer interaction systems,
context-aware pervasive systems and human centred
applications. There are a number of reasons why human
activity recognition is hard. Firstly, a human body is non-rigid
and has many degrees of freedom, generating innumerable
variations in basic movements. Secondly, no two persons are
identical in terms of body shape, volume and coordination of
muscle contractions, making each person articulate unique
kinematics. In fact, many researchers exploit these unique
variations in gait style as a biometric identifier to recognise
individuals [1-2]. The above-mentioned problems are further
compounded by variations in viewpoint,
illumination, shadow, self-occlusion, deformation, noise,
clothing and so on. Thirdly, since there is an effectively unlimited
number of activities and combinations thereof, collecting
comprehensive training data is difficult. To address these problems, an
activity representation resilient to spatial and temporal
variations is required. We also need a model that learns human
activities from very few training examples. In other words, a
feature vector from one training sample should generalise well
to that of other similar activities. The system should also be
robust enough to work with low resolution data and low signal-
to-noise ratio.
II. RELATED WORK
Researchers have been attempting to solve human activity
recognition using a variety of techniques. Youngwook and
Hao [3] make use of a Doppler radar as an input device and
classify signals using a SVM. Zhenyu and Lianwen [4]
propose a recognition system based on a single tri-axis
accelerometer. Some researchers [5-7] make use of multi-
modal sensors and RFID devices in sensor-enabled ubiquitous
environments. However, those systems are neither cost-effective nor
practical to deploy outside laboratories.
Most researchers, therefore, resort to using standard cameras
as input devices, since the same systems can also be applied to
other domains such as video indexing and retrieval.
Any recognition system can be divided into two main parts:
feature extraction and feature classification [8]. There have
been numerous approaches and attempts of feature extraction
and feature classification in the literature of activity
recognition.
A. Feature extraction
For the feature extraction portion, most approaches in the
literature fall into three major categories: model-based
approaches, model-less approaches and hybrid approaches.
Model-based feature-extraction approaches attempt to
recover structural information of the human body and
construct a kinematical model of human motion. Features
related to relative displacement, velocity and acceleration of
head, torso, limbs and joints are extracted. Therefore, as many
body parts as possible need to be identified and extracted from
each image blob. In the presence of strong noise or occlusion,
however, extracting body parts is not straightforward.
Additionally, the observed kinematics of the
human body might be dependent upon the relative position of
the camera. Boeheim [9] proposes a thinning algorithm
operating on the silhouette to produce a six-segment
representation of the human figure. Limb parameters, such as
distance from torso, and angle of displacement from the
vertical axis, are derived to form a feature vector. Jen-Hui et
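The limb parameters described above can be illustrated with a minimal, hypothetical helper: given a torso reference point and the 2D endpoints of a limb segment, it computes the limb's distance from the torso and its angular displacement from the vertical axis. The function name and coordinates are invented for illustration and are not from [9].

```python
import math

def limb_features(torso, limb_start, limb_end):
    """Return (distance from torso, angle from vertical axis in radians)."""
    # Distance from the torso reference point to the limb's proximal end.
    dist = math.hypot(limb_start[0] - torso[0], limb_start[1] - torso[1])
    # Angle of the limb vector measured from the vertical (image y) axis;
    # atan2(dx, dy) gives 0 rad for a limb pointing straight down the image.
    dx = limb_end[0] - limb_start[0]
    dy = limb_end[1] - limb_start[1]
    angle = math.atan2(dx, dy)
    return dist, angle

# Example: a limb hanging vertically, one unit below the torso point.
d, a = limb_features(torso=(0.0, 0.0), limb_start=(0.0, 1.0), limb_end=(0.0, 2.0))
# d == 1.0, a == 0.0
```

Stacking such (distance, angle) pairs for each of the six segments yields a fixed-length feature vector of the kind the paragraph describes.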
al. [10] present a novel approach to extracting a torso-less model
2010 Sixth International Conference on Intelligent Environments
978-0-7695-4149-5/10 $26.00 © 2010 IEEE
DOI 10.1109/IE.2010.12
24