Real-time Human Activity Recognition using
External and Internal Spatial Features
Zaw Zaw Htike, Simon Egerton, Kuang Ye Chow
School of Information Technology, Monash University
Sunway Campus, Malaysia
{zaw, simon.egerton}@infotech.monash.edu.my, kuang.ye.chow@eng.monash.edu.my
Abstract— Human activity recognition has become very popular
in the field of computer vision. In this paper, we present a simple,
robust and computationally efficient algorithm, architecture and
implementation to recognise and classify human activities in real
time using very little training data. We employ a spatio-temporal
representation of human activities by combining trajectory
information and invariant spatial information of the subjects.
Activities are classified by a support vector machine (SVM) with
a radial basis kernel. Optimal parameters for the SVM are found
through 10-fold cross-validation. Experimental results
demonstrate that the proposed system is effective and efficient.
When tested on the Weizmann dataset, the system achieves a
recognition rate above 90% for one-shot learning, which exceeds the
benchmark scores reported in the literature. The system is also
robust against noise, deformation and variation in viewpoint, and it
operates efficiently in real time, making it deployable in
intelligent environments.
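As a hypothetical sketch of the classification stage described above, the following uses scikit-learn's `SVC` with an RBF kernel and selects its `C` and `gamma` parameters by 10-fold cross-validation. The feature vectors and activity labels here are random placeholders; the paper's actual features combine trajectory and invariant spatial information.

```python
# Sketch (not the authors' code): RBF-kernel SVM whose parameters are
# chosen by a 10-fold cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))       # stand-in spatio-temporal feature vectors
y = np.repeat(np.arange(5), 20)      # stand-in labels for 5 activity classes

# Candidate values for the regularisation and kernel-width parameters
# are assumptions for illustration only.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

clf = search.best_estimator_         # SVM with the cross-validated parameters
predictions = clf.predict(X[:3])     # classify three feature vectors
```

The grid-search object refits the SVM on all training data with the best-scoring parameter pair, so `best_estimator_` is ready for real-time prediction.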
Keywords-Human Activity Recognition; Real-time
I. INTRODUCTION
Human activity recognition is still an unsolved problem in
artificial intelligence and computer vision. It could be applied
to a vast array of domains such as video surveillance systems,
video indexing and retrieval systems, automatic sports
commentary systems, human computer interaction systems,
context-aware pervasive systems and human centred
applications. There are a number of reasons why human
activity recognition is hard. Firstly, a human body is non-rigid
and has many degrees of freedom, generating innumerable
variations in basic movements. Secondly, no two persons are
identical in terms of body shape, volume and coordination of
muscle contractions, making each person articulate unique
kinematics. In fact, many researchers exploit these unique
variations in gait style as a biometric identifier to recognise
individuals [1-2]. The above-mentioned problems are further
compounded by variations in viewpoint,
illumination, shadow, self-occlusion, deformation, noise,
clothing and so on. Thirdly, since there is an effectively unlimited
number of activities and combinations thereof, collecting
comprehensive training data is difficult. To address these problems, an
activity representation resilient to spatial and temporal
variations is required. We also need a model that learns human
activities from very few training examples. In other words, a
feature vector from one training sample should generalise well
to that of other similar activities. The system should also be
robust enough to work with low resolution data and low signal-
to-noise ratio.
II. RELATED WORK
Researchers have been attempting to solve human activity
recognition using a variety of techniques. Youngwook and
Hao [3] make use of a Doppler radar as an input device and
classify signals using a SVM. Zhenyu and Lianwen [4]
propose a recognition system based on a single tri-axis
accelerometer. Some researchers [5-7] make use of multi-
modal sensors and RFID devices in sensor-enabled ubiquitous
environments. However, those systems are neither cost-effective nor
practical to deploy outside laboratories.
Most researchers, therefore, resort to using standard cameras
as input devices, since the same systems can also be applied to
other domains such as video indexing and retrieval.
Any recognition system can be divided into two main parts:
feature extraction and feature classification [8]. There have
been numerous approaches and attempts of feature extraction
and feature classification in the literature of activity
recognition.
A. Feature extraction
For the feature extraction portion, most approaches in the
literature fall into three major categories: model-based
approaches, model-less approaches and hybrid approaches.
Model-based feature-extraction approaches attempt to
recover structural information of the human body and
construct a kinematical model of human motion. Features
related to relative displacement, velocity and acceleration of
head, torso, limbs and joints are extracted. Therefore, as many
body parts as possible need to be identified and extracted from
each image blob. In the presence of strong noise or occlusion,
however, extracting body parts is not straightforward.
Additionally, the observed kinematics of the
human body might be dependent upon the relative position of
the camera. Boeheim [9] proposes a thinning algorithm
operating on the silhouette to produce a six-segment
representation of the human figure. Limb parameters, such as
distance from torso, and angle of displacement from the
vertical axis, are derived to form a feature vector. Jen-Hui et
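The limb parameters described above can be illustrated with a minimal, hypothetical helper: given a torso reference point and the 2D endpoints of a limb segment, it computes the limb's distance from the torso and its angular displacement from the vertical axis. The function name and coordinates are invented for illustration and are not from [9].

```python
import math

def limb_features(torso, limb_start, limb_end):
    """Return (distance from torso, angle from vertical axis in radians)."""
    # Distance from the torso reference point to the limb's proximal end.
    dist = math.hypot(limb_start[0] - torso[0], limb_start[1] - torso[1])
    # Angle of the limb vector measured from the vertical (image y) axis;
    # atan2(dx, dy) gives 0 rad for a limb pointing straight down the image.
    dx = limb_end[0] - limb_start[0]
    dy = limb_end[1] - limb_start[1]
    angle = math.atan2(dx, dy)
    return dist, angle

# Example: a limb hanging vertically, one unit below the torso point.
d, a = limb_features(torso=(0.0, 0.0), limb_start=(0.0, 1.0), limb_end=(0.0, 2.0))
# d == 1.0, a == 0.0
```

Stacking such (distance, angle) pairs for each of the six segments yields a fixed-length feature vector of the kind the paragraph describes.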
al. [10] present a novel approach to extracting a torso-less model
2010 Sixth International Conference on Intelligent Environments
978-0-7695-4149-5/10 $26.00 © 2010 IEEE
DOI 10.1109/IE.2010.12
24