Using Discrete Cosine Transform Based Features
for Human Action Recognition
Tasweer Ahmad and Junaid Rafique
Electrical Engineering Department, Government College University, Lahore, Pakistan
Email: tasveer.ahmad@gcu.edu.pk, junaidumtee70@gmail.com
Hassam Muazzam
Electrical Engineering Department, University of Punjab, Lahore, Pakistan
Email: hassammuazzam@hotmail.com
Tahir Rizvi
Dipartimento di Automatica e Informatica, Politecnico di Torino, Turin, Italy
Email: Syed.rizvi@polito.it
Abstract—Recognizing human action in complex video
sequences has always been challenging for researchers due
to articulated movements, occlusion, background clutter,
and illumination variation. Human action recognition has a
wide range of applications in surveillance, human computer
interaction, video indexing and video annotation. In this
paper, discrete cosine transform based features are
exploited for action recognition. First, a motion history image
is computed for a sequence of images, and then a block-based
truncated discrete cosine transform is computed for
the motion history image. Finally, a K-Nearest Neighbor (K-NN)
classifier is used for classification. This technique exhibits
promising results on the KTH and Weizmann datasets.
Moreover, the proposed model appears to be
computationally efficient and immune to illumination
variations; however, it is sensitive to viewpoint
variations.
Index Terms—motion history image, discrete cosine
transform, K-nearest neighbor, human computer
interaction, video indexing, video annotation
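As an illustration only, the pipeline summarized in the abstract (motion history image, block-based truncated DCT, K-NN) can be sketched as follows. The frame-differencing threshold, the decay length tau, the 8 x 8 block size, the 3 x 3 set of retained low-frequency coefficients, the Euclidean distance, and all function names are assumptions made for this sketch, not settings reported by the authors. The clips are synthetic toy data, not KTH or Weizmann sequences.

```python
from collections import Counter

import numpy as np


def motion_history_image(frames, tau=None, diff_threshold=30):
    """Build an MHI from >= 2 grayscale frames via thresholded frame differencing."""
    frames = np.asarray(frames, dtype=np.float64)
    tau = len(frames) - 1 if tau is None else tau  # assumed decay length
    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > diff_threshold
        # Moving pixels are reset to tau; all others decay by one, floored at 0.
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi / tau  # scale to [0, 1]


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)
    return basis


def block_dct_features(image, block=8, keep=3):
    """Concatenate the keep x keep low-frequency DCT coefficients of each block."""
    basis = dct_matrix(block)
    rows = image.shape[0] - image.shape[0] % block  # drop partial blocks
    cols = image.shape[1] - image.shape[1] % block
    features = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            coeffs = basis @ image[r:r + block, c:c + block] @ basis.T  # 2-D DCT
            features.append(coeffs[:keep, :keep].ravel())  # truncation step
    return np.concatenate(features)


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    distances = np.linalg.norm(np.asarray(train_features) - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def synthetic_clip(direction, start=4, n_frames=8, size=32):
    """Toy clip (hypothetical data): a bright 6 x 6 square moving right or down."""
    frames = np.zeros((n_frames, size, size))
    for t in range(n_frames):
        r, c = (8, start + 2 * t) if direction == "right" else (start + 2 * t, 8)
        frames[t, r:r + 6, c:c + 6] = 255
    return frames


labels = ["right", "right", "down", "down"]
train = [block_dct_features(motion_history_image(synthetic_clip(d, s)))
         for d, s in zip(labels, (2, 6, 2, 6))]
mhi = motion_history_image(synthetic_clip("right", start=4))
features = block_dct_features(mhi)
prediction = knn_predict(train, labels, features, k=3)
print(prediction)
```

Keeping only the low-frequency coefficients of each block shortens the feature vector considerably (here 144 values for a 32 x 32 image instead of 1024 pixels), which is one plausible reading of why the approach is described as computationally efficient.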
I. INTRODUCTION
The task of human action recognition has been
challenging and fascinating for computer vision
scientists and researchers over the last two decades.
Human action recognition has found numerous
applications in video surveillance, motion tracking, scene
modelling and behavior understanding [1]. Intelligent and
effective human action recognition has received a lot of
attention and funding due to rapidly increasing security
concerns and the need for effective surveillance of public
places such as airports, bus stations, railway stations and
shopping malls [1]. Human action recognition systems can
also be deployed at health-care centers, day-care centers,
and old-age homes for monitoring and fall detection. Human
Computer Interaction (HCI) using action recognition
finds ample applications in interactive and gaming
environments [2]. R. T. Collins et al. [3] suggested in 2000
that video surveillance can be broadly
categorized into human detection and tracking, human
motion analysis, and activity recognition. At that time,
they further suggested that “...activity analysis will be the
most important area of future research in video
surveillance.” This projection now seems accurate, as a large
number of research articles have been published in this
domain over the last decade. Although surveillance
cameras and monitoring systems are quite prevalent and
affordable, it is still very challenging to devise a
robust surveillance system because of human factors like
fatigue and boredom.
Manuscript received February 3, 2015; revised August 25, 2015.
It is highly desirable to devise an intelligent
system that can recognize common human actions with
high accuracy, multi-scale resolution and minimal
computational complexity. Considerable effort has been
made by computer vision researchers to overcome these
challenges. The survey in [4] highlights the importance and
applications of Intelligent Video Systems and Analytics
(IVA), covering both system analytics and
theoretical analytics. Video system
hardware is being developed at a faster rate owing to digital
signal processors and VLSI design, but hardware-oriented
issues of system scalability, compatibility and
real-time performance remain unresolved [5]. Theoretical
analytics deals with more robust and computationally
efficient algorithms.
Another breakthrough in human action
recognition came with the introduction of multiple cameras
to render multi-view videos for pose estimation and
activity recognition. The performance of such systems
improved drastically when videos were acquired from
multiple cameras [6]. The price paid for multi-channel
video was computational complexity; there must
be a compromise between the performance and the
complexity of the system.
Nowadays, infrared (IR) sensor based monocular
cameras are widely used for video gaming and human
Journal of Image and Graphics, Vol. 3, No. 2, December 2015
©2015 Journal of Image and Graphics 96
doi: 10.18178/joig.3.2.96-101