Stacked Denoising Autoencoder for Feature
Representation Learning in Pose-Based Action
Recognition
Arif Budiman*, Mohamad Ivan Fanany*, Chan Basaruddin*
*Faculty of Computer Science, University of Indonesia
Depok, West Java, Indonesia
Email: arif.budiman21@ui.ac.id, {ivan, chan}@cs.ui.ac.id
Abstract—In this paper, we studied the Stacked Denoising
Autoencoder (SDA) model for human pose-based action recognition.
We used the public Chalearn 2013 dataset, which contains Italian
body language actions from 27 persons. We studied two SDA
models for pose clustering: 1) a traditional SDA trained over
epochs with a neural network supervised classifier, and 2) a
marginalized SDA, which is faster, with an ELM supervised
classifier. We trained the supervised classifier using initial
cluster data from K-means. We deployed global tuning, which
updates the weights during iterative learning.
I. INTRODUCTION
Continuous human action recognition is one of the most
challenging tasks in machine learning, since the combinations
of which, when, and how the human skeleton joints move are
largely unknown. The joint movements are composed sequentially
of multiple atomic actions, called human poses, which are
snapshots of the joint positions. The action classifier maps the
streaming pose data to an action class label that has a
semantic meaning, e.g., run, walk, or raise hand.
The pose-based approach to human action recognition reduces
the complexity of general continuous human action
recognition. In pose-based human action recognition, the
feature representation is organized hierarchically, from
skeleton joints to poses and then from poses to actions.
Pose-based features have outperformed non-pose-based
(appearance) features [1] and are commonly used with motion
capture devices [2], [3]. The pose-based approach needs two
levels of learning. The lower level is unsupervised clustering
to build the pose feature representation, and the higher level
is supervised learning to classify pose features into actions.
A widely used method for basic clustering is K-means. The
aim of clustering is to discover the natural groupings of a set
of patterns or objects in an unsupervised manner [4]. However,
when the number of poses grows exponentially, K-means
clustering is no longer effective. K-means has several
limitations. Its computational complexity is O(n ∗ K ∗ I ∗ d),
where n = number of points, K = number of clusters, I = number
of iterations, and d = number of attributes. K-means is
sensitive to the initial clustering conditions (which can leave
clusters empty, with no members), and it has problems when
clusters have differing sizes, densities, or non-globular
shapes, and when the data contain outliers [5].
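To make the O(n ∗ K ∗ I ∗ d) cost concrete, the following is a minimal sketch of plain K-means (Lloyd's algorithm) in Python with NumPy. It is an illustration only, not the implementation used in this paper; the function name kmeans, the random initialization from data points, and the toy 2-D "poses" are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, iterations, seed=0):
    """Plain Lloyd's algorithm (illustrative sketch).

    points: (n, d) array. Each of the I iterations computes n * K
    distances over d attributes, hence the O(n * K * I * d) cost.
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    # Assumption: initialise centroids from k distinct data points.
    centroids = points[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(iterations):
        # Assignment step: squared distance from every point to every centroid.
        diffs = points[:, None, :] - centroids[None, :, :]
        dists = (diffs ** 2).sum(axis=2)          # shape (n, k)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:                  # an empty cluster keeps its centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two well-separated groups of 2-D points (hypothetical "poses").
poses = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(poses, k=2, iterations=10)
```

The guard for empty clusters shows where the sensitivity to initial conditions appears in practice: a centroid that attracts no points is never updated.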
To deal with these difficulties, we studied the concept
of deep learning to improve cluster analysis for learning
feature representations. Deep learning is based on distributed
representations, with the underlying assumption that observed
data are generated by the interactions of many factors at
different levels of abstraction or composition. It has been
revived by the neural network community and has recently
gained popularity as a way of learning deep, hierarchical
artificial neural networks [6], [7]. Human action recognition
has also benefited from deep learning. Baccouche et al.
proposed an extension of 2D convolutional neural networks to
3D that automatically learns spatio-temporal features for
non-pose-based human action recognition on the KTH video data
set [8].
One such unsupervised method is the Stacked Denoising
Autoencoder (SDA) introduced by Vincent et al. [9]. An
autoencoder learns a distributed representation (encoding) of a
set of data and then reconstructs the data from that
representation (decoding). The encoded output is a compact or
sparse representation that serves as the input to the next
autoencoder or to another machine learning method. Autoencoders
are useful for dimensionality reduction and cluster analysis.
However, based on experiments on the NORB and CIFAR data sets
using sparse autoencoders, sparse RBMs, K-means clustering, and
Gaussian mixture models, Coates et al. reported that the best
results were achieved using K-means clustering, which is
extremely fast, has no hyper-parameters to tune beyond the
model structure itself, and is very easy to implement [10].
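The encoding and decoding steps of an autoencoder described above can be sketched as follows. This is a minimal single-hidden-layer example in Python with NumPy, assuming a sigmoid encoder, a linear decoder, a squared reconstruction error, and plain gradient descent; these choices, the layer sizes, and all function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_in, n_hidden)),   # encoder weights
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, n_in)),   # decoder weights
        "b2": np.zeros(n_in),
    }

def encode(params, X):
    """Map inputs (n, n_in) to the hidden representation (n, n_hidden)."""
    return sigmoid(X @ params["W1"] + params["b1"])

def decode(params, H):
    """Reconstruct the inputs from the hidden representation."""
    return H @ params["W2"] + params["b2"]

def reconstruction_loss(params, X):
    R = decode(params, encode(params, X))
    return 0.5 * np.mean(np.sum((R - X) ** 2, axis=1))

def train_step(params, X, lr=0.1):
    """One full-batch gradient descent step on the reconstruction loss."""
    n = X.shape[0]
    H = encode(params, X)
    R = decode(params, H)
    dR = (R - X) / n                    # gradient w.r.t. the reconstruction
    dH = dR @ params["W2"].T            # back-propagate through the decoder
    dZ = dH * H * (1.0 - H)             # back-propagate through the sigmoid
    params["W2"] -= lr * (H.T @ dR)
    params["b2"] -= lr * dR.sum(axis=0)
    params["W1"] -= lr * (X.T @ dZ)
    params["b1"] -= lr * dZ.sum(axis=0)
    return params

# Toy data: 50 samples with 6 attributes, compressed to 3 hidden units.
X = np.random.default_rng(1).random((50, 6))
params = init_params(n_in=6, n_hidden=3)
loss_before = reconstruction_loss(params, X)
for _ in range(200):
    train_step(params, X)
loss_after = reconstruction_loss(params, X)
codes = encode(params, X)   # representation passed to the next layer or classifier
```

In a stacked arrangement, codes would serve as the training input of the next autoencoder.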
In this paper, we studied the Stacked Denoising Autoencoder
(SDA) model for human pose-based action recognition. We
used the public Chalearn 2013 dataset [11], which contains
Italian body language actions from 27 persons. We studied two
SDA models for pose clustering: 1) a traditional SDA trained
over epochs with a neural network supervised classifier [12],
and 2) a marginalized SDA, which is faster [13], with an ELM
supervised classifier. Unlike [14], we trained the supervised
classifier using initial cluster data from K-means. We deployed
global tuning, which updates the weights during iterative
learning.
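One reason a marginalized SDA can be faster than an epoch-trained SDA is that each layer can be computed in closed form rather than by iterative weight updates. The sketch below follows the closed-form solution commonly given for marginalized denoising autoencoders, written in Python with NumPy. The corruption probability p, the small ridge term, the tanh nonlinearity, and the function names mda_layer and msda are assumptions for illustration; the configuration used in this paper may differ.

```python
import numpy as np

def mda_layer(X, p, ridge=1e-5):
    """One marginalized denoising layer, solved in closed form.

    X: (d, n) array of features x samples.
    p: probability that a feature is zeroed by the (marginalized) noise.
    Returns the mapping W of shape (d, d + 1) and the output tanh(W [X; 1]).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias row
    q = np.full(d + 1, 1.0 - p)            # survival probability per feature
    q[-1] = 1.0                            # the bias row is never corrupted
    S = Xb @ Xb.T                          # scatter matrix of the clean data
    Q = S * np.outer(q, q)                 # expected scatter of corrupted data
    np.fill_diagonal(Q, q * np.diag(S))
    P = S[:d, :] * q[None, :]              # expected clean/corrupted cross term
    # W = P Q^{-1}, computed with a linear solve and a small ridge for stability.
    W = np.linalg.solve((Q + ridge * np.eye(d + 1)).T, P.T).T
    return W, np.tanh(W @ Xb)

def msda(X, p, n_layers):
    """Stack layers: the output of each layer is the input of the next."""
    weights, H = [], X
    for _ in range(n_layers):
        W, H = mda_layer(H, p)
        weights.append(W)
    return weights, H

# Toy data: 4 attributes, 50 samples (hypothetical pose features).
X = np.random.default_rng(0).normal(size=(4, 50))
weights, H = msda(X, p=0.3, n_layers=2)
```

No epochs appear anywhere in this computation: each layer costs one matrix product and one linear solve, in contrast to the gradient-based training of a traditional SDA.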
II. THE CONCEPT OF STACKED DENOISING
AUTOENCODER
The autoencoder is a feedforward, non-recurrent neural
network (multilayer perceptron) with an input layer, an output
layer, and one or more hidden layers. The output layer has
as many nodes as the input layer. An autoencoder is trained
to reconstruct its own inputs. A denoising autoencoder (DAE)
as extension of autoencoder is trained to reconstruct a clean ’re-
2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE)
978-1-4799-05145-1/14/$31.00 ©2014 IEEE 684