Stacked Denoising Autoencoder for Feature Representation Learning in Pose-Based Action Recognition

Arif Budiman*, Mohamad Ivan Fanany*, Chan Basaruddin*
*Faculty of Computer Science, University of Indonesia, Depok, West Java, Indonesia
Email: arif.budiman21@ui.ac.id, {ivan, chan}@cs.ui.ac.id

Abstract—In this paper, we studied the Stacked Denoising Autoencoder (SDA) model for human pose-based action recognition. We used the public Chalearn 2013 dataset, which contains Italian body language actions from 27 persons. We studied two SDA models for pose clustering: 1) a traditional SDA trained over epochs with a neural network supervised classifier, and 2) a marginalized SDA, which is faster, with an ELM supervised classifier. We trained the supervised classifier using initial cluster data from K-means. We deployed global tuning that updates the weights during iterative learning.

I. INTRODUCTION

Continuous human action recognition is one of the most challenging tasks in machine learning, since the combinations of which, when, and how the human skeleton joints move are largely unknown. The joint movements are composed sequentially of multiple atomic actions, called human poses, each a snapshot of the joint positions. The action classifier maps the streaming pose data to an action class label that has semantic meaning, e.g., run, walk, or raise hand.

The pose-based approach reduces the complexity of general continuous human action recognition. In pose-based human action recognition, the feature representation is organized hierarchically, from skeleton joints to poses and then from poses to actions. Pose-based features have outperformed non-pose-based (appearance) features [1] and are commonly used with motion capture devices [2], [3]. The pose-based approach needs two levels of learning. The lower level is unsupervised clustering to build the pose feature representation, and the higher level is supervised learning to classify pose features into actions.
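The lower level of this two-level scheme can be sketched as follows. This is only an illustrative sketch, not the pipeline used in the paper: it assumes plain Lloyd's K-means as the clustering step, a hypothetical 20-joint skeleton with (x, y, z) coordinates per frame, and a normalised pose histogram as the feature handed to the higher-level classifier. The names `kmeans` and `pose_histogram` and all the toy data are invented for the example.

```python
import numpy as np

def assign(frames, centroids):
    """Return, for every frame, the index of its nearest centroid."""
    # Squared Euclidean distances, shape (n_frames, k).
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct frames chosen at random.
    centroids = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(frames, centroids)
        for j in range(k):
            members = frames[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, assign(frames, centroids)

def pose_histogram(labels, k):
    """Assumed higher-level feature: fraction of frames assigned to each pose."""
    counts = np.bincount(labels, minlength=k).astype(float)
    return counts / counts.sum()

# Toy data: 200 frames, 20 hypothetical joints x 3 coordinates = 60 values,
# drawn around two well-separated synthetic "poses".
rng = np.random.default_rng(1)
frames = np.vstack([
    rng.normal(0.0, 0.1, size=(100, 60)),
    rng.normal(1.0, 0.1, size=(100, 60)),
])

centroids, labels = kmeans(frames, k=2)   # lower level: frames -> pose ids
feature = pose_histogram(labels, k=2)     # input for a supervised classifier
```

The pose ids in `labels` play the role of the pose feature representation; any supervised classifier could then be trained on features such as `feature` to predict action labels.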
A widely used method for basic clustering is K-means. The aim of clustering is to discover the natural groupings of a set of patterns or objects in an unsupervised manner [4]. However, when the number of poses grows exponentially, K-means clustering is no longer effective. K-means has several limitations. Its computational complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. K-means is sensitive to the initial clustering conditions (which can lead to empty clusters with no members), and it has problems when clusters differ in size or density, have non-globular shapes, or contain outliers [5].

To deal with these difficulties, we studied the concept of deep learning to improve cluster analysis for learning feature representations. Deep learning is based on distributed representations, with the underlying assumption that observed data is generated by the interactions of many factors at different levels of abstraction or composition. It has been reinvented by the neural network community and has recently gained popularity as a way of learning deep, hierarchical artificial neural networks [6], [7]. Human action recognition has also benefited from deep learning. Baccouche proposed an extension of 2D convolutional neural networks to 3D that automatically learns spatio-temporal features for non-pose-based human action recognition on the KTH video data set [8].

One unsupervised method is the Stacked Denoising Autoencoder (SDA) introduced by Vincent [9]. The autoencoder learns a distributed representation (encoding) of a set of data and then reconstructs the data from that encoding (decoding). The resulting code is a compact or sparse representation that serves as input for the next autoencoder or for another machine learning method. The autoencoder is useful for dimensionality reduction and cluster analysis.
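The encode/decode idea can be illustrated with a single denoising autoencoder layer. The sketch below is a minimal example under stated assumptions, not the configuration studied in this paper: it assumes masking noise, a sigmoid encoder, a linear decoder with tied weights, squared reconstruction error, and full-batch gradient descent. The names `train_dae` and `encode`, the hyper-parameters, and the toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X, W, b):
    """Map inputs to the hidden representation (the 'encoding')."""
    return sigmoid(X @ W + b)

def train_dae(X, n_hidden, noise=0.3, lr=0.1, epochs=200, seed=0):
    """Train one denoising autoencoder layer; returns (W, b, c, losses)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, size=(d, n_hidden))  # tied weights: decoder uses W.T
    b = np.zeros(n_hidden)                        # encoder bias
    c = np.zeros(d)                               # decoder bias
    losses = []
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction of its entries.
        mask = rng.random(X.shape) > noise
        X_noisy = X * mask
        H = encode(X_noisy, W, b)                 # encoding of the corrupted input
        X_hat = H @ W.T + c                       # decoding (reconstruction)
        err = X_hat - X                           # compared with the CLEAN input
        losses.append(float((err ** 2).mean()))
        # Gradients of 0.5 * sum(err ** 2) / n.
        dZ = (err @ W) * H * (1.0 - H)            # back-propagate through the sigmoid
        grad_W = (X_noisy.T @ dZ + err.T @ H) / n # encoder part + decoder part
        W -= lr * grad_W
        b -= lr * dZ.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b, c, losses

# Toy data: 8-dimensional points generated from 2 underlying factors, scaled to [0, 1].
rng = np.random.default_rng(1)
X = rng.random((200, 2)) @ rng.random((2, 8))
X = X / X.max()

W, b, c, losses = train_dae(X, n_hidden=3)
codes = encode(X, W, b)   # compact representation for a next layer or a classifier
```

Stacking would repeat the same procedure on `codes` to train the next layer; the final codes could then be passed to a clustering step or a supervised classifier.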
However, based on experiments on the NORB and CIFAR data sets using sparse autoencoders, sparse RBMs, K-means clustering, and Gaussian mixture models, Coates et al. reported that the best results were achieved using K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement [10].

In this paper, we studied the Stacked Denoising Autoencoder (SDA) model for human pose-based action recognition. We used the public Chalearn 2013 dataset [11], which contains Italian body language actions from 27 persons. We studied two SDA models for pose clustering: 1) a traditional SDA trained over epochs with a neural network supervised classifier [12], and 2) a marginalized SDA, which is faster [13], with an ELM supervised classifier. Unlike [14], we trained the supervised classifier using initial cluster data from K-means. We deployed global tuning that updates the weights during iterative learning.

II. THE CONCEPT OF STACKED DENOISING AUTOENCODER

The autoencoder is a feedforward, non-recurrent neural network (multilayer perceptron) with an input layer, an output layer, and one or more hidden layers. The output layer has as many nodes as the input layer. The autoencoder is trained to reconstruct its own inputs. A denoising autoencoder (DAE), an extension of the autoencoder, is trained to reconstruct a clean ’re-

2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE) 978-1-4799-05145-1/14/$31.00 ©2014 IEEE 684