A HETEROGENEOUS DICTIONARY MODEL FOR REPRESENTATION AND RECOGNITION OF HUMAN ACTIONS Rushil Anirudh ⋆† , Karthikeyan Ramamurthy † , Jayaraman J. Thiagarajan † , Pavan Turaga ⋆† , Andreas Spanias † ⋆ School of Arts, Media & Engineering, † School of Electrical, Computer, & Energy Engineering, Arizona State University, Tempe, AZ. ABSTRACT In this paper, we consider low-dimensional and sparse repre- sentation models for human actions, that are consistent with how actions evolve in high-dimensional feature spaces. We ﬁrst show that human actions can be well approximated by piecewise linear structures in the feature space. Based on this, we propose a new dictionary model that considers each atom in the dictionary to be an afﬁne subspace deﬁned by a point and a corresponding line. When compared to centered clus- tering approaches such as K-means, we show that the pro- posed dictionary is a better generative model for human ac- tions. Furthermore, we demonstrate the utility of this model in efﬁcient representation and recognition of human activities that are not available in the training set. Index Terms— Dictionary learning, Sparse representa- tions, Activity analysis. 1. INTRODUCTION Sparse coding attempts to represent data vectors using a lin- ear combination of a small number of vectors chosen from a ‘dictionary’. The dictionary that leads to an optimal sparse representation can be either predeﬁned or learned from the training samples themselves. It is now well known that the latter can lead to improved representation and recognition re- sults [1, 2]. If the data is truly low-dimensional, sparse cod- ing can effectively identify its low degrees of freedom, and hence sparse models have proved successful in several inverse problems in signal/image processing [1], and computer vision [3]. When compared to classical subspace methods which are efﬁcient only if the data lies in a single low-dimensional subspace, sparse coding can recover data lying in a union of low-dimensional subspaces and hence provide a greater ﬂex- ibility in representation. Traditionally, most sparse coding applications deal with static data such as images, but there have been recent attempts to extend these concepts to videos [4, 5]. To this end, problems of activity analysis have gained lot of attention where typically a dictionary is learned either Corresponding author - ranirudh@asu.edu (Rushil Anirudh) Fig. 1: Here we show the feature evolution of Running, Talk on Phone and Waving. The features are projected to a lower dimen- sional space for visualization. The top ﬁgure shows the three actions on a common coordinate frame. It is seen that these structures can be well approximated by piece-wise linear models. per class of actions or on the entire set of all actions and sparse codes are generated per frame. Most human actions evolve over time where they usually begin with a rest pose and end in an extreme pose. This transition is smooth result- ing in smoothly varying features. The geometric structure of these transitions is not known in general, but attempts have been made to model this structure, e.g. actions have been considered to trace out non-linear manifolds in feature spaces [6]. While such models are quite rich and general, they are accompanied by difﬁculties in learning the model and coding data using the model. However, as shown in ﬁg 1, a simple piecewise linear model is sufﬁcient to represent most com- mon activities such as Waving, Running and Talking on the phone. In addition to the representational simplicity, this also affords solving the sparse-coding problem efﬁciently. In such cases, centered clustering approaches such as K- Means will not be able to effectively model the underlying