LEARNING INFORMATIVE PAIRWISE JOINTS WITH ENERGY-BASED TEMPORAL PYRAMID FOR 3D ACTION RECOGNITION

Mengyuan Liu†, Chen Chen‡ and Hong Liu†*
† Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China
‡ Center for Research in Computer Vision, University of Central Florida, USA
liumengyuan@pku.edu.cn  chenchen870713@gmail.com  hongliu@pku.edu.cn

ABSTRACT

This paper presents an effective local spatial-temporal descriptor for action recognition from skeleton sequences. The unique property of our descriptor is that it takes spatial-temporal discrimination and action speed variations into account, aiming to solve the problems of distinguishing similar actions and identifying actions performed at different speeds within one framework. The entire algorithm consists of two stages. First, a frame selection method is used to remove noisy skeletons from a given skeleton sequence. From the selected skeletons, skeleton joints are mapped to a high-dimensional space, where each point refers to the kinematics, time label and joint label of a skeleton joint. To encode relative relationships among joints, pairwise points from this space are then jointly mapped to a new space, where each point encodes the relative relationships of skeleton joints. Second, Fisher Vector (FV) encoding is employed to aggregate all points from the new space into a compact feature representation. To cope with speed variations in actions, an energy-based temporal pyramid is applied to form a multi-temporal FV representation, which is fed into a kernel-based extreme learning machine classifier for recognition. Extensive experiments on benchmark datasets consistently show that our method outperforms state-of-the-art approaches for skeleton-based action recognition.

Index Terms— action recognition, skeleton sequence

This work is supported by the National High Level Talent Special Support Program, the National Natural Science Foundation of China (NSFC, No. 61340046, 61673030, 61672079, U1613209), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130001110011), the Natural Science Foundation of Guangdong Province (No. 2015A030311034), and the Scientific Research Project of Guangdong Province (No. 2015B010919004). Hong Liu* is the corresponding author.

1. INTRODUCTION

Human action recognition plays an important role in applications involving automatic analysis of human actions, such as intelligent surveillance, human-computer interaction and sign language analysis. An intuitive way to analyse human actions is to estimate human poses from 2D images, which faces semantic ambiguities induced by cluttered backgrounds and the loss of depth data [1]. The development of RGB-D cameras, in particular the Kinect [2], opens up opportunities to address these problems [3, 4, 5, 6, 7]. With the ability to capture skeletons from the Kinect in real time [8], recent works focus on skeleton-based action recognition [9, 10].

Efficiently describing the spatial-temporal distribution of skeleton joints remains a key problem for action recognition from skeleton sequences. Xia et al. assigned 3D joints to histograms of 3D joint locations (HOJ3D) by designing a global spherical coordinate system [11]. Vemulapalli et al. explicitly estimated the relative 3D geometry between different body parts using the special Euclidean group SE(3) [12]. Generally, these methods encode the spatial distributions of 3D joints. However, the temporal domain is left unexplored, leading to the loss of motion information and temporal information of skeleton joints. To encode motion information, Yang et al. adopted the differences of joints in the temporal and spatial domains to describe the dynamics of joints [13]. Zanfir et al.
proposed a non-parametric Moving Pose (MP) framework [14], which considers features including the absolute position, speed and acceleration of each joint. To leverage temporal information, Evangelidis et al. divided a skeleton sequence into equal segments, from which skeletal quads features were extracted [15]. Hussein et al. enhanced the temporal pyramid structure by dividing a skeleton sequence into equal, overlapping segments [16]. Recently, Liu et al. [17] applied a Long Short-Term Memory (LSTM) network with trust gates to learn the spatial-temporal information of joints.

However, these methods still lack the spatial-temporal discrimination of skeleton joints needed to distinguish similar actions, and they suffer from the effect of action speed variations. To solve these problems, we propose a discriminative local spatial-temporal descriptor for skeleton joints and design an energy-based temporal pyramid to alleviate the problem of speed variations. The pipeline of extracting a representation from a skeleton sequence is shown in Figure 1. Specifically, we extend the concept of global skeleton kinematics in [14] and calculate joint kinematics to describe the local dynamics of joints. Besides, our local descriptor provides a complete description of joints, including a temporal cue, i.e., the time label, a spatial cue,

978-1-5090-6067-2/17/$31.00 © 2017 IEEE
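As a minimal sketch of the joint kinematics referred to above, the snippet below computes the position, speed and acceleration of a single joint by central finite differences over its 3D trajectory, in the spirit of the Moving Pose framework [14]. The function name, array shapes and the simple concatenated output layout are our own assumptions for illustration, not the paper's exact descriptor.

```python
import numpy as np

def joint_kinematics(positions):
    """Finite-difference kinematics for one skeleton joint.

    positions: (T, 3) array of 3D joint coordinates over T frames.
    Returns a (T-2, 9) array holding [position, speed, acceleration]
    for each interior frame, where speed and acceleration are the
    central first and second temporal differences of the trajectory.
    """
    p = np.asarray(positions, dtype=float)
    pos = p[1:-1]                           # x(t)
    speed = (p[2:] - p[:-2]) / 2.0          # dx(t)  ~ (x(t+1) - x(t-1)) / 2
    accel = p[2:] + p[:-2] - 2.0 * p[1:-1]  # ddx(t) ~ x(t+1) + x(t-1) - 2 x(t)
    return np.concatenate([pos, speed, accel], axis=1)
```

In this sketch a joint moving at constant velocity yields a constant speed component and zero acceleration, so the descriptor separates uniform motion from accelerated motion regardless of absolute position.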