Pose-Based 3D Human Motion Analysis Using Extreme Learning Machine

Arif Budiman
Computer Science Faculty, University of Indonesia
West Java, Indonesia 16424
Email: arif.budiman21@ui.ac.id

Mohamad Ivan Fanany
Computer Science Faculty, University of Indonesia
West Java, Indonesia 16424
Email: ivan.fanany@cs.ui.ac.id

Abstract: In pose-based 3D human motion analysis, the main problem is how to classify multi-class activities from primitive action (pose) inputs efficiently in terms of both accuracy and processing time. The difficulty is that a pose is not unique: the same pose can occur anywhere in different activity classes. In this paper, we evaluate the effectiveness of the Extreme Learning Machine (ELM) in 3D human motion analysis based on pose clusters. ELM has a reputation as an eager classifier with fast training and testing times, but its classification result initially has low testing accuracy, even when the number of hidden nodes is increased and more training data are added. To achieve better accuracy, we pursue a feature selection method that reduces the dimension of the pose cluster training data in time sequence. We propose to use the frequency of pose occurrence. This method is similar to bag of words: the features for the training data are a sparse vector of occurrence counts of poses, that is, a histogram (bag of poses). By using bag of poses as the optimum feature selection, the ELM performance can be improved without adding network complexity (number of hidden nodes and amount of training data).

I. INTRODUCTION
Nowadays, the most challenging task in computer-vision-based human motion analysis is to understand and recognize human action at the level of semantic interpretation (high-level vision) rather than intermediate-level vision (human tracking) or low-level vision (human detection) [12]. In semantic terms, action recognition is the task of labeling or classifying a motion as belonging to one of several meaningful action classes by using machine learning algorithms.
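As an illustration of the bag-of-poses feature described in the abstract, the following minimal sketch turns a variable-length sequence of pose-cluster labels into a fixed-length histogram of occurrence counts. The function name `bag_of_poses`, the integer labeling of pose clusters, and the example sequences are assumptions made for illustration only; they are not taken from the experiments reported in this paper.

```python
from typing import List, Sequence


def bag_of_poses(pose_sequence: Sequence[int], num_clusters: int) -> List[int]:
    """Count how often each pose-cluster index occurs in one motion sequence."""
    histogram = [0] * num_clusters
    for pose_id in pose_sequence:
        # Hypothetical convention: clusters are labeled 0 .. num_clusters - 1.
        if not 0 <= pose_id < num_clusters:
            raise ValueError(f"pose id {pose_id} outside 0..{num_clusters - 1}")
        histogram[pose_id] += 1
    return histogram


# Two hypothetical sequences of different lengths over 5 pose clusters.
sequence_a = [0, 0, 3, 3, 3, 1, 0]
sequence_b = [4, 2, 2, 4]

features_a = bag_of_poses(sequence_a, num_clusters=5)
features_b = bag_of_poses(sequence_b, num_clusters=5)
print(features_a)  # [3, 1, 0, 3, 0]
print(features_b)  # [0, 0, 2, 0, 2]
```

Because the histogram length equals the number of pose clusters rather than the number of frames, sequences of different durations map to inputs of the same dimension. The temporal order of the poses is discarded in this representation.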
The classifier should perform efficiently, giving both acceptable accuracy and fast processing time, especially for real-time applications.

Extreme Learning Machine (ELM) has a reputation for "extreme" processing speed among machine learning methods [3]; however, the effectiveness of ELM in human motion analysis is largely unknown. The question is how ELM, an eager classifier with a reputation for fast processing, can deal with human activity classification based on pose clusters with semantic meaning. This problem has a reputation as a hard classification problem because of the characteristics of pose clusters. A pose cluster in a motion activity has two feature characteristics: the pose position in the time sequence (spatiotemporal) and the frequency of pose occurrence in the motion sequence. Neither feature is unique, and both can be distributed anywhere across different activity classes. The distribution of the features depends on the regularity of the motion activity. For example, dance actions can be viewed as more regular than badminton actions, because a dance has stricter rules that the dancer follows during her performance.

In this paper, we evaluate the effectiveness of the ELM classifier on Balinese traditional dance and badminton. The two have different feature characteristics, and it is difficult to find a deterministic function to select the optimal features and the optimal classifier structure.

ELM itself has two major issues that need to be addressed to improve its accuracy [2], [5]:
1) The structure size, that is, the number of hidden nodes. The optimal number is still unknown and is found by trial and error.
2) Whether the computational complexity can be further reduced when a large number of training data is given and a large number of hidden nodes is required.

Our contribution is an effective learning method for ELM on the pose-based 3D human motion classification problem that gives not only processing speed but also acceptable accuracy by using an efficient network structure.

II.
RELATED WORKS
The taxonomy of human actions distinguishes action primitives (called poses), actions, and activities [10]. A pose is a set of features of body part locations and can be interpreted as a meaningful string of symbols that describes the activity. Different activities may share more than one similar pose, occurring anywhere, as determined by pose location and the frequency of pose occurrence.

With Kinect as a motion capture device [7], human motion is constructed from instances of skeleton features in 3D space. An instance of skeleton features constructs a key body pose. A sequence of key poses forms a basic (primitive) motion. Clustering the observations in feature space is the most common approach to identifying key poses, which are then classified using a common classifier such as nearest neighbors or a support vector machine.

Another classifier that has recently attracted attention is the Extreme Learning Machine (ELM). The strong point of ELM is its processing speed, which makes it suitable as a real-time classifier. ELM, first introduced by Huang [1], [9], has some important concepts:
1) ELM is a supervised learning classifier using a matrix based on the target and the corresponding input from the training data.
2) The ELM network architecture uses a single-hidden-layer feedforward network (SLFN), shown in Figure 1. There is no known a priori standard for determining the exact number of hidden nodes (L). Generally, according to