Arm Gesture Recognition using a Convolutional Neural Network

Eirini Mathe 1, Alexandros Mitsou 3, Evaggelos Spyrou 1,2,3 and Phivos Mylonas 4
1 Institute of Informatics and Telecommunications, National Center for Scientific Research – “Demokritos,” Athens, Greece
2 Department of Computer Engineering, Technological Education Institute of Sterea Ellada, Lamia, Greece
3 Department of Computer Science, University of Thessaly, Lamia, Greece
4 Department of Informatics, Ionian University, Corfu, Greece
email: emathe@iit.demokritos.gr, amitsou95@gmail.com, espyrou@iit.demokritos.gr, fmylonas@ionio.gr

Abstract—In this paper we present an approach to arm gesture recognition that uses a Convolutional Neural Network (CNN) trained on Discrete Fourier Transform (DFT) images derived from raw sensor readings. More specifically, we use the Kinect RGB and depth cameras to capture the 3D positions of a set of skeletal joints. From each joint we create a signal for each 3D coordinate and concatenate those signals into an image, whose DFT is used to describe the gesture. We evaluate our approach on a dataset of arm gestures involving either one or both hands simultaneously, and we compare it to a previous approach that uses hand-crafted features.

I. INTRODUCTION

Poses and gestures are among the basic means of communication between humans, and they may also play a crucial role in human-computer interaction, as they are able to convey meaning. The research area of pose and gesture recognition aims at recognizing such expressions, which typically involve some posture and/or motion of the hands, arms, head, or even the skeletal joints of the whole body. In certain cases, the meaning may differ based on an accompanying facial expression.
Several application areas may benefit from the recognition of a human’s pose or the gestures she/he performs, such as sign language recognition, gaming, medical applications involving the assessment of a human’s condition, and even navigation in virtual reality environments. There exist various approaches and techniques which involve some kind of sensor, either “worn” by the subject (e.g., accelerometers, gyroscopes, etc.) or monitoring the subject’s motion (e.g., cameras). In the latter case, the subject may also wear special “markers” which assist in the identification of several body parts and/or skeletal joints. During the last few years, however, several approaches have relied solely on a typical RGB camera enhanced by depth information. One such example is the well-known Kinect sensor (https://developer.microsoft.com/en-us/windows/kinect). The user simply needs to stand in front of the camera without wearing any kind of external equipment. Several parts of her/his body are continuously detected and tracked in 3D space. Typically, features are extracted and used to train models that recognize poses and/or gestures.

In this paper, we present a gesture recognition approach that focuses on arm gestures. By “arm gesture” we mean any gesture that involves the palms, wrists, shoulders and/or elbows. We propose a novel deep learning architecture that uses a Convolutional Neural Network (CNN). More specifically, we use the Kinect sensor and its Software Development Kit (SDK) to detect and track the subject’s skeletal joints in 3D space. We then select a subset of these joints, i.e., all joints that are involved in any of the gestures of our dataset. From their 3D coordinates we create an artificial image, apply the Discrete Fourier Transform to it, and use the resulting images to train the CNN.
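The image-construction step described above can be illustrated with a minimal NumPy sketch. This is our own reconstruction of the general idea, not the paper’s exact implementation: the function name, the joint array layout, and the log-scaling of the spectrum are assumptions made for illustration.

```python
import numpy as np

def gesture_to_dft_image(joint_positions):
    """Concatenate the per-joint x/y/z signals into a 2D array and
    return the log-magnitude of its 2D DFT (a sketch of the idea;
    layout and scaling are illustrative assumptions).

    joint_positions: array of shape (num_frames, num_joints, 3)
    holding the tracked 3D coordinates of the skeletal joints.
    """
    num_frames, num_joints, _ = joint_positions.shape
    # Each column of `image` is one coordinate signal over time,
    # so the gesture becomes an artificial (frames x 3*joints) image.
    image = joint_positions.reshape(num_frames, num_joints * 3)
    # 2D DFT of the artificial image; shift the zero frequency to
    # the centre and log-scale the magnitude for numerical range.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log1p(np.abs(spectrum))

# Example: a synthetic 60-frame gesture over 8 tracked joints.
dft_img = gesture_to_dft_image(np.random.rand(60, 8, 3))
print(dft_img.shape)  # (60, 24)
```

The resulting fixed-size spectral image can then be fed directly to a CNN, avoiding any hand-crafted feature extraction.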
We compare our approach with previous work [19], where a set of hand-crafted statistical features on joint trajectories had been used, and demonstrate that it is possible to efficiently recognize arm gestures without a hand-crafted feature extraction step. Evaluation takes place on a new dataset of 10 arm gestures.

The rest of this paper is organized as follows: Section II presents related research in the area of gesture recognition using skeletal data, focusing on works based on deep learning. Section III presents the Kinect sensor and its SDK, which was used for the extraction of raw data, the methodology for converting those data to the input image, and the proposed CNN used for arm gesture recognition. The dataset and the experimental results are presented in Section IV. Finally, in Section V, we draw our conclusions and discuss plans for future work.

II. RELATED WORK

The problem of arm gesture recognition has attracted many research efforts during the last decade. In this section our goal is to present works that aim to recognize simple arm gestures, i.e., gestures such as the ones we aim to recognize in the context of this work. We present both approaches that use traditional machine learning techniques and approaches that are based on deep learning.

Feature-based works typically rely on traditional machine learning approaches such as artificial neural networks (ANNs), support vector machines (SVMs), decision trees (DTs) or K-nearest neighbor (KNN) classifiers. In [3], SVMs and DTs are trained on 3D skeletal joint coordinates, while in [10] distances with reference to the Spine Center joint are used as features with a KNN classifier. In the approach of [13], cascades of ANNs are used first to classify the gesture side (left/right) and then to recognize the gesture type. In the work of [14], SVMs are used to recognize distinctive key poses, while

978-1-5386-8225-8/18/$31.00 © 2018 IEEE