A Visual-Based Gesture Prediction Framework Applied in Social Robots

Bixiao Wu, Junpei Zhong, Senior Member, IEEE, and Chenguang Yang, Senior Member, IEEE

Abstract—In daily life, people use their hands in various ways for most activities, and many applications build on the position, orientation, and joints of the hand, including gesture recognition, gesture prediction, and robotics. This paper proposes a gesture prediction system that uses hand-joint coordinate features collected by the Leap Motion to predict dynamic hand gestures. The model is applied to the NAO robot to verify the effectiveness of the proposed method. First, a Kalman filter is applied to the raw data to reduce the jitter and jumps that arise during data acquisition with the Leap Motion. Then new feature descriptors are introduced: length, angle, and angular velocity features are extracted from the filtered data, and different combinations of these features are fed into a long short-term memory recurrent neural network (LSTM-RNN). Experimental results show that the combination of coordinate, length, and angle features achieves the highest accuracy, 99.31%, while running in real time. Finally, the trained model is applied to the NAO robot to play the finger-guessing game; based on the predicted gesture, the NAO robot can respond in advance.

Index Terms—Finger-guessing game, gesture prediction, human-robot interaction, long short-term memory recurrent neural network (LSTM-RNN), social robot.

I. Introduction

Currently, computers are becoming more and more popular, and the demand for human-robot interaction is increasing. Researchers are paying growing attention to new technologies and methods for human-robot interaction [1]–[3]. Making human-robot interaction as natural as daily human-human interaction is the ultimate goal.
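As a concrete illustration of the preprocessing summarized in the abstract, the sketch below smooths one jittery Leap Motion coordinate stream with a scalar constant-position Kalman filter and computes an angle feature between two bone vectors. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names and the noise parameters q and r are illustrative choices.

```python
import numpy as np

def kalman_smooth(z, q=1e-3, r=1e-2):
    """Smooth a 1-D joint-coordinate stream with a constant-position
    Kalman filter. z: noisy per-frame measurements; q: process noise
    variance; r: measurement noise variance (both assumed values)."""
    x_est = float(z[0])  # initial state estimate
    p = 1.0              # initial estimate covariance
    out = np.empty(len(z), dtype=float)
    for k, zk in enumerate(z):
        p = p + q                      # predict: state assumed constant
        kg = p / (p + r)               # Kalman gain
        x_est = x_est + kg * (zk - x_est)  # update with new measurement
        p = (1.0 - kg) * p
        out[k] = x_est
    return out

def joint_angle(u, v):
    """Angle (radians) between two bone vectors, one example of the
    hand-crafted angle features mentioned in the abstract."""
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))
```

In practice each of the x, y, z channels of every tracked joint would be filtered independently before the length, angle, and angular velocity features are computed per frame.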
Gestures have always been considered an interactive technology that offers computers more natural, creative, and intuitive input. Gestures carry different meanings in different disciplines. In interaction design, the difference between gestures and devices such as a mouse and keyboard is obvious: gestures are more acceptable to people. They are comfortable, less constrained by interactive devices, and can convey more information. Compared with traditional keyboard and mouse control, direct control of the computer by hand movement has the advantage of being natural and intuitive.

Gesture recognition [4] refers to the process of recognizing the representation of dynamic or static gestures and translating it into meaningful instructions. It is an extremely significant research direction in human-robot interaction technology. Methods for realizing gesture recognition can be divided into two types: visual-based [5], [6] and non-visual-based.

The study of non-vision approaches began in the 1970s. Non-vision methods typically rely on wearable devices [7] to track or estimate the orientation and position of the fingers and hand. Gloves are very common devices in this field; they contain sensory modules with a wired interface. The advantage of gloves is that their data do not need to be preprocessed. Nevertheless, they are expensive for virtual reality applications, and their wires make them uncomfortable to wear. With the development of technology, current research on non-visual gesture recognition mainly focuses on EMG signals [8]–[11]. However, EMG signals are strongly affected by noise, which makes them difficult to process.

Vision-based gesture recognition, by contrast, is less intrusive and contributes to a more natural interaction.
It refers to the use of cameras [12]–[16], such as the Kinect [17], [18] and the Leap Motion [19], [20], to capture images of gestures; algorithms then analyze and process the acquired data to extract gesture information, so that the gesture can be recognized. This approach is more natural and easier to use and has become the mainstream way of gesture recognition, although it remains a very challenging problem.

Building on the results of gesture recognition, the performer's subsequent gesture can be predicted. This process is called gesture prediction, and it has even wider applications. In recent years, with the advent of deep learning, many deep neural networks (DNNs) have been applied to gesture prediction. Zhang et al. [21] used an RNN model to predict gestures from raw sEMG signals. Wei et al. [22] combined a 3D convolutional residual network and a bidirectional LSTM network to recognize dynamic gestures. Kumar et al. [23] proposed a multimodal framework based on hand features captured from Kinect and Leap Motion sensors to recognize gestures, using a

Manuscript received April 18, 2021; revised May 23, 2021 and June 5, 2021; accepted June 22, 2021. This work was supported in part by the National Natural Science Foundation of China (NSFC) (U20A20200, 61861136009), in part by the Guangdong Basic and Applied Basic Research Foundation (2019B1515120076, 2020B1515120054), and in part by the Industrial Key Technologies R&D Program of Foshan (2020001006308). Recommended by Associate Editor Hui Yu. (Corresponding author: Chenguang Yang.)

Citation: B. X. Wu, J. P. Zhong, and C. G. Yang, "A visual-based gesture prediction framework applied in social robots," IEEE/CAA J. Autom. Sinica, vol. 9, no. 3, pp. 510–519, Mar. 2022.

B. X. Wu and C. G. Yang are with the College of Automation Science and Engineering, South China University of Technology, Guangzhou 510640, China (e-mail: wubixiao1997@163.com; cyang@ieee.org). J. P.
Zhong is with the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China (e-mail: jonizhong@scut.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JAS.2021.1004243