978-1-5386-2497-5/18/$31.00 ©2018 IEEE

Deep Learning based Machine Vision: first steps towards a hand gesture recognition set up for Collaborative Robots

Cristina Nuzzi, Simone Pasinetti, Matteo Lancini, Franco Docchio, Giovanna Sansoni
Department of Mechanical and Industrial Engineering, University of Brescia, Via Branze 38, Brescia, Italy
c.nuzzi@unibs.it

Abstract—In this paper we present a smart hand gesture recognition experimental set-up for collaborative robots, which uses a Faster R-CNN object detector to accurately locate the hands in the RGB images acquired by a Kinect v2 camera. We used MATLAB to code the detector and a purposely designed function for the prediction phase, necessary to detect static gestures as we have defined them. We performed a number of experiments with different datasets to evaluate the performance of the model in different situations: a basic hand gesture dataset with four gestures performed by combining both hands; a dataset where the actors wear skin-colored clothes while performing the gestures; a dataset where the actors wear light-blue gloves; and a dataset similar to the first one but with the camera placed close to the operator. The same tests were repeated in a configuration where the algorithm also detects the operator's face, in order to improve the prediction accuracy. Our experiments show that the best model accuracy and F1-score are achieved by the complete model without face detection. We tested the model in real time, achieving performance that can support real-time human-robot interaction, with an inference time of around 0.2 seconds.

Index Terms—collaborative robots, machine vision, deep learning, hand gesture recognition, Faster R-CNN, MATLAB

I. INTRODUCTION

Robotic systems are nowadays a fundamental part of the industrial world.
Numerous types of these systems are available on the market, and new innovative prototypes are often designed for the task at hand. The wide use of robotics in the industrial world is justified by the great workload that a machine can bear and by the force and precision required by the specific operation. This is why robotic systems are placed in so-called robotic cells, which keep them separated from human workers in order to protect the latter from possible harm. These robots are usually programmed to perform their movements at high speed and force and, without the help of information acquired by specific sensors, cannot guarantee the safety of the operator in every situation [1].

For these reasons, collaborative robots are specifically designed to work safely alongside the operator without the risk of hurting him by accident. According to the standard ISO 10218 [2], there are four types of collaborative features for robots:

1) Safety Monitored Stop: the robot works independently most of the time, and occasionally one or more operators enter its workspace. When this happens, the operators are detected by appropriate sensors, causing an almost complete stop of the robot's movements;

2) Hand guiding/Path teaching: the robot is moved manually by the operator in order to learn the path to be performed and the amount of force to be applied;

3) Speed and Separation Monitoring: the robot workspace is constantly monitored by vision systems, which track the position of the operators in real time; the speed of the robot's movements is reduced according to their position, down to a complete stop when they are too close;

4) Power and Force Limiting: this is the type of collaborative robot generally referred to in the collective imagination. Its force and power are limited, so that it can collaborate actively with operators without needing additional devices.
When it detects an unexpected force overload, it is programmed to come to a complete and sudden stop. Although it is the most human-friendly type, it is not suited for most industrial operations because of the limited forces and powers it can apply; for this reason it is usually reserved for special applications that do not require high forces or speeds.

If a collaborative robot by definition works alongside the human operator, it is necessary to identify effective means of communication that allow the team to be efficient [3]. In [4] it is suggested to think of a human-human team first, and to reproduce the naturalness of its interaction in a robot-human team through voice commands (the auditory channel) and gestures (the visual channel).

In this context, Machine Vision plays a central role: it allows the robot to see the environment and/or focus on its specific task. This makes the robotic system more flexible and automated, no longer a rigid and heavily limited one; for example, it can identify the position of the objects to be picked up entirely on its own, without requiring them to be in a specific, fixed position [5]. Artificial intelligence has advanced in parallel and, among the vast number of available algorithms, Neural Networks are regaining ground, also thanks to the constant hardware
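To make the overall approach concrete, the detection-and-prediction pipeline outlined in the abstract — a Faster R-CNN detector that classifies each hand, followed by a rule-based function that maps the pair of detected hand classes to one of the static two-hand gestures — can be sketched as follows. This is a minimal illustration only: the paper's implementation is in MATLAB, and all class names, gesture labels, and thresholds below are assumptions of this sketch, not the labels actually used in the paper.

```python
# Hypothetical sketch of the prediction phase: the object detector is
# assumed to return a list of (class_name, score, (x, y, w, h)) tuples,
# one per detected hand; a rule combines the two hand classes into a
# single static gesture. All names here are illustrative.

def predict_gesture(detections, min_score=0.5):
    """Map the two detected hand classes to a static gesture label."""
    # Keep only confident hand detections (ignoring, e.g., a 'face' class).
    hands = [d for d in detections
             if d[0].startswith("hand_") and d[1] >= min_score]
    if len(hands) != 2:
        return None  # a two-hand gesture needs exactly two detected hands

    # Order the two hands left-to-right by the x coordinate of their box.
    hands.sort(key=lambda d: d[2][0])
    left_cls, right_cls = hands[0][0], hands[1][0]

    # Look up the (left hand, right hand) class pair in a gesture table.
    gesture_table = {
        ("hand_open", "hand_open"): "stop",
        ("hand_fist", "hand_fist"): "go",
        ("hand_open", "hand_fist"): "left",
        ("hand_fist", "hand_open"): "right",
    }
    return gesture_table.get((left_cls, right_cls))
```

In this sketch the gesture is undefined (None) whenever fewer or more than two confident hands are found, which mirrors the idea that a static gesture is only declared when both hands are reliably detected in the frame.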