978-1-5386-2497-5/18/$31.00 ©2018 IEEE
Deep Learning based Machine Vision: first steps
towards a hand gesture recognition set up
for Collaborative Robots
Cristina Nuzzi, Simone Pasinetti, Matteo Lancini, Franco Docchio, Giovanna Sansoni
Department of Mechanical and Industrial Engineering
University of Brescia, Via Branze 38, Brescia, Italy
c.nuzzi@unibs.it
Abstract—In this paper, we present a smart hand gesture
recognition experimental setup for collaborative robots, which
uses a Faster R-CNN object detector to find the accurate position
of the hands in RGB images acquired by a Kinect v2 camera.
We used MATLAB to code the detector and a purposely designed
function for the prediction phase, necessary for detecting static
gestures as we have defined them.
We performed a number of experiments with different datasets
to evaluate the performance of the model in different situations:
a basic hand gesture dataset with four gestures performed by
combinations of both hands, a dataset in which the actors wear
skin-colored clothes while performing the gestures, a dataset in
which the actors wear light-blue gloves, and a dataset similar to
the first one but with the camera placed close to the operator.
The same tests were also conducted with the face of the operator
detected by the algorithm, in order to improve the prediction
accuracy.
Our experiments show that the best model accuracy and F1-score
are achieved by the complete model without face detection.
We tested the model in real time, achieving good performance
that can lead to real-time human-robot interaction, with an
inference time of about 0.2 seconds.
Index Terms—collaborative robots, machine vision, deep learn-
ing, hand gesture recognition, Faster R-CNN, MATLAB
I. INTRODUCTION
Robotic systems are nowadays a fundamental part of
the industrial world. Numerous types of these systems are
available on the market, and innovative new prototypes are
often designed for specific tasks.
The wide use of robotics in the industrial world is justified
by the great workload that a machine can bear and by the
force and precision required by specific operations. This is
why robotic systems are placed in so-called robotic cells,
which keep them separated from human workers in order
to protect the workers from possible injury.
These robots are usually programmed to perform their
movements at high speed and force and, without the help
of information acquired by specific sensors, cannot
guarantee the safety of the operator in every situation [1].
For these reasons, collaborative robots are specifically de-
signed to work safely alongside the operator, without the risk
of accidental injury. According to the standard ISO 10218
[2], there are four types of collaborative operation for robots:
1) Safety Monitored Stop: the robot works independently
most of the time, and occasionally one or more
operators enter its workspace. When this happens, the
operators are detected by appropriate sensors, causing
an almost complete stop of the robot's movements;
2) Hand guiding/Path teaching: the robot is moved man-
ually by the operator in order to learn the path to be
performed and the amount of force to be applied;
3) Speed and Separation Monitoring: in this case the
robot workspace is constantly monitored by vision
systems, which track the position of the operators in
real time and reduce the speed of the robot's movements
according to their position, reaching a complete stop
when they are too close;
4) Power and Force Limiting: this is the type of collabo-
rative robot generally evoked in the collective imagi-
nation. Its force and power are limited, so it can collaborate
actively with operators without needing additional de-
vices. When it experiences an unexpected force overload, it is
programmed to come to a complete and sudden stop. This
type, although the most human-friendly, is not suited for
most industrial operations because of the limited forces
and powers that it can apply. For this reason it is usually
reserved for special applications that do not require
high forces or speeds.
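The Speed and Separation Monitoring mode above can be illustrated with a simple distance-to-speed policy. The sketch below is ours, not from the paper; the function name and the distance thresholds are hypothetical example values, and a real system would derive them from the safety analysis required by the standard:

```python
# Illustrative sketch of a Speed and Separation Monitoring policy:
# scale the robot speed according to the distance of the closest
# tracked operator. Thresholds are arbitrary example values.

def speed_scale(distance_m: float,
                stop_dist: float = 0.5,
                slow_dist: float = 1.5) -> float:
    """Return a speed multiplier in [0, 1] for the given
    closest-operator distance in meters."""
    if distance_m <= stop_dist:
        return 0.0   # operator too close: complete stop
    if distance_m >= slow_dist:
        return 1.0   # operator far away: full speed
    # linear ramp between the stop and slow thresholds
    return (distance_m - stop_dist) / (slow_dist - stop_dist)

print(speed_scale(0.3))  # 0.0 (stop)
print(speed_scale(1.0))  # 0.5 (reduced speed)
print(speed_scale(2.0))  # 1.0 (full speed)
```

The multiplier would be applied to the commanded joint or Cartesian velocities by the robot controller; the vision system only has to supply the closest-operator distance at a sufficient rate.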
Since a collaborative robot by definition works alongside the
human operator, it is necessary to identify effective means
of communication that allow the team to be efficient [3].
In [4] it is suggested to think of a human-human team first,
and to reproduce the naturalness of its interaction in a robot-
human team through voice commands (auditory channel) and
gestures (visual channel).
In this context, Machine Vision plays a central role: it allows
the robot to see the environment and/or focus on its specific
task. This makes the robotic system more flexible and
automated, no longer a rigid and heavily limited system: for
example, it can identify the position of the objects to be picked
up entirely on its own, without needing them to be in a specific,
fixed position [5].
Artificial intelligence has moved forward in parallel and,
among the vast number of available algorithms, Neural
Networks are regaining ground, thanks also to the constant hardware