Reaching development through visuo-proprioceptive-tactile integration on a humanoid robot – a deep learning approach

Phuong D.H. Nguyen 1,2, Matej Hoffmann 3, Ugo Pattacini 1, Giorgio Metta 1

1 iCub Facility, Istituto Italiano di Tecnologia, Genova, Italy {phuong.nguyen, ugo.pattacini, giorgio.metta}@iit.it
2 Knowledge Technology Institute, Department of Informatics, Universität Hamburg, Germany pnguyen@informatik.uni-hamburg.de
3 Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic matej.hoffmann@fel.cvut.cz

Abstract— The development of reaching in infants has been studied for nearly nine decades. Originally, it was thought that early reaching is visually guided, but more recent evidence points to “visually elicited” reaching, i.e., the infant gazes at the object rather than at its hand during the reaching movement. The importance of haptic feedback has also been emphasized. Inspired by these findings, in this work we use the simulated iCub humanoid robot to construct a model of reaching development. The robot is presented with different objects, gazes at them, and performs motor babbling with one of its arms. Successful contacts with the object are detected through tactile sensors on the hand and forearm. Such events serve as the training set, consisting of images from the robot’s two eyes, head joints, tactile activations, and arm joints. A deep neural network is trained with the images and head joints as inputs and the arm configuration and touch as outputs. After learning, the network can successfully infer arm configurations that would result in a successful reach, together with a prediction of the tactile activation (i.e., which body part would make contact). Our main contribution is twofold: (i) our pipeline is end-to-end from stereo images and head joints (6 DoF) to arm-torso configurations (10 DoF) and tactile activations, without any preprocessing, explicit coordinate transformations, etc.; (ii) unique to this approach, reaches with multiple effectors corresponding to different regions of the sensitive skin are possible.

I. INTRODUCTION

Infants develop the ability to reach for objects in their visual field mainly during the first year after birth (between 4 and 8 months of age in particular). The mechanisms leading to this capacity have been studied for almost nine decades, with seminal contributions by Jean Piaget (e.g., [1]), Claes von Hofsten (e.g., [2]), and Esther Thelen (e.g., [3]). The prevalent hypothesis for many decades, originating in Piaget’s work, was that of “visually guided” reaching: infants need to look at their hands and the object alternately in order to progressively steer the hand closer to the object location (known in robotics as visual servoing). From the late 1970s, evidence not in line with this hypothesis started to accumulate: von Hofsten [4] recorded where infants looked and did not observe them gazing at their hand. Additional evidence was provided by Clifton et al. [5], who observed infants reaching equally well for objects in the dark. This led to a turn toward the “visually elicited” reaching hypothesis, whereby the infant looks at the target and continues to do so during the reaching movement (Corbetta et al. [6] provide additional support). What the infant needs to learn is thus essentially a mapping between vision and proprioception (or the motor modality). Surveys are provided by Corbetta et al. [7] and in the edited book [8].
In addition, infants also engage in haptic exploration [9], and tactile feedback when the object is successfully contacted may provide key reinforcement of the infant’s learning [10]. Infant exploration may initially be random—what has been dubbed motor babbling—and progressively become more systematic.

Inspired by the findings reviewed above, and in order to gain insight into the mechanisms of early reaching development, we use a baby humanoid robot with anthropomorphic proportions, two cameras/eyes in a biomimetic arrangement, proprioception in the form of joint encoders, and artificial electronic skin covering the whole body: the iCub [11] (in this work, its simulator). We employ the synthetic methodology, or “understanding by building”, which is typical of cognitive developmental robotics [12]–[14].

The robot is presented with different objects, gazes at them, and simultaneously performs motor babbling with one of its arms. Successful contacts with the object are detected through tactile sensors on the hand and forearm. Such events serve as the training samples, consisting of images from the robot’s two eyes, head joints, tactile activations, and arm joints. A deep neural network (DNN) is trained with the images and head joints as inputs, and the arm configuration and touch as outputs. After learning, the network can successfully infer arm configurations that would result in a successful reach, together with a prediction of the tactile activation (i.e., which body part would make contact).

Compared to the majority of existing models of reaching development—which will be reviewed in Section II—this work takes advantage of a number of recent developments in robotics and machine learning that constitute important enablers for the model presented here. First, the advent of robotic skin technologies [15], [16] opens up the possibility of haptic exploration that is, moreover, not restricted to the hand/end-effector only. Second, the development of “deep learning” neural network architectures [17] now makes it possible to tackle complete pipelines from raw input images to motor output, without the need for preprocessing of the images (object segmentation, 3D from stereo), explicit coordinate transformations, etc. In addition, to make training in the simulator more robust and to ease the future transfer to the real robot, we employ the domain randomization technique [18]; both ingredients are sketched below.
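For concreteness, the following is a minimal sketch of the learning problem just described: a network mapping the two eyes’ images and the head joints (6 DoF) to an arm-torso configuration (10 DoF) and a tactile-activation prediction (which skin region, hand or forearm, would make contact). It is written in PyTorch purely for illustration; the framework choice, all layer sizes, and the class and variable names are assumptions made for readability, not the exact architecture used in this work.

```python
# Illustrative sketch only: framework, layer sizes, and names are
# assumptions; the actual architecture used in the paper may differ.
import torch
import torch.nn as nn

class ReachingNet(nn.Module):
    """Stereo images + head joints (6 DoF) -> arm-torso configuration
    (10 DoF) + predicted contact region (e.g., hand vs. forearm)."""

    def __init__(self, n_head=6, n_arm=10, n_skin=2):
        super().__init__()
        # Shared convolutional encoder, applied to each eye's image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),  # -> 32 * 4 * 4 = 512
        )
        # Fuse both eyes' features with the head-joint vector.
        self.fusion = nn.Sequential(
            nn.Linear(2 * 512 + n_head, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.arm_head = nn.Linear(128, n_arm)    # regression: joint angles
        self.skin_head = nn.Linear(128, n_skin)  # classification: contact region

    def forward(self, left_img, right_img, head_joints):
        f = torch.cat([self.encoder(left_img),
                       self.encoder(right_img),
                       head_joints], dim=1)
        h = self.fusion(f)
        return self.arm_head(h), self.skin_head(h)

# Each successful contact yields one training sample: the arm output can be
# trained with a regression loss (e.g., MSE) against the babbled joint
# configuration at the moment of contact, and the skin output with
# cross-entropy against the skin region that actually touched the object.
net = ReachingNet()
arm, skin_logits = net(torch.rand(1, 3, 120, 160),   # left eye
                       torch.rand(1, 3, 120, 160),   # right eye
                       torch.rand(1, 6))             # head joints
```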
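Similarly, the domain randomization technique [18] amounts to resampling nuisance parameters of the simulated scene (appearance, lighting, object pose) at every training episode, so that the learned mapping does not overfit to one particular rendering. A hypothetical sketch, with invented attribute names standing in for a simulator API:

```python
import random

def randomize_scene(sim):
    # `sim` is a stand-in for a simulator handle; the attributes below are
    # hypothetical and only illustrate the kind of parameters randomized.
    sim.object_color = [random.random() for _ in range(3)]   # RGB appearance
    sim.light_intensity = random.uniform(0.5, 1.5)           # scene lighting
    sim.object_position = [random.uniform(-0.1, 0.1)         # pose jitter (m)
                           for _ in range(3)]
```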