Multi-class RGB-D Object Detection and Semantic Segmentation for Autonomous Manipulation in Clutter

Max Schwarz¹, Anton Milan², Arul Selvam Periyasamy¹, and Sven Behnke¹

¹University of Bonn  ²University of Adelaide

Corresponding author: Max Schwarz, University of Bonn, Computer Science Institute VI, Friedrich-Ebert-Allee 144, 53113 Bonn, Germany. Email: max.schwarz@uni-bonn.de

Abstract

Autonomous robotic manipulation in clutter is challenging. A large variety of objects must be perceived in complex scenes, where they are partially occluded and embedded among many distractors. Varying lighting conditions and spatial restrictions add to the difficulty. To tackle these challenges, we developed a deep-learning approach that combines object detection and semantic segmentation. The manipulation scenes are captured with RGB-D cameras, for which we developed a depth fusion method. Employing pretrained features makes learning from small annotated robotic data sets possible. We evaluate our approach on two challenging data sets: one captured for the Amazon Picking Challenge 2016, where our team NimbRo came in second in the Stowing and third in the Picking task, and one captured in disaster-response scenarios. The experiments show that object detection and semantic segmentation complement each other and can be combined to yield reliable object perception.

Keywords

Deep learning, object perception, RGB-D camera, transfer learning, object detection, semantic segmentation

1 Introduction

Robots appear more and more frequently in our surroundings. They are typically deployed to assist humans in a variety of tasks, including industrial manufacturing, personal assistance, and exploration. Reliable perception of the environment plays a vital role in accomplishing these tasks. Different tasks may require different levels of cognition. In some cases, it may be sufficient to classify certain structures and objects as obstacles in order to avoid them. In others, a more fine-grained recognition is necessary, for instance to determine whether a specific object is present in the scene. For more sophisticated interaction, such as grasping and manipulating real-world objects, a more precise scene understanding, including object detection and pixel-wise semantic segmentation, is essential.

Over the past few years, research in all these domains has shown remarkable progress. This success is largely due to the rapid development of deep learning techniques that allow for end-to-end learning from examples, without the need for designing handcrafted features or introducing complex priors. Somewhat surprisingly, there are not many working examples to date that employ deep models in real-time robotic systems.

In this paper, we first demonstrate the application of deep learning methods to the task of bin picking for warehouse automation. This particular problem setting has unique properties: while the surrounding environment is usually very structured (boxes, pallets, and shelves), the sheer number and diversity of objects that need to be recognized and manipulated, as well as their chaotic arrangement and spatial restrictions, pose daunting challenges. In addition to bin picking, we also validate our approach in disaster-response scenarios. In contrast to bin picking, this setting is much less structured, with highly varying, cluttered backgrounds and more diverse lighting conditions. These scenes may also include many unknown objects.

Despite their remarkable success, deep learning methods exhibit at least one major limitation.
Due to their large number of parameters, they typically require vast amounts of hand-labeled training data to learn the features required for solving the task. To overcome this limitation, we follow the transfer learning approach: features pretrained on large generic data sets are reused, so that only a small annotated robotic data set is needed to adapt the models to the task at hand.
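To illustrate this strategy, the following is a minimal sketch of feature-level transfer learning in a PyTorch-style setup. It is not the implementation used in this work; the backbone choice, the frozen layers, and the class count are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision

# Load a backbone pretrained on a large generic data set (ImageNet).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Freeze the pretrained feature layers, so the small annotated robotic
# data set only has to train the new task-specific head.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classification head for the task at hand
# (40 object classes is an illustrative assumption).
num_classes = 40
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are given to the optimizer.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)   # stand-in for annotated robot images
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = loss_fn(backbone(images), labels)
loss.backward()
optimizer.step()

In such a setup, only the small head is learned from the robotic data, while the reused pretrained features provide the generic visual representation that would otherwise require vast amounts of labeled data.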