Reducing Complexity of 3D Indoor Object Detection

Roberta Maisano
Centro Informatico d'Ateneo (CIAM), University of Messina, Messina, Italy
roberta.maisano@unime.it

Valeria Tomaselli, Alessandro Capra
Advanced System Technology, STMicroelectronics, Catania, Italy
valeria.tomaselli@st.com, alessandro.capra@st.com

Francesco Longo, Antonio Puliafito
Department of Engineering, University of Messina, Messina, Italy
francesco.longo@unime.it, antonio.puliafito@unime.it

Abstract—This work addresses the problem of amodal 3D object detection in indoor environments. We revisit a recent 3D object detection method [4] with respect to complexity and runtime speed. 3D detection involves not only localizing objects in the 3D world, but also estimating their physical sizes and poses, even when only parts of them are visible in the RGB-D image. By following the 2.5D representation approach, the system under study achieves a better mean average precision in detection (40.1%) with respect to all recent methods, but its complexity is very high and its current implementation does not fit a small device with limited memory and computational resources. We revise the reference system by changing its network architecture: a well-adapted, fine-tuned MobileNet from Google is introduced with the goal of reducing the complexity of the whole system. Considerable reductions in complexity, computational cost (MAC operations) and memory requirements have been achieved. Many detected classes show an acceptable level of accuracy, and the speed of the recognition system also increased. Final experiments were conducted on the NYUv2 dataset.

Keywords—3D object detection; CNN; MobileNet; complexity; embedded device

I. INTRODUCTION

3D scene understanding and object recognition are among the biggest challenges in computer vision.
In particular, object detection is a fundamental task for space perception in autonomous systems, and perception models are therefore essential in modern robotic systems such as social robots. Humans have a remarkable ability to recognize and localize a large variety of objects in both low- and high-resolution images, despite the fact that objects may appear from different viewpoints and in many different sizes and scales. A huge number of works has been proposed in the past few decades to infer bounding boxes around the visible parts of objects within the image plane. However, even though human beings can effortlessly perceive objects as a whole (through perception principles such as closure: the tendency to complete figures that are incomplete), the 2D representation is far from the human level of perception of the physical world and from the requirements of robotic applications where robots are expected to interact with the environment, such as robotic navigation and augmented reality.

Traditionally, object detection uses 2D bounding boxes around object parts in the image to represent and classify objects. This representation, although it localizes an object in the image plane, does not give the physical position, size and extent of the object in 3D space, due to factors such as occlusion and the uncertainty of the perspective projection. Finding an appropriate environment representation is therefore a crucial problem in robotics, where 3D data are necessary. Such 3D data have recently become available thanks to the advent of low-cost RGB-D cameras like the Microsoft Kinect. At the same time, early feature extraction methods have been replaced by deep Convolutional Neural Networks (CNNs) in 2D image-based object detection. RGB-D detection approaches have subsequently developed in two directions, according to the way feature representations are built from RGB-D images: 2.5D approaches and 3D approaches.
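To make the distinction concrete, 3D approaches typically start by lifting the depth map into a point cloud via the standard pinhole camera model, and all subsequent processing operates on those 3D points. The sketch below illustrates this back-projection step; the intrinsics (fx, fy, cx, cy) and the toy depth values are hypothetical, not taken from any camera or dataset discussed here.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (meters) into an N x 3 point cloud with the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # drop zero-depth pixels, e.g. the "black holes" typical of Kinect depth maps
    return pts[pts[:, 2] > 0]

# toy 2x2 depth map with one missing-depth pixel; hypothetical intrinsics
depth = np.array([[1.0, 2.0],
                  [0.0, 1.5]])
cloud = backproject(depth, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
```

For a full-resolution frame this produces hundreds of thousands of points per image, which is one reason the 3D pipeline is computationally heavy compared with working directly on the dense 2.5D image.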
The competition between these two families of techniques is still intensely open. Some recent works [1] [2] demonstrate that detection performed directly in 3D space handles occlusions better than 2.5D approaches. Other algorithms [3], which align CAD models with 2D detections, try to push the 2.5D approaches beyond the 3D ones. 3D models carry a lot of uncertainty due to the noisy and incomplete reconstructed 3D shape. This degrades the quality of the obtained surfaces, which are less accurate than those obtained from CAD models, and makes fitting bounding boxes to them a very hard task, especially when a large part of an object falls into a "black hole" of the depth map. Moreover, 3D approaches use the depth map differently: 3D points are first reconstructed, and the subsequent processing operates on point clouds. This procedure is extremely expensive computationally. On the other hand, recorded 2D images are dense and complete, and humans can easily perceive objects and their 3D positions even when they are occluded or cluttered by other objects. Hence, with current deep learning techniques it should be possible to reproduce this human ability on 2.5D images. A great contribution was recently given by [4]: sticking to the 2.5D representation framework, it directly relates 2.5D visual appearance to 3D objects, simultaneously predicting objects' 3D locations, physical sizes and orientations in indoor environments, with experiments on the NYUv2 [5] dataset. By naturally modeling the relationship between 2.5D features and the 3D localization and full extent of objects in single-frame RGB-D data, the authors are the first to reformulate the 3D amodal detection problem as regressing class-wise 3D bounding boxes based on 2.5D image appearance features only. In addition, they achieved the best mAP (mean Average Precision) of 40.9% on 19 detection classes on the NYUv2 dataset.
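An amodal 3D bounding box is commonly parameterized by its center, its physical size and an orientation about the vertical axis, from which the eight box corners follow by a planar rotation. The sketch below shows this common parameterization for illustration only; it is not necessarily the exact output encoding used in [4], and all values are hypothetical.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box given center (x, y, z),
    size (length, width, height) and yaw about the vertical (z) axis."""
    l, w, h = size
    # corner offsets in the box's local frame
    xs = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    ys = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2
    zs = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0],
                  [s,  c, 0],
                  [0,  0, 1]])  # rotation in the ground plane only
    return (R @ np.stack([xs, ys, zs])).T + np.asarray(center)

# hypothetical box: 2 m long, 1 m wide, 1 m high, axis-aligned
corners = box_corners(center=(1.0, 2.0, 0.5), size=(2.0, 1.0, 1.0), yaw=0.0)
```

Under such a parameterization, regressing a class-wise box amounts to predicting only seven numbers per object (three for the center, three for the size, one for the yaw), which is what makes formulating amodal detection as a regression problem tractable.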