T. Huang et al. (Eds.): ICONIP 2012, Part IV, LNCS 7666, pp. 172–179, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Grasping Region Identification in Novel Objects
Using Microsoft Kinect
Akshara Rai, Prem Kumar Patchaikani, Mridul Agarwal, Rohit Gupta,
and Laxmidhar Behera
Department of Electrical Engineering, Indian Institute of Technology Kanpur, India
{akshara,premkani,mridagar,grohit,lbehera}@iitk.ac.in
Abstract. We present a novel solution to the problem of robotic grasping of
unknown objects using a machine learning framework and a Microsoft Kinect
sensor. Using only image features, without the aid of a 3D model of the object,
we implement a learning algorithm that identifies grasping regions in 2D images and generalizes well to objects not encountered previously. Thereafter, we demonstrate the algorithm on RGB images of real-life objects taken by a Kinect sensor. We obtain the 3D world coordinates using the Kinect's depth sensor, and the robot manipulator then grasps the object at the identified grasping point.
1 Introduction
We consider the problem of grasping novel objects with a robot. For grasping a previously known object with a known 3D model, methods are available, such as the pre-stored-primitives approach of Miller et al. (2003). However, obtaining a full and accurate 3D reconstruction of a new object is infeasible in a practical scenario, more so with only two images available. In other works, an estimate of the object's 3D model is built by manipulating it with a robotic hand, which is typically time-consuming and not robust.
In contrast to these approaches, we employ a learning algorithm that neither re-
quires nor tries to build a 3D model of the object. Instead it directly identifies, as a
function of the image features and properties, a point at which to grasp the object.
Informally, the algorithm takes a picture of the object, and then tries to identify a
point within the 2D image that corresponds to a good point at which to grasp the
object. (For example, if trying to grasp a coffee mug, it might try to identify the mid-
point of the handle.) The learning is based solely on image features; no 3D information is required. The real-world 3D coordinates are determined from the depth stream of the Kinect sensor, which eliminates the computationally expensive steps required for stereo vision (as done by Saxena et al.).
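As a minimal sketch of this depth-based step, a depth pixel can be back-projected into camera-frame 3D coordinates with the standard pinhole model. The intrinsic parameters below are illustrative placeholders, not the calibration values used in this work:

```python
def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a depth pixel (u, v) with depth Z (metres) into
    3D camera coordinates via the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Illustrative intrinsics only (roughly Kinect-like focal lengths
# and principal point for a 640x480 image); not calibrated values.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5
print(pixel_to_camera_xyz(400, 300, 1.2, FX, FY, CX, CY))
```

In practice the intrinsics come from the sensor calibration mentioned below, and the depth value is read from the Kinect depth stream at the chosen grasp pixel.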
In the experiments conducted, a grasping region is identified on the RGB image of
the scene captured by the RGB camera of the Kinect. The depth and image sensors are calibrated intrinsically as well as extrinsically to the robot base frame. Using the identified grasp region, a 3D grasping point is isolated with respect to the robot base frame, and the robot is programmed to grasp the object at that location.
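The transfer of the grasping point from the camera frame to the robot base frame amounts to applying the extrinsic calibration as a rigid-body transform. A minimal sketch, assuming the extrinsics are given as a 4x4 homogeneous matrix (the matrix below is a made-up example, not a real calibration result):

```python
import numpy as np

def camera_to_base(p_cam, T_base_cam):
    """Map a 3D point from the camera frame to the robot base frame
    using a 4x4 homogeneous extrinsic transform T_base_cam."""
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)  # homogeneous coords
    return (T_base_cam @ p_h)[:3]

# Illustrative extrinsics: camera axes aligned with the base frame,
# camera origin offset 0.5 m along the base z-axis.
T_base_cam = np.eye(4)
T_base_cam[2, 3] = 0.5

print(camera_to_base([0.1, 0.2, 1.0], T_base_cam))
```

The resulting base-frame coordinates are what the manipulator's motion planner is given as the target grasp location.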