T. Huang et al. (Eds.): ICONIP 2012, Part IV, LNCS 7666, pp. 172–179, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Grasping Region Identification in Novel Objects
Using Microsoft Kinect
Akshara Rai, Prem Kumar Patchaikani, Mridul Agarwal, Rohit Gupta,
and Laxmidhar Behera
Department of Electrical Engineering, Indian Institute of Technology Kanpur, India
{akshara,premkani,mridagar,grohit,lbehera}@iitk.ac.in
Abstract. We present a novel solution to the problem of robotic grasping of
unknown objects using a machine learning framework and a Microsoft Kinect
sensor. Using only image features, without the aid of a 3D model of the object,
we implement a learning algorithm that identifies grasping regions in 2D images and generalizes well to objects not encountered previously. Thereafter, we demonstrate the algorithm on RGB images of real-life objects taken by a Kinect sensor. We obtain the 3D world coordinates using the Kinect's depth sensor, and the robot manipulator then grasps the object at the identified grasping point.
1 Introduction
We consider the problem of grasping novel objects with a robot. For grasping a previously known object with a known 3D model, methods are available, such as the pre-stored-primitives approach of Miller et al. (2003). However, obtaining a full and accurate 3D reconstruction of a new object is infeasible in a practical scenario, more so with only two images available. In other works, an estimate of the object's 3D model is built by manipulating it with a robotic hand, which is typically time-consuming and not robust.
In contrast to these approaches, we employ a learning algorithm that neither re-
quires nor tries to build a 3D model of the object. Instead it directly identifies, as a
function of the image features and properties, a point at which to grasp the object.
Informally, the algorithm takes a picture of the object, and then tries to identify a
point within the 2D image that corresponds to a good point at which to grasp the
object. (For example, if trying to grasp a coffee mug, it might try to identify the mid-
point of the handle.) The learning is based solely on image features; no 3D information is required. The real-world 3D coordinates are determined from the depth stream of the Kinect sensor, which eliminates the computationally expensive steps required for stereo vision (as done by Saxena et al.).
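As a minimal sketch of this depth-based step, a depth pixel can be back-projected into camera-frame 3D coordinates with the standard pinhole model. The intrinsic parameters below are illustrative placeholders, not the calibration values used in this work:

```python
def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a depth pixel (u, v) with depth Z (metres) into
    3D camera coordinates via the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Illustrative intrinsics only (roughly Kinect-like focal lengths
# and principal point for a 640x480 image); not calibrated values.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5
print(pixel_to_camera_xyz(400, 300, 1.2, FX, FY, CX, CY))
```

In practice the intrinsics come from the sensor calibration mentioned below, and the depth value is read from the Kinect depth stream at the chosen grasp pixel.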
In the experiments conducted, a grasping region is identified on the RGB image of
the scene captured by the RGB camera of the Kinect. The depth and image sensors are calibrated intrinsically as well as extrinsically to the robot base frame. Using the identified grasp region, a 3D grasping point is isolated with respect to the robot base frame, and the robot is programmed to grasp the object at that location.
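The transfer of the grasping point from the camera frame to the robot base frame amounts to applying the extrinsic calibration as a rigid-body transform. A minimal sketch, assuming the extrinsics are given as a 4x4 homogeneous matrix (the matrix below is a made-up example, not a real calibration result):

```python
import numpy as np

def camera_to_base(p_cam, T_base_cam):
    """Map a 3D point from the camera frame to the robot base frame
    using a 4x4 homogeneous extrinsic transform T_base_cam."""
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)  # homogeneous coords
    return (T_base_cam @ p_h)[:3]

# Illustrative extrinsics: camera axes aligned with the base frame,
# camera origin offset 0.5 m along the base z-axis.
T_base_cam = np.eye(4)
T_base_cam[2, 3] = 0.5

print(camera_to_base([0.1, 0.2, 1.0], T_base_cam))
```

The resulting base-frame coordinates are what the manipulator's motion planner is given as the target grasp location.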