Real-time object detection and localization with SIFT-based clustering ☆
Paolo Piccinini a, Andrea Prati b,⁎, Rita Cucchiara a
a Department of Information Engineering, University of Modena and Reggio Emilia, Via Vignolese, 905/b, 41100 Modena, Italy
b Department of Planning and Design of Complex Environments, University IUAV of Venice, Santa Croce 1957, 30135 Venezia, Italy
Article info
Article history:
Received 27 January 2011
Received in revised form 5 January 2012
Accepted 17 June 2012
Keywords:
Pick-and-place applications
Machine vision for industrial applications
SIFT
Abstract
This paper presents an innovative approach for detecting and localizing duplicate objects in pick-and-place applications under extreme conditions of occlusion, where standard appearance-based approaches are likely to be ineffective. The approach exploits SIFT keypoint extraction and mean shift clustering to partition the correspondences between the object model and the image into different potential object instances with real-time performance. Object-shape hypotheses are then validated by projecting some delimiting points onto the current image with a fast Euclidean transform. Moreover, to improve detection in the case of reflective or transparent objects, multiple object models (of both the same and different faces of the object) are used and fused together. Measures of efficacy and efficiency are provided on random disposals of heavily-occluded objects, with a specific focus on real-time processing. Experimental results on different and challenging kinds of objects are reported.
© 2012 Elsevier B.V. All rights reserved.
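The clustering step summarized above — grouping model-to-image SIFT correspondences into candidate object instances — can be sketched with a flat-kernel mean shift over the 2-D scene coordinates of the matched keypoints. The following is a minimal, illustrative NumPy version, not the authors' implementation; the function name and parameters are assumptions:

```python
import numpy as np

def mean_shift_2d(points, bandwidth, n_iter=30, merge_tol=1.0):
    """Flat-kernel mean shift over 2-D points (e.g. matched SIFT
    locations in the scene image). Each point climbs to the mean of
    its bandwidth neighborhood; converged modes closer than merge_tol
    are merged, and each resulting cluster is a candidate instance."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            near = points[np.linalg.norm(points - modes[i], axis=1) < bandwidth]
            modes[i] = near.mean(axis=0)
    # Merge nearby modes into cluster centers and assign labels
    centers, labels = [], np.empty(len(points), dtype=int)
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < merge_tol:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels
```

With two well-separated clumps of matched keypoints, the procedure returns two centers and a label per correspondence; the bandwidth plays the role of the expected object extent in the image.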
1. Introduction
Over the last decades, information technologies have become a fundamental aid in automating everyday life and industrial processes. Among the many disciplines contributing to this trend, machine vision and pattern recognition have been widely used in industrial applications, and especially in robot vision. A typical need is to automate the pick-and-place process of picking up objects, possibly performing some tasks, and then placing them down in a different location. Most pick-and-place systems are basically composed of robotic systems and sensors. The sensors are in charge of driving the robot arms to the right 3D location, and possibly orientation, of the next object to be picked up, according to the robot's degrees of freedom. Object picking can be very complicated if the scene is not well structured and constrained.
Automating object picking by means of cameras, however, requires detecting and localizing objects in the scene; these are crucial tasks for several other computer vision applications as well, such as image/video retrieval [1,2] or automatic robot navigation [3].
This paper describes a new complete approach for pick-and-place
processes with the following challenging requirements:
1. Different types of objects: the approach should work with every type of object, of different dimensions and complexity, with reflective surfaces or semi-transparent parts, as in the case of pharmaceutical and cosmetic objects, which are often reflective or wrapped in transparent flowpacks;
2. Random object disposal: most picking systems consider the case of well-separated objects, aligned on the belt, with synchronized grasping of the objects. We would like to generalize the problem by relaxing these constraints. The ultimate goal is to work directly in bins (a problem known as bin picking [4]), to save time and/or for hygienic reasons, as shown in Fig. 1(b) and (c);
3. Multiple instances and distractors: in the case of pick-and-place applications, the aim is not limited to counting and classifying the first (or best) instance, but extends to determining the locations, orientations and sizes of all (or most of) the duplicates/instances. Object duplicates can have different sizes, poses and orientations, and they can be seen from different viewpoints and under different illumination. Moreover, in real applications the system must also account for the presence of distractors, i.e. other types of objects, different from the target one (see, for instance, Fig. 1(d)), that should not be detected;
4. Heavily-occluded objects: as a consequence of requirements 2 and 3, objects can be severely occluded (see Fig. 1);
5. High working speed: the required working speed is very high; a fast detection technique should be adopted in order to process more than a hundred objects per minute.
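As a back-of-the-envelope reading of requirement 5 (an illustrative calculation, not from the paper), a throughput above one hundred objects per minute leaves at most 600 ms per object for detection, localization and grasping combined:

```python
# Illustrative time budget implied by requirement 5
objects_per_minute = 100
budget_ms = 60_000 / objects_per_minute  # milliseconds available per object
print(budget_ms)  # 600.0
```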
Machine vision often exploits a 3D CAD model of the object [5–7]. In particular, the active appearance models used for 3D face matching in [7] provide fast and accurate object matching. They may, however, prove unsuitable for pick-and-place applications because of illumination variations (e.g., reflections caused by flowpacks), severe occlusions and the deformability of the objects.
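The hypothesis-validation step mentioned in the abstract — projecting the model's delimiting points onto the image with a fast Euclidean transform — can be sketched as a least-squares rigid alignment of the matched keypoints. The following 2-D Kabsch/Procrustes estimate is an illustrative stand-in under that assumption, not the authors' implementation:

```python
import numpy as np

def estimate_euclidean_2d(src, dst):
    """Least-squares Euclidean (rotation + translation) transform
    mapping src points onto dst (2-D Kabsch/Procrustes)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)   # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # rule out reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

def project(points, R, t):
    """Map the model's delimiting points (e.g. corners) into the image."""
    return points @ R.T + t
```

Given the correspondences of one cluster, the recovered (R, t) projects the model's delimiting points into the image, where the shape hypothesis can be checked against the evidence.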
Image and Vision Computing 30 (2012) 573–587
☆ This paper has been recommended for acceptance by Ian Reid.
⁎ Corresponding author. Tel.: +39 0412572169.
E-mail addresses: paolo.piccinini@unimore.it (P. Piccinini), andrea.prati@iuav.it
(A. Prati), rita.cucchiara@unimore.it (R. Cucchiara).
0262-8856/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.imavis.2012.06.004