Real-time object detection and localization with SIFT-based clustering

Paolo Piccinini a, Andrea Prati b,⁎, Rita Cucchiara a

a Department of Information Engineering, University of Modena and Reggio Emilia, Via Vignolese, 905/b, 41100 Modena, Italy
b Department of Planning and Design of Complex Environments, University IUAV of Venice, Santa Croce 1957, 30135 Venezia, Italy

Article history: Received 27 January 2011; Received in revised form 5 January 2012; Accepted 17 June 2012

Keywords: Pick-and-place applications; Machine vision for industrial applications; SIFT

Abstract

This paper presents an innovative approach for detecting and localizing duplicate objects in pick-and-place applications under extreme conditions of occlusion, where standard appearance-based approaches are likely to be ineffective. The approach exploits SIFT keypoint extraction and mean shift clustering to partition the correspondences between the object model and the image onto different potential object instances with real-time performance. Then, the hypotheses of the object shape are validated by projecting some delimiting points onto the current image with a fast Euclidean transform. Moreover, in order to improve detection in the case of reflective or transparent objects, multiple object models (of both the same and different faces of the object) are used and fused together. Many measures of efficacy and efficiency are provided on random disposals of heavily-occluded objects, with a specific focus on real-time processing. Experimental results on different and challenging kinds of objects are reported.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Over the last decades, information technologies have become a fundamental aid in automating everyday life and industrial processes.
Among the many different disciplines contributing to this process, machine vision and pattern recognition have been widely used for industrial applications and especially for robot vision. A typical need is to automate the pick-and-place process of picking up objects, possibly performing some tasks, and then placing them down in a different location. Most pick-and-place systems are basically composed of robotic systems and sensors. The sensors are in charge of driving the robot arms to the right 3D location and, depending on the robot's degrees of freedom, the right orientation for the next object to be picked up. Object picking can be very complicated if the scene is not well structured and constrained.

The automation of object picking using cameras, however, requires detecting and localizing objects in the scene; these are crucial tasks for several other computer vision applications as well, such as image/video retrieval [1,2] or automatic robot navigation [3]. This paper describes a new complete approach for pick-and-place processes with the following challenging requirements:

1. Different types of objects: the approach should work with every type of object, of different dimension and complexity, with reflective surfaces or semi-transparent parts, such as in the case of pharmaceutical and cosmetic objects, often reflective or wrapped in transparent flowpacks;

2. Random object disposal: most picking systems consider the case of well separated objects, well aligned on the belt and with synchronized grasping of the objects. We would like to generalize the problem by relaxing these constraints. The ultimate goal is to work directly in bins (a problem known as bin picking [4]), for saving time and/or for hygienic reasons, as shown in Fig. 1(b) and (c);

3.
Multiple instances and distractors: in pick-and-place applications the aim is not limited to counting and classifying the first (or best) instance, but extends to determining the locations, orientations and sizes of all (or most of) the duplicates/instances. Object duplicates can have different sizes, poses and orientations, and they can be seen from different viewpoints and under different illumination. Moreover, in real applications the system must also account for the presence of distractors, i.e. other types of objects, different from the target one (see, for instance, Fig. 1(d)), that should not be detected;

4. Heavily-occluded objects: as a consequence of requirements 2 and 3, objects can be severely occluded (see Fig. 1);

5. High working speed: the required working speed is very high; a fast detection technique should be adopted to handle more than a hundred objects per minute.

Machine vision often exploits a 3D CAD model of the object [5–7]. In particular, the active appearance models used for 3D face matching in [7] provide fast and accurate object matching. They may, however, be unsuitable for pick-and-place applications because of illumination variations (e.g., the reflections due to flowpacks), the severe occlusions and the deformability of the objects.

Image and Vision Computing 30 (2012) 573–587. doi:10.1016/j.imavis.2012.06.004. This paper has been recommended for acceptance by Ian Reid. ⁎ Corresponding author. Tel.: +39 0412572169. E-mail addresses: paolo.piccinini@unimore.it (P. Piccinini), andrea.prati@iuav.it (A. Prati), rita.cucchiara@unimore.it (R. Cucchiara).
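The core step summarized in the abstract, partitioning model-to-image keypoint correspondences into per-instance groups via mean shift clustering, can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the flat-kernel mean shift below and the synthetic 2D match coordinates are illustrative assumptions (a real system would cluster the image-side locations of actual SIFT matches, e.g. as produced by a feature matcher), but the grouping behavior it shows is the same: matches drift toward local density modes, and each mode becomes a candidate object instance.

```python
import math

def mean_shift(points, bandwidth=2.0, iters=50, tol=1e-3):
    """Flat-kernel mean shift on 2D points: each point is iteratively
    shifted to the mean of its neighbours within `bandwidth`; converged
    positions closer than `bandwidth` are merged into one mode."""
    modes = []
    for x, y in points:
        for _ in range(iters):
            nbrs = [q for q in points
                    if math.hypot(q[0] - x, q[1] - y) <= bandwidth]
            mx = sum(q[0] for q in nbrs) / len(nbrs)
            my = sum(q[1] for q in nbrs) / len(nbrs)
            if math.hypot(mx - x, my - y) < tol:
                break
            x, y = mx, my
        if not any(math.hypot(m[0] - x, m[1] - y) < bandwidth for m in modes):
            modes.append((x, y))
    return modes

def assign(points, modes):
    """Label each point with the index of its nearest mode
    (i.e. its candidate object instance)."""
    return [min(range(len(modes)),
                key=lambda i: math.hypot(modes[i][0] - p[0],
                                         modes[i][1] - p[1]))
            for p in points]

# Synthetic "match locations": two spatial clumps stand in for two
# duplicate instances of the same object model in the image.
matches = [(1.0, 1.0), (1.5, 0.8), (0.9, 1.4),
           (10.0, 10.0), (10.4, 9.7), (9.8, 10.3)]
modes = mean_shift(matches, bandwidth=2.0)
labels = assign(matches, modes)
print(len(modes), labels)  # → 2 [0, 0, 0, 1, 1, 1]
```

Each resulting group of correspondences would then be handed to the validation stage, where the hypothesized object shape is checked by projecting delimiting points onto the image. The bandwidth parameter here plays the role of the expected spatial extent of one object instance.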