A Fuzzy Associative Approach for Recognition of 3D Objects in Arbitrary Pose

Aaron Mavrinac, Ahmad Shawky, and Xiang Chen

Abstract— Once the human vision system has seen a 3D object from a few different viewpoints, depending on the nature of the object, it can generally recognize that object from new arbitrary viewpoints. This useful interpolative skill relies on the highly complex pattern matching systems in the human brain, but the general idea can be applied to a computer vision recognition system using comparatively simple machine learning techniques. An approach to the recognition of 3D objects in arbitrary pose relative to the vision equipment, given only a limited training set of views, is presented. This approach involves computing a disparity map using stereo cameras, extracting a set of features from the disparity map, and classifying it via a fuzzy associative map to a trained object.

I. INTRODUCTION

Humans are generally able to recognize 2D shapes, regardless of changes in orientation, scale, or skew, after having seen the shape in one such configuration. This shape recognition has a very wide range of applications, and accordingly, much work has gone into automating it with computers. The basic theory is that shapes can be extracted from otherwise cluttered and cumbersome images, from which some set of quantifiers efficiently describing the shapes can be obtained and compared to known values through some algorithm for classification. The nature of these quantifiers and the classification algorithm are a subject of much research; most use quantifiers invariant to the aforementioned transformations (rotation, scale, skew, etc.) such as Fourier descriptors, moment invariants, and Hough transformations, and most use machine learning methods such as fuzzy logic and neural networks for classification.
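As a concrete instance of the "moment invariants" mentioned above, the first two Hu moments are a standard set of 2D shape quantifiers invariant to translation, scale, and rotation. The sketch below is illustrative only (the function name and image layout are assumptions, not the paper's implementation):

```python
import numpy as np

def hu_first_two(img):
    """First two Hu moment invariants of a binary/grayscale image.

    Illustrative sketch of rotation-, scale-, and translation-invariant
    shape quantifiers; not the paper's own feature set.
    """
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00

    # Central moments, then scale-normalized moments eta_pq
    mu = lambda p, q: ((x - xc) ** p * (y - yc) ** q * img).sum()
    eta = lambda p, q: mu(p, q) / m00 ** (1 + (p + q) / 2.0)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return h1, h2
```

Rotating a shape (e.g. by 90 degrees with `np.rot90`) leaves both values unchanged up to floating-point error, which is exactly the invariance property exploited for classification.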
Humans are also generally able to recognize 3D objects, regardless of their orientation, after having seen a sufficient number of different views (depending, of course, on the nature of the object itself). To generalize from the 2D case, it is possible to automate this process in a similar manner by obtaining quantifiers describing the 3D surface rather than the 2D shape. Such quantifiers can be extracted from range images or, in the case of stereo vision, disparity maps. However, a single such image gives information only from a certain perspective; this is commonly referred to as 2.5D. To approach full 3D information, range images must be taken from different perspectives around the object. For classification to continue to work as generalized from the 2D case, the sets of quantifiers from each perspective must be combined to fully describe the object, and the classification algorithm must be designed to operate on this type of information.

In this paper, we expand on previous work in object recognition using invariant values on 2D images [10], justifying the selection of proper invariant descriptors for 3D shapes based on disparity maps and modifying the classification scheme to reflect the new object description. The result is a system capable of recognizing a trained object based on a disparity map taken by a stereo camera rig from any view, where training requires only a few different such views.

II. PRELIMINARY THEORY

A. Disparity Map

We assume a stereo vision system capable of generating rectified stereo images, wherein the epipolar lines are parallel and horizontally aligned as if captured by parallel cameras. In the general case, this requires internal and external (stereo) calibration of the cameras, which is beyond the scope of this work; for a thorough geometrical treatment see [3], [29], and for some practical methods see [4], [5], [6].
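The paper treats disparity-map generation from the rectified pair as given. As a deliberately naive illustration of what a matcher over such a pair does, a sum-of-absolute-differences (SAD) block matcher can be sketched as follows (the function name and parameters are assumptions for illustration, not the paper's method):

```python
import numpy as np

def disparity_map(left, right, max_disp=16, block=5):
    """Naive SAD block matching over a rectified grayscale pair.

    For each left-image pixel, slide a block leftward along the same
    row of the right image (valid because rectification makes epipolar
    lines horizontal) and keep the offset with the lowest SAD cost.
    Illustrative only: practical matchers add subpixel refinement,
    uniqueness checks, and occlusion handling.
    """
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(patch - right[y - half:y + half + 1,
                                          x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = np.argmin(costs)
    return disp
```

On a synthetic pair in which the right image is the left image shifted by a known number of pixels, the matcher recovers that constant disparity over the textured interior.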
Throughout this paper, the following convention is used for the world and image coordinate systems: lowercase x and y represent image coordinates starting at the upper left corner, and uppercase X, Y, and Z represent world coordinates (which, unless otherwise specified, are mutually orthogonal with Z perpendicular to the rectified image planes and have their origin at the optical center of the left camera). Figure 1 illustrates their relationship.

Fig. 1. Coordinate System Convention

Given a pixel of coordinates (x1, y1) in one image of an epipolar-rectified stereo pair, and a corresponding pixel (x2, y2) in the other (where y1 = y2), their disparity d is defined as x2 − x1 [29]. This can be used to triangulate the depth to the original 3D point in the environment (from

978-1-4244-1819-0/08/$25.00 © 2008 IEEE
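For the parallel-camera geometry assumed here, the triangulation just mentioned reduces to the standard relation Z = f·b/d, where f is the focal length in pixels and b is the baseline between the optical centers; neither symbol is defined in this excerpt, so the sketch below is a hedged illustration assuming that notation and d > 0:

```python
def depth_from_disparity(d_px, focal_px, baseline):
    """Depth Z of a point from its disparity, for a rectified
    parallel-camera pair: Z = f * b / d. Units: f and d in pixels,
    b and Z in the same length unit (e.g. metres). Assumes d > 0.
    """
    return focal_px * baseline / d_px

# Example: a 10-pixel disparity with f = 800 px and b = 0.12 m
print(depth_from_disparity(10, 800, 0.12))  # -> 9.6 (metres)
```

Note the inverse relationship: larger disparities correspond to nearer points, so depth resolution degrades with distance.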