Multi-Label Object Categorization Using Histograms of Global Relations

Wail Mustafa*, Hanchen Xiong†, Dirk Kraft*, Sandor Szedmak†, Justus Piater† and Norbert Krüger*
*Mærsk Mc-Kinney Møller Institute, University of Southern Denmark, Campusvej 55, 5230 Odense C, Denmark. Email: wail@mmmi.sdu.dk
†Institute of Computer Science, University of Innsbruck, Technikerstr. 21a, A-6020 Innsbruck, Austria

© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Mustafa, W.; Xiong, H.; Kraft, D.; Szedmak, S.; Piater, J.; Krüger, N., "Multi-label Object Categorization Using Histograms of Global Relations," in 3D Vision (3DV), 2015 International Conference on, pp. 309-317, 19-22 Oct. 2015. DOI: 10.1109/3DV.2015.42

Abstract

In this paper, we present an object categorization system capable of assigning multiple, related categories to novel objects using multi-label learning. In this system, objects are described using global geometric relations of 3D features. We propose using the Joint SVM method for learning, and we investigate the extraction of hierarchical clusters as a higher-level description of objects to assist the learning. We compare against other multi-label learning approaches as well as single-label approaches (including state-of-the-art methods using different object descriptors). The experiments are carried out on a dataset of 100 objects belonging to 13 visual and action-related categories. The results indicate that multi-label methods are able to identify the relations between dependent categories and hence perform categorization accordingly.
It is also found that extracting hierarchical clusters does not lead to a gain in the system's performance. The results also show that using histograms of global relations to describe objects leads to fast learning in terms of the number of samples required for training.

I. Introduction

Object categorization is important for a variety of tasks, especially when systems are expected to deal with novel objects according to prior knowledge. Categorizing novel objects is useful in several applications such as driver assistance [16] and video surveillance [11]. In robotic applications in particular, categories can be linked to manipulation actions, allowing predefined actions to be performed on novel objects (see e.g., [19]).

Existing object categorization methods assume that objects belong to single and distinct categories (e.g., 'cup' and 'car') [2], [25] and thus employ single-label learning.

Fig. 1. Examples of labeled objects; each object carries several labels, e.g., 'box like', 'container', 'has rim', 'has handle', 'pour', 'cylinder like', 'rollable', 'bowl like', 'stirring tool', 'hammer', 'insert' and 'spray'.

In this work, we consider scenarios in which objects can belong to multiple and related (overlapping or nested) categories (Fig. 1), potentially associated with different levels of abstraction. Such scenarios are very common for everyday objects. The ability to learn categories at different abstraction levels allows, e.g., associating manipulation actions with visual patterns rather than designing actions for specific object instances. In the context of robotic manipulation, this means that multiple, dependent actions may be proposed as "affordances" for a novel object. For this learning problem, we utilize multi-label classification [32], which is intrinsically able to assign multiple labels per data sample while considering the interdependence of the labels.
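To make the multi-label setting concrete, the following toy sketch shows how annotations become binary label vectors and how a method that predicts the whole vector at once preserves label co-occurrence (e.g., 'container' implying 'has rim'). The data, category names, and the 1-nearest-neighbour predictor are all illustrative assumptions; the paper's actual learner is the Joint SVM, not shown here.

```python
import numpy as np

# Hypothetical toy data: each object is a feature vector, and its
# annotation is a *binary label vector* over categories, several of
# which may be active at once.
labels = ["container", "has_rim", "rollable"]

X_train = np.array([[0.9, 0.1],   # 4 training objects, 2-D features
                    [0.8, 0.2],
                    [0.1, 0.9],
                    [0.2, 0.8]])
Y_train = np.array([[1, 1, 0],    # bowl-like: container + has rim
                    [1, 1, 0],    # cup-like:  container + has rim
                    [0, 0, 1],    # cylinder:  rollable
                    [0, 0, 1]])   # ball:      rollable

def predict_multilabel(x, X, Y):
    """1-nearest-neighbour multi-label prediction: return the full
    label vector of the closest training sample, so dependent labels
    are predicted jointly rather than one category at a time."""
    i = np.argmin(np.linalg.norm(X - x, axis=1))
    return Y[i]

y = predict_multilabel(np.array([0.9, 0.12]), X_train, Y_train)
print([name for name, on in zip(labels, y) if on])  # ['container', 'has_rim']
```

Because the label vector is predicted as a whole, co-occurring categories are emitted together, which is exactly what independent per-category (one-versus-all) classifiers cannot guarantee.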
In contrast, single-label methods, even when configured in a one-versus-all fashion, are expected to perform poorly on dependent categories.

In this work, objects are encoded using global descriptors composed of histograms of relative geometric attributes computed between full 3D features (Fig. 2). The 3D features are extracted using three RGB-D sensors (the three views are fused in 3D space), capturing object shapes rather completely. This description of objects is rich and highly invariant to viewpoint, leading to high performance and fast learning in terms of the number of samples required to train the system.
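The idea of histogramming relative geometric attributes over all feature pairs can be sketched as follows. This is a minimal illustration under assumed attributes (pairwise distance and normal angle), not the paper's exact feature set; because the attributes are relative, the resulting descriptor is invariant to rigid transformations of the object.

```python
import numpy as np

def relation_histogram(points, normals, n_bins=8):
    """Sketch of a histogram-of-global-relations descriptor: for every
    pair of 3D features, compute two relative attributes (Euclidean
    distance between positions, angle between surface normals), then
    histogram each attribute over all pairs into a fixed-length vector."""
    i, j = np.triu_indices(len(points), k=1)     # all unordered feature pairs
    dists = np.linalg.norm(points[i] - points[j], axis=1)
    cos_ang = np.clip(np.sum(normals[i] * normals[j], axis=1), -1.0, 1.0)
    angles = np.arccos(cos_ang)
    h_d, _ = np.histogram(dists, bins=n_bins, range=(0, dists.max()), density=True)
    h_a, _ = np.histogram(angles, bins=n_bins, range=(0, np.pi), density=True)
    return np.concatenate([h_d, h_a])            # fixed-length object descriptor

# Usage: random stand-in "features" for one object
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
N = rng.normal(size=(50, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)

desc = relation_histogram(P, N)

# A rigidly rotated copy of the object yields the same descriptor,
# since pairwise distances and normal angles are preserved:
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random orthogonal matrix
desc_rot = relation_histogram(P @ R.T, N @ R.T)
print(np.allclose(desc, desc_rot))  # True
```

Note the design choice: nothing in the descriptor refers to an absolute coordinate frame, which is what makes the representation viewpoint-invariant once the three RGB-D views are fused into a single 3D point set.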