Semantic Grouping of Visual Features

Alexandra Teynor and Hans Burkhardt
Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Germany
{teynor, Hans.Burkhardt}@informatik.uni-freiburg.de

Abstract

Many current object class models build on visual parts that constitute an object. However, visually different entities may actually refer to the same object part, which can be harmful for part-based object class models. We present a method by which visually distinct parts with the same semantic role can be associated, by creating groupings based on the similarity of their occurrence distributions. Experimental results verify that more compact class representations can be built on these groupings, leading to improved classification performance and/or reduced classification time.

1. Introduction

A common technique for the recognition of object classes is the use of part dictionaries or "visual codebooks". These codebooks contain a variety of possible image structures. Whenever visual codebooks are created, e.g., by clustering appearance features from local image patches, we obtain only a visual, not a semantic, grouping of object parts. The variety in the visual appearance of semantically equal object parts is due to several reasons. First, there are natural intra-class variabilities. Second, we have to deal with different poses: a mouth, for example, might be open, shut, or smiling and showing the teeth. Other reasons exist as well: current feature extraction methods often rely on interest point detectors that do not always fire at the same locations on different object instances. This can result in shifted local windows for the same object part, so an eye might not always appear at the center of a local window, but slightly shifted to the left or right. The features extracted from such shifted windows can be quite different.
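The codebook construction mentioned above (clustering appearance features from local image patches) can be sketched as follows. This is a minimal k-means illustration, not the paper's actual pipeline; the function name `build_codebook` and its parameters are assumptions for demonstration.

```python
import numpy as np

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster local appearance descriptors into a visual codebook.

    descriptors: (n, d) float array of patch descriptors (e.g. SIFT).
    Returns (k, d) cluster centers ("visual words") and the per-descriptor
    word assignments. A plain k-means sketch; any clustering method works.
    """
    rng = np.random.default_rng(seed)
    # initialize centers with k distinct descriptors drawn at random
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    labels = np.zeros(len(descriptors), dtype=int)
    for _ in range(n_iter):
        # assign each descriptor to its nearest center (visual word)
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its members;
        # keep the old center if a cluster went empty
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels
```

With such a codebook, every detected patch is replaced by the index of its nearest center, which is exactly the purely visual grouping the paper argues is insufficient.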
Invariance towards such shifts could be incorporated into the local features, but some very successful descriptors, such as the SIFT features by Lowe [5], deliberately consider not only the frequency of certain structures but also their location. These types of features are affected by shifts in the detected structure.

Depending on the classification strategy, treating semantically similar parts separately can be harmful. Especially in "bag-of-features" approaches, parts with the same role are assigned to different dictionary entries. Distance calculation between part histograms is typically performed in a bin-by-bin fashion, so performance can be degraded by not relating semantically similar parts.

In this work, we present a novel way to perform a semantic grouping of object parts. Parts with a different visual appearance but the same semantic role are associated via the similarity of their occurrence distributions given the object class.

2. Related work

Previous work concerning the semantic grouping of visual structures has been performed by Leibe [4] and by Epshtein and Ullman [1]. Leibe combines visual parts by co-location and co-activation clustering. His approach is similar to ours in that he also tries to associate parts that occur at the same location in an image, but he uses a weighted variation of the Hausdorff distance to combine visual parts. He does not apply his procedure to part-frequency-based object class models, as he advocates a Hough-transform-like voting method. Epshtein and Ullman use the context of parts in a probabilistic framework: they identify the geometric relations of parts co-occurring with a basic "root fragment" and search for similar constellations in test images. Our approach does not need a root fragment, but creates a number of groupings based on the desired similarity of the occurrence distributions.

3. Method

The basic idea is that object parts with the same semantic meaning occur at the same location(s) on an object. For example, the mouth is always located in

978-1-4244-2175-6/08/$25.00 ©2008 IEEE
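The grouping idea of Section 3 can be illustrated with a small sketch: estimate each visual word's occurrence distribution as a coarse spatial histogram of where it fires (in object-normalized coordinates), then merge words whose distributions are similar. The 4x4 grid, the chi-square distance, and the greedy merging below are assumptions chosen for illustration; the paper's actual grouping criterion may differ.

```python
import numpy as np

def occurrence_hist(locations, grid=(4, 4)):
    """Coarse spatial histogram of where one visual word occurs.

    locations: (n, 2) array of (x, y) positions, normalized to [0, 1]
    relative to the object bounding box. The normalized histogram
    approximates the word's occurrence distribution on the object.
    """
    h, _, _ = np.histogram2d(locations[:, 0], locations[:, 1],
                             bins=grid, range=[[0, 1], [0, 1]])
    h = h.ravel()
    return h / h.sum()

def group_words(word_locations, threshold=0.2):
    """Greedy grouping sketch (an assumption, not the paper's algorithm):
    merge visual words whose occurrence distributions lie closer than
    `threshold` under the chi-square histogram distance."""
    hists = [occurrence_hist(loc) for loc in word_locations]
    groups = []  # each group is a list of visual word indices
    for i, h in enumerate(hists):
        for g in groups:
            ref = hists[g[0]]  # compare against the group's first member
            chi2 = 0.5 * np.sum((h - ref) ** 2 / (h + ref + 1e-10))
            if chi2 < threshold:
                g.append(i)
                break
        else:
            groups.append([i])  # no close group found: start a new one
    return groups
```

Merging the bag-of-features histogram bins of each resulting group yields the more compact class representation described in the abstract: two visually distinct "mouth" codewords that always fire at the same object location end up in one bin, so a bin-by-bin histogram distance no longer penalizes their visual difference.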