An Experimental Comparison of Three Guiding Principles for the Detection of Salient Image Locations: Stability, Complexity, and Discrimination

Dashan Gao    Nuno Vasconcelos
Department of Electrical and Computer Engineering, University of California, San Diego
dgao@ucsd.edu    nuno@ece.ucsd.edu

Abstract

We present an experimental comparison of the performance of representative saliency detectors from three guiding principles for the detection of salient image locations: locations of maximum stability with respect to image transformations, locations of greatest image complexity, and most discriminant locations. It is shown that discriminant saliency performs better in terms of 1) capturing relevant information for classification, 2) being more robust to image clutter, and 3) exhibiting greater stability to image transformations associated with variations of 3D object pose. We then investigate the dependence of discriminant saliency on the underlying set of candidate discriminant features, by comparing the performance achieved with three popular feature sets: the discrete cosine transform, a Gabor decomposition, and a Haar wavelet decomposition. It is shown that, even though the different feature sets produce equivalent results, there may be advantages in considering features explicitly learned from examples of the image classes of interest.

1. Introduction

Saliency mechanisms play an important role in the ability of biological vision systems to perform visual recognition in cluttered scenes. In the computer vision literature, the extraction of salient points from images has been a subject of research for at least a few decades. Broadly speaking, existing saliency detectors can be divided into four major classes. The first, and most popular, treats the problem as one of detecting specific visual attributes, usually edges or corners (also called "interest points").
For example, Harris [1] and Förstner [2] measure an auto-correlation matrix at each image location and then compute its eigenvalues to determine whether that location belongs to a flat image region, an edge, or a corner. While these detectors are optimal in the sense of finding salient locations of maximal stability with respect to certain image transformations, there have also been proposals for the detection of other low-level visual attributes, e.g. contours [3]. These basic detectors can then be embedded in scale-space [12] to achieve detection invariance with respect to transformations such as scale [13] or affine mappings [14].

A second major class of saliency detectors is based on more generic, data-driven definitions of saliency. In particular, an idea that has recently gained some popularity is to define saliency as image complexity. Various complexity measures have been proposed: Lowe [4] measures complexity by computing the intensity variation in an image using a difference-of-Gaussian function; Sebe [5] measures the absolute value of the coefficients of a wavelet decomposition of the image; and Kadir [6] relies on the entropy of the distribution of local image intensities. The main advantage of the definitions in this class is their significantly greater flexibility, which makes them able to detect any of the low-level attributes discussed above (corners, contours, smooth edges, etc.), depending on the image under consideration.

A third formulation is to start from models of biological vision and derive saliency detection algorithms from these models [7, 22]. This formulation has the appeal of being rooted in the only known fully-functioning vision systems, and has been shown to lead to interesting saliency behavior [7, 22].
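To make the complexity-based definition concrete, the entropy idea attributed to Kadir [6] can be sketched in a few lines: score each image window by the Shannon entropy of its local intensity histogram, so that flat regions score low and textured regions score high. This is an illustrative sketch only; the patch size, bin count, and scanning scheme below are our own choices, not the parameters of the cited detector.

```python
import numpy as np

def local_entropy_saliency(image, patch=8, bins=16):
    """Score each patch of `image` (values in [0, 1]) by the Shannon
    entropy of its intensity histogram (complexity-as-saliency sketch)."""
    h, w = image.shape
    saliency = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(saliency.shape[0]):
        for j in range(saliency.shape[1]):
            window = image[i:i + patch, j:j + patch]
            hist, _ = np.histogram(window, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]  # drop empty bins so 0*log(0) contributes nothing
            saliency[i, j] = -np.sum(p * np.log2(p))
    return saliency

# A flat region concentrates its histogram in one bin (entropy ~ 0),
# while a textured region spreads mass across bins and scores higher.
rng = np.random.default_rng(0)
flat = np.full((16, 16), 0.5)
textured = rng.random((16, 16))
print(local_entropy_saliency(flat).max() < local_entropy_saliency(textured).max())
```

Note how this measure is agnostic to the *kind* of structure present: a corner, a contour, or any other intensity variation raises the entropy, which is exactly the flexibility attributed to this class of detectors.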
Interestingly, however, human experiments conducted by the proponents of some of these models have shown that, even in relatively straightforward saliency experiments, where subjects are 1) shown images that they have already seen and 2) simply asked to point out salient regions, people do not agree on more than about 50% of the salient locations [22]. This seems to rule out all saliency principles that, like those discussed so far, are exclusively based on universal laws that do not depend on some form of 1) context (e.g. a higher-level goal that drives saliency) or 2) interpretation of image content.

A final formulation, which addresses this problem, is directly grounded in the recognition problem, equating saliency to discriminant power: it defines salient locations as those that most differentiate the visual class of interest from all others [10, 11, 15]. Under this formulation, saliency requires a preliminary stage of feature selection, based on some suitable measure of how discriminant each feature is.
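One plausible instance of such a "measure of how discriminant each feature is" is the mutual information between a feature's (quantized) responses and the class label, which is high exactly when observing the feature reduces uncertainty about the class. The sketch below illustrates this idea with a simple plug-in estimator; the quantization scheme and the binary-label setting are our own illustrative assumptions, not necessarily the measure used in [10, 11, 15].

```python
import numpy as np

def feature_discriminant_power(responses, labels, bins=16):
    """Estimate I(F; Y): mutual information between quantized feature
    responses F and binary class labels Y, as a discriminant score."""
    # Quantize responses into `bins` levels using equal-width bin edges.
    edges = np.histogram_bin_edges(responses, bins=bins)
    q = np.clip(np.digitize(responses, edges[1:-1]), 0, bins - 1)
    mi = 0.0
    for y in (0, 1):
        p_y = np.mean(labels == y)
        for b in range(bins):
            p_joint = np.mean((labels == y) & (q == b))
            if p_joint > 0:
                p_b = np.mean(q == b)
                mi += p_joint * np.log2(p_joint / (p_y * p_b))
    return mi

# A feature whose responses shift with the class carries information
# about it; a class-independent feature does not.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 2000)
discriminant = labels + 0.3 * rng.standard_normal(2000)  # tracks the class
irrelevant = rng.standard_normal(2000)                   # ignores the class
print(feature_discriminant_power(discriminant, labels) >
      feature_discriminant_power(irrelevant, labels))
```

Ranking candidate features by such a score and keeping the top ones is what the preliminary feature-selection stage amounts to under this formulation; the key difference from the earlier principles is that the score depends on the image classes of interest, not on universal image properties alone.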