UNSUPERVISED UNCERTAINTY ANALYSIS FOR VIDEO SALIENCY DETECTION

Tariq Alshawi, Zhiling Long, and Ghassan AlRegib
Center for Signal and Information Processing (CSIP)
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
{talshawi, zhiling.long, alregib}@gatech.edu

ABSTRACT

This paper presents a new unsupervised uncertainty estimation method for video saliency detection using spatial cues of the saliency map. The algorithm exploits the relationship between a pixel and its spatial neighbours in saliency maps to estimate the uncertainty of the saliency detected at the pixel location. Unlike supervised methods that fit an uncertainty model to available training data, the proposed algorithm is based on a simple observation about the eye-fixation map, which is largely shaped by human visual attention mechanisms; the proposed method is therefore data independent. The performance of the proposed algorithm is evaluated on the challenging CRCNS video dataset and quantified using Receiver Operating Characteristics (ROC). The results are promising and could lead to robust uncertainty estimation based on eye-fixation neighbourhood modeling.

Index Terms— unsupervised, uncertainty analysis, video, saliency detection, attention framework, spatial correlation

I. INTRODUCTION

Computational visual saliency detection methods attempt to predict the regions or objects in a given scene that are likely to attract human attention. In bottom-up spatio-temporal saliency detection approaches, various low-level change indicators, such as intensity, color, and motion, are used as features to assess the discrepancy between a pixel and its neighbours and hence infer its saliency.
The output of these algorithms, typically called a saliency map, can be used as a pre-processing stage to improve both the performance and the efficiency of various image and video processing applications, including compression, segmentation, and classification. The performance of saliency detection algorithms is typically evaluated by measuring detection results on image and video datasets. These datasets, such as CRCNS [1], MSRA [2], MIT [3], and SAVAM [4], contain different types of saliency groundtruth, e.g., eye-fixation records, bounding boxes for salient objects, and precise salient object segmentations. These variations in the definition of the groundtruth make comparing algorithms evaluated on different datasets very difficult and often lead to inconsistent conclusions. It is common for one algorithm to outperform another on one dataset while the situation is reversed on a different dataset, which can be attributed to variation in video sequences and dataset bias. For example, conservative saliency detection algorithms tend to achieve better scores on eye-fixation datasets than on bounding-box datasets, because negative samples overwhelmingly dominate the former compared to the latter. Additionally, conclusions about the relative performance of different algorithms tested on the same dataset may not be conclusive, because an algorithm can be designed to fit a dataset closely regardless of the underlying physical phenomenon. Factors such as groundtruth generation and evaluation methodology play a large role in the final performance score. Moreover, important questions remain open: how much of an increment in a performance metric is significant? How large must the gap between two algorithms be before conclusions can be drawn about their performance? These questions cannot be answered without a reliable measure of uncertainty.
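To make the ROC-style evaluation mentioned above concrete, the following sketch sweeps a threshold over a saliency map and scores each binarization against a binary eye-fixation groundtruth map. This is an illustrative implementation only, not the evaluation code used in the paper; the function and variable names are our own.

```python
import numpy as np

def roc_curve_points(saliency, fixations, num_thresholds=50):
    """Sweep thresholds over a saliency map and return (FPR, TPR) pairs
    measured against a binary eye-fixation groundtruth map."""
    # Normalize saliency to [0, 1] so thresholds are comparable across maps
    rng = saliency.max() - saliency.min()
    saliency = (saliency - saliency.min()) / rng if rng > 0 else np.zeros_like(saliency, dtype=float)
    fixations = fixations.astype(bool)
    points = []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        detected = saliency >= t
        tp = np.logical_and(detected, fixations).sum()
        fp = np.logical_and(detected, ~fixations).sum()
        tpr = tp / max(fixations.sum(), 1)   # true positive rate
        fpr = fp / max((~fixations).sum(), 1)  # false positive rate
        points.append((fpr, tpr))
    return points

# Toy example: a saliency map that peaks at the single fixated pixel
sal = np.array([[0.1, 0.2], [0.9, 0.3]])
gt = np.array([[0, 0], [1, 0]])
pts = roc_curve_points(sal, gt, num_thresholds=5)
```

At threshold 0 every pixel is detected, giving the (1, 1) corner of the ROC curve; as the threshold rises, the false positive rate falls, and the area under the resulting curve summarizes detection quality.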
To address such issues, which are essential to the saliency detection problem, thorough research into the uncertainty inherent in the detection results has to be conducted. The authors in [5] proposed a supervised method to estimate the uncertainty associated with the saliency detected at a video pixel, given two simple features calculated from the spatial neighbourhood of the target pixel. The two features are the distance from the center of mass of the saliency map and the connectedness of the target pixel. The coordinates of the center of mass of the saliency map, [x_c, y_c], are first calculated using the groundtruth saliency map. Then, the Euclidean distance, d, is used with a data-fitted probability of the pixel being salient given its distance from the center, i.e., p(s|d). Similarly, the connectedness feature, c, is calculated by counting the number of salient neighbours and estimating p(s|c). Finally, the uncertainty U of each pixel is calculated using the binary entropy of the likelihood estimates, p(s|d) and p(s|c). In this paper, we propose investigating the spatial cues available within the immediate neighbourhood of a pixel in a given frame for uncertainty estimation. More specifically, we propose constructing an uncertainty estimate by calculating a pixel's deviation from the spatial cues of its direct neighborhood, which adapts
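The two-feature supervised scheme of [5] summarized above can be sketched as follows. In the actual method, the likelihoods p(s|d) and p(s|c) are fitted to training data; here they are stood in by caller-supplied placeholder functions, so this is a structural sketch rather than a reproduction of [5].

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def uncertainty_map(saliency_binary, p_s_given_d, p_s_given_c):
    """Per-pixel uncertainty in the spirit of [5]: distance d to the
    saliency map's center of mass and connectedness c (number of salient
    8-neighbours), each mapped through a likelihood and scored by the
    binary entropy. The likelihood functions are illustrative stand-ins
    for the data-fitted estimates used in the original method."""
    h, w = saliency_binary.shape
    ys, xs = np.nonzero(saliency_binary)
    if ys.size == 0:
        return np.zeros((h, w))  # no salient pixels: nothing to score
    yc, xc = ys.mean(), xs.mean()  # center of mass [x_c, y_c]
    U = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            d = np.hypot(y - yc, x - xc)
            # Connectedness: salient pixels in the 8-neighbourhood
            win = saliency_binary[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            c = win.sum() - saliency_binary[y, x]
            U[y, x] = binary_entropy(p_s_given_d(d)) + binary_entropy(p_s_given_c(c))
    return U

# Placeholder likelihoods (hypothetical, for illustration only)
p_d = lambda d: float(np.exp(-d))  # saliency less likely far from center
p_c = lambda c: c / 8.0            # saliency more likely with salient neighbours
sal = np.array([[0, 1], [1, 1]])
U = uncertainty_map(sal, p_d, p_c)
```

Entropy peaks when a likelihood estimate is near 0.5, so pixels whose features give ambiguous evidence receive the highest uncertainty, which matches the intuition behind using binary entropy in [5].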