SALIENT OBJECT DETECTION IN IMAGE SEQUENCES VIA SPATIAL-TEMPORAL CUE

Chuang Gan 1,2, Zengchang Qin 2, Jia Xu 1, Tao Wan 2,3

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
2 Intelligent Computing and Machine Learning Lab, Beihang University, Beijing, China
3 School of Biological Science and Medical Engineering, Beihang University, Beijing, China

ABSTRACT

There are large amounts of videos available online with the increasing popularity of the Internet worldwide. Accurately searching and categorizing these videos is a non-trivial task due to the variety of content they contain. Visual saliency models provide a possible way to address this problem by locating and extracting salient objects from the background within images, which can reduce the search effort and assist object detection and recognition tasks. Compared to static images, videos contain motion information, which is more likely to attract human attention. In this paper, we present a new region contrast based saliency detection model using spatial-temporal cues (RCST). We extend the general static-image saliency computation model to handle videos by incorporating both spatial and temporal features. Four general saliency principles and three methods are introduced to evaluate the saliency detection performance of the RCST based method, in terms of both qualitative and quantitative evaluations, using a publicly available video segmentation database. The experimental results demonstrate that our algorithm outperforms existing state-of-the-art methods.

Index Terms— object detection, saliency, spatial-temporal cue.

1. INTRODUCTION

Rapid development of computer infrastructure, including faster processors, less expensive but larger-capacity storage devices, and easily accessible Internet, has brought a vast number of videos online over the past decades. It is a great challenge to search or categorize these videos.
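To make the region-contrast idea behind the abstract concrete, the following is a minimal sketch of global region-contrast saliency: a region is scored as salient when its mean color differs strongly from the other regions, weighted by their sizes and spatial proximity. This is an illustrative simplification, not the paper's exact RCST model; the `sigma` weight and the use of mean region color are assumptions made for the sketch.

```python
import numpy as np

def region_contrast_saliency(labels, image, sigma=0.4):
    """Score each segmented region by its color contrast to all other
    regions, weighted by region size and spatial closeness.
    (Illustrative sketch; not the paper's exact RCST formulation.)"""
    region_ids = np.unique(labels)
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    means, centers, sizes = [], [], []
    for r in region_ids:
        mask = labels == r
        means.append(image[mask].mean(axis=0))             # mean color
        centers.append([ys[mask].mean() / h,
                        xs[mask].mean() / w])              # normalized centroid
        sizes.append(mask.sum() / (h * w))                 # area fraction
    means, centers, sizes = map(np.asarray, (means, centers, sizes))
    sal = np.zeros(len(region_ids))
    for i in range(len(region_ids)):
        for j in range(len(region_ids)):
            if i == j:
                continue
            d_color = np.linalg.norm(means[i] - means[j])
            d_space = np.linalg.norm(centers[i] - centers[j])
            # Nearby, large, differently-colored regions contribute most.
            sal[i] += sizes[j] * d_color * np.exp(-d_space / sigma)
    if sal.max() > 0:
        sal /= sal.max()
    return {int(r): float(s) for r, s in zip(region_ids, sal)}
```

Note that the size weighting naturally favors a small distinct object over a large uniform background, which matches the intuition that salient objects are compact and rare.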
Salient areas in an image or a video are generally regarded as the focus of human attention. Visual saliency models can help us locate salient objects against the background for the purpose of effective search. Saliency detection can also help audiences locate the most attractive and important content in extensive collections of images and videos.

[Footnote] This work was supported in part by the National Basic Research Program of China Grants 2011CBA00300 and 2011CBA00301, and the National Natural Science Foundation of China Grants 61033001 and 61061130540. Emails: zcqin@buaa.edu.cn, tao.wan.wan@gmail.com

[Figure] Fig. 1. Three examples illustrating the saliency detection results compared to ground truth. From left to right: (a) input frames, (b) graph based segmentation, (c) salient objects detected by our method, and (d) ground truth.

Visual saliency was originally a task of predicting eye fixations on images, and has recently been extended to locating a region containing the salient object. There are various applications, including salient object detection and recognition [1, 2], image compression [3], image cropping [4], image retrieval [5], photo collage [6, 7], and so on. Studies of the human visual system suggest that saliency is related to the uniqueness, rarity, and surprise of a scene, characterized by primitive features such as color, texture, and shape. For example, Fig. 1 illustrates the procedure of graph segmentation based saliency detection for an input frame. Recently, many efforts have been made to design algorithms that compute saliency for static images [8, 9, 10, 11, 12, 13]. However, little of the literature considers extending saliency computation models to video-related tasks.

In this paper, we propose a novel saliency model combining both color and motion features to create a saliency map. Different from previous work such as [12, 14], our method can be summarized by the following four steps:

1. Initial graph based image segmentation.
2. Local contrast based region refinement.
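The core idea of combining a color (spatial) cue with a motion (temporal) cue can be sketched as a simple per-pixel fusion. The sketch below uses deviation from the global mean color as the spatial cue and frame-difference magnitude as a crude stand-in for optical flow; the linear fusion weight `alpha` and the pixel-level (rather than region-level) formulation are assumptions for illustration, not the paper's actual spatial-temporal model.

```python
import numpy as np

def spatiotemporal_saliency(prev_frame, frame, alpha=0.5):
    """Fuse a spatial cue (per-pixel distance from the frame's mean color)
    with a temporal cue (frame-difference magnitude, a crude stand-in for
    optical flow). Illustrative sketch only; the RCST model in the paper
    operates on segmented regions rather than raw pixels."""
    # Spatial cue: pixels far from the global mean color stand out.
    spatial = np.linalg.norm(frame - frame.mean(axis=(0, 1)), axis=-1)
    # Temporal cue: pixels that changed between consecutive frames.
    temporal = np.linalg.norm(frame - prev_frame, axis=-1)

    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    # Linear fusion; alpha trades off spatial vs. temporal contrast.
    return alpha * norm(spatial) + (1 - alpha) * norm(temporal)
```

A moving bright patch scores highly on both cues, so the fused map highlights it regardless of the chosen `alpha`; a static distinct object would rely on the spatial term alone.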