USING MONOCULAR DEPTH CUES FOR MODELING STEREOSCOPIC 3D SALIENCY
Iana Iatsun, Mohamed-Chaker Larabi, Christine Fernandez-Maloigne
Departement XLIM-SIC UMR CNRS 7252,
SP2MI, Teleport 2, Bvd Marie et Pierre Curie, BP 30179, 86962, Futuroscope, France
ABSTRACT
Saliency is one of the most important features in human visual perception. It is widely used nowadays for perceptually optimizing image processing algorithms. Several models have been proposed for 2D images, but only a few attempts exist for 3D ones. In this paper, we propose a stereoscopic 3D saliency model relying on 2D saliency features jointly with depth obtained from monocular cues. On the one hand, the use of 2D saliency features is justified psychophysically by the similarity observed between 2D and 3D attention maps. On the other hand, 3D perception is significantly based on monocular cues. The validation of our model using state-of-the-art procedures, including Kullback-Leibler divergence (KLD), area under the curve (AUC) and correlation coefficient (CC), in comparison with attention maps showed very good performance.
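To make these evaluation procedures concrete, the sketch below computes the three metrics between a predicted saliency map and a ground-truth attention (fixation) map. It is a minimal illustration assuming NumPy arrays of identical size; the function names, the epsilon regularization and the Mann-Whitney formulation of the AUC are implementation choices made here, not the exact code behind the reported results.

import numpy as np

def kld(attention_map, saliency_map, eps=1e-12):
    # Kullback-Leibler divergence between the ground-truth attention map (p)
    # and the predicted saliency map (q), both normalized to sum to one.
    p = attention_map / (attention_map.sum() + eps)
    q = saliency_map / (saliency_map.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def auc(saliency_map, fixation_map):
    # ROC area under the curve via the Mann-Whitney U statistic: the
    # probability that a fixated pixel receives a higher saliency value
    # than a non-fixated one.
    s = saliency_map.ravel()
    fixated = fixation_map.ravel() > 0
    pos, neg = s[fixated], s[~fixated]
    ranks = np.empty(s.size)
    ranks[np.argsort(np.concatenate([pos, neg]))] = np.arange(1, s.size + 1)
    return float((ranks[:pos.size].sum() - pos.size * (pos.size + 1) / 2)
                 / (pos.size * neg.size))

def cc(saliency_map, attention_map):
    # Pearson linear correlation coefficient between the two maps.
    return float(np.corrcoef(saliency_map.ravel(), attention_map.ravel())[0, 1])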
Index Terms— Saliency, monocular depth cues, stereoscopic
3D, visual attention.
1. INTRODUCTION
With the widespread adoption of three-dimensional (3D) imaging in television and movie production, advertisement and gaming, the need for new algorithms for compression, transmission and display has grown significantly. Stereoscopic 3D (S3D) content delivery is more possible than ever, and the viewer experience has become a very hot topic. Therefore, exploring the capabilities of the human visual system (HVS) is undoubtedly an important step towards increasing visual comfort.
The HVS allows us to perceive the world in three dimensions and to evaluate the distance to objects thanks not only to binocular cues but also to monocular ones [1]. On the one hand, binocular cues, i.e. stereopsis and vergence, are achieved by using the information coming from the left and right eyes. On the other hand, 3D perception relies mostly on monocular cues such as relative size, texture, occlusion, shadow and perspective. These are partly linked to a priori knowledge and cognitive information.
Among the prominent characteristics of the HVS, visual attention plays an important role in exploring and understanding our environment. This process can be either top-down (i.e. task-driven) or bottom-up (i.e. stimuli-driven). Several works from the literature have focused on mimicking the HVS in order to produce saliency maps that predict the areas of an image attracting attention. The pioneering work of Itti and Koch is based on the construction of conspicuity maps from low-level criteria such as intensity, color and orientation [2]. Several other works, mostly belonging to the bottom-up family, have tried to extend and improve this model [3, 4, 5].
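For readers unfamiliar with this family of models, the following sketch illustrates the center-surround principle behind conspicuity maps, on the intensity channel only. It is a simplified illustration, assuming SciPy Gaussian filters in place of the dyadic pyramids of the original model, and it omits the color, orientation and iterative normalization stages.

import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_conspicuity(image, center_sigmas=(2, 4), surround_sigmas=(8, 16)):
    # Center-surround contrast: fine-scale (center) minus coarse-scale
    # (surround) Gaussian-smoothed versions of the intensity channel,
    # rectified, crudely normalized and accumulated across scale pairs.
    intensity = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    conspicuity = np.zeros(intensity.shape)
    for sc in center_sigmas:
        center = gaussian_filter(intensity, sc)
        for ss in surround_sigmas:
            surround = gaussian_filter(intensity, ss)
            contrast = np.abs(center - surround)
            conspicuity += contrast / (contrast.max() + 1e-12)
    return conspicuity / (conspicuity.max() + 1e-12)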
While these models are efficient and close to human attention, their complexity is often seen as an obstacle for real-time applications. Computational approaches for constructing saliency maps have emerged in recent years. One can cite the model of Achanta et al. [6], based on the fact that salient objects are those which are prominent beyond the local mean of their neighborhood. Another, more recent, computational method based on interest point detection was proposed by Nauge et al. [7]. It showed a correlation between interest points (IP) and gaze points.
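As an illustration of the idea behind [6], the sketch below scores a pixel by the distance of its color from the mean color of its local neighborhood; the neighborhood size, the Lab color space and the box filter are assumptions made here for brevity rather than the exact formulation of the original method.

import numpy as np
import cv2

def local_contrast_saliency(bgr_image, neighborhood=51):
    # A pixel is salient when its Lab color lies far from the mean color
    # of its surrounding neighborhood (one box filter per channel).
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    local_mean = cv2.blur(lab, (neighborhood, neighborhood))
    saliency = np.linalg.norm(lab - local_mean, axis=2)
    return saliency / (saliency.max() + 1e-12)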
Although the literature is rich in terms of saliency studies, 3D saliency has been rarely tackled. It has been reported that there is a difference between 3D and 2D visual attention due to depth information. Several works have been carried out to study the effect of depth on saliency and to analyze the difference between 2D and 3D attention. For instance, Jansen et al. explored the effect of depth on human perception in a free-viewing task using 2D and 3D still images [8]. This study revealed a time-dependent effect of depth and differences in eye movements between 2D and 3D. Interestingly, they observed that viewers fixated earlier on foreground objects even in 2D, using monocular cues. In the same vein, Huynh et al. performed a similar study on 2D and 3D video and reached the same conclusion [9]. Despite these results, the authors highlighted some similarities between 2D and 3D attention maps, which motivated new work on computational models of visual attention for 3D content relying on 2D salient features in conjunction with depth cues. In this context, a bottom-up attentional model for 3D video was proposed by Zhang et al. by integrating depth, luminance, color, motion and orientation [10]. Unfortunately, this work lacks the implementation details that would help readers understand the fusion of left and right saliency, and no quantitative evaluation was provided. Recently, Wang et al. suggested creating a separate saliency map from depth information and fusing it with 2D salient features [11]. The authors reported quantitative results on an arbitrarily chosen view, and the main limitation is the use of depth contrast only. Several similar works using depth information can be found in the literature [12, 13, 14]. However, they assume the availability of depth information or the possibility of computing it through disparity.
As stated before, 3D perception relies heavily on monocular cues. One can assume that if it is possible to extract depth from only one view, this information will be of great importance for the prediction of 3D saliency. Following this idea, a few attempts at extracting depth from 2D images can be found in the field. Among them, the method suggested by Saxena et al. is based on an a priori learning process, which unfortunately lacks genericity [15]. A monocular depth estimation based on low-level vision features, without any a priori information about the scene, was proposed by Palou et al. It is based on the detection of T-junction points, which identify occlusions. This approach outputs segmented areas ordered by depth.
In this paper, we propose to exploit monocular depth in addition to 2D salient features in order to develop a 3D saliency model. The idea lies in the fact that there are similarities between 2D and 3D attention behavior. Moreover, depth can be accurately predicted from a single 2D view. Therefore, our model performs a fusion between 2D salient features and depth obtained from monocular cues.
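Purely as an illustration of such a fusion, and not the rule derived in the remainder of the paper, one could reweight a 2D saliency map with a monocular depth map, assuming larger depth values denote closer regions, which tend to be fixated earlier; the weight alpha below is a hypothetical parameter.

import numpy as np

def fuse_saliency_depth(saliency_2d, depth, alpha=0.6):
    # Illustrative weighted fusion: the depth map modulates the 2D saliency
    # so that closer (larger depth value) regions are emphasized.
    s = (saliency_2d - saliency_2d.min()) / (np.ptp(saliency_2d) + 1e-12)
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-12)
    return alpha * s + (1 - alpha) * s * d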