USING MONOCULAR DEPTH CUES FOR MODELING STEREOSCOPIC 3D SALIENCY
Iana Iatsun, Mohamed-Chaker Larabi, Christine Fernandez-Maloigne
Departement XLIM-SIC UMR CNRS 7252,
SP2MI, Teleport 2, Bvd Marie et Pierre Curie, BP 30179, 86962, Futuroscope, France
ABSTRACT
Saliency is one of the most important features in human visual perception. It is widely used nowadays for perceptually optimizing image processing algorithms. Several models have been proposed for 2D images, but only a few attempts exist for 3D ones. In this paper, we propose a stereoscopic 3D saliency model relying on 2D saliency features jointly with depth obtained from monocular cues. On the one hand, the use of 2D saliency features is justified psychophysically by the similarity observed between 2D and 3D attention maps. On the other hand, 3D perception is significantly based on monocular cues. The validation of our model using state-of-the-art procedures, including Kullback-Leibler divergence (KLD), area under the curve (AUC) and correlation coefficient (CC), in comparison with attention maps showed very good performance.
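To make these evaluation procedures concrete, the sketch below computes the three metrics between a predicted saliency map and a ground-truth attention (fixation) map. It is a minimal illustration assuming NumPy arrays of identical size; the function names, the epsilon regularization and the Mann-Whitney formulation of the AUC are implementation choices made here, not the exact code behind the reported results.

import numpy as np

def kld(attention_map, saliency_map, eps=1e-12):
    # Kullback-Leibler divergence between the ground-truth attention map (p)
    # and the predicted saliency map (q), both normalized to sum to one.
    p = attention_map / (attention_map.sum() + eps)
    q = saliency_map / (saliency_map.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def auc(saliency_map, fixation_map):
    # ROC area under the curve via the Mann-Whitney U statistic: the
    # probability that a fixated pixel receives a higher saliency value
    # than a non-fixated one.
    s = saliency_map.ravel()
    fixated = fixation_map.ravel() > 0
    pos, neg = s[fixated], s[~fixated]
    ranks = np.empty(s.size)
    ranks[np.argsort(np.concatenate([pos, neg]))] = np.arange(1, s.size + 1)
    return float((ranks[:pos.size].sum() - pos.size * (pos.size + 1) / 2)
                 / (pos.size * neg.size))

def cc(saliency_map, attention_map):
    # Pearson linear correlation coefficient between the two maps.
    return float(np.corrcoef(saliency_map.ravel(), attention_map.ravel())[0, 1])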
Index Terms— Saliency, monocular depth cues, stereoscopic
3D, visual attention.
1. INTRODUCTION
With the widespread adoption of three-dimensional (3D) imaging in television and movie production, advertisement and gaming, the need for new algorithms for compression, transmission and display has grown significantly. Stereoscopic 3D (S3D) content delivery is more possible than ever, and the viewer experience has become a very hot topic. Therefore, exploring the capabilities of the human visual system (HVS) is undoubtedly an important step towards increasing visual comfort.
The HVS allows us to perceive the world in three dimensions and to evaluate the distance to objects thanks not only to binocular cues but also to monocular ones [1]. On the one hand, binocular cues, i.e. stereopsis and vergence, are achieved by using the information coming from the left and right eyes. On the other hand, 3D perception relies mostly on monocular cues such as relative size, texture, occlusion, shadow and perspective. These are partly linked to a priori knowledge and cognitive information.
Among the prominent characteristics of the HVS, visual attention plays an important role in exploring and understanding our environment. This process can be either top-down (i.e. task-driven) or bottom-up (i.e. stimuli-driven). Several works from the literature have focused on mimicking the HVS in order to produce saliency maps that predict the areas of an image attracting attention. The pioneering work of Itti and Koch is based on the construction of conspicuity maps from low-level criteria such as intensity, color and orientation [2]. Several other works, mostly belonging to the bottom-up family, have tried to extend and improve this model [3, 4, 5].
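For readers unfamiliar with this family of models, the following sketch illustrates the center-surround principle behind conspicuity maps, on the intensity channel only. It is a simplified illustration, assuming SciPy Gaussian filters in place of the dyadic pyramids of the original model, and it omits the color, orientation and iterative normalization stages.

import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_conspicuity(image, center_sigmas=(2, 4), surround_sigmas=(8, 16)):
    # Center-surround contrast: fine-scale (center) minus coarse-scale
    # (surround) Gaussian-smoothed versions of the intensity channel,
    # rectified, crudely normalized and accumulated across scale pairs.
    intensity = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    conspicuity = np.zeros(intensity.shape)
    for sc in center_sigmas:
        center = gaussian_filter(intensity, sc)
        for ss in surround_sigmas:
            surround = gaussian_filter(intensity, ss)
            contrast = np.abs(center - surround)
            conspicuity += contrast / (contrast.max() + 1e-12)
    return conspicuity / (conspicuity.max() + 1e-12)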
While these models are efficient and close to human attention, their complexity is often seen as an obstacle for real-time applications. Computational approaches for constructing saliency maps have emerged in recent years. One can cite the model of Achanta et al. [6], based on the fact that salient objects are those which are prominent beyond the local mean of their neighborhood. Another, more recent, computational method based on interest point detection was proposed by Nauge et al. [7]. It showed a correlation between interest points (IP) and gaze points.
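As an illustration of the idea behind [6], the sketch below scores a pixel by the distance of its color from the mean color of its local neighborhood; the neighborhood size, the Lab color space and the box filter are assumptions made here for brevity rather than the exact formulation of the original method.

import numpy as np
import cv2

def local_contrast_saliency(bgr_image, neighborhood=51):
    # A pixel is salient when its Lab color lies far from the mean color
    # of its surrounding neighborhood (one box filter per channel).
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    local_mean = cv2.blur(lab, (neighborhood, neighborhood))
    saliency = np.linalg.norm(lab - local_mean, axis=2)
    return saliency / (saliency.max() + 1e-12)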
Although the literature is rich in terms of saliency studies, 3D saliency has been rarely tackled. It has been reported that there is a difference between 3D and 2D visual attention due to depth information. Several works have been carried out to study the effect of depth on saliency and to analyze the difference between 2D and 3D attention. For instance, Jansen et al. explored the effect of depth on human perception in a free-viewing task using 2D and 3D still images [8]. This study revealed a time-dependent effect of depth and differences in eye movements between 2D and 3D. Interestingly, they observed that viewers fixated earlier on foreground objects even in 2D, using monocular cues. In the same vein, Huynh et al. performed a similar study on 2D and 3D video and reached the same conclusion [9]. Despite these results, the authors highlighted some similarities between 2D and 3D attention maps, which motivated new work on computational models of visual attention for 3D content relying on 2D salient features in conjunction with depth cues. In this context, a bottom-up attentional model for 3D video was proposed by Zhang et al. by integrating depth, luminance, color, motion and orientation [10]. Unfortunately, this work lacks the implementation details that would help readers understand the fusion of left and right saliency, and no quantitative evaluation was provided. Recently, Wang et al. suggested creating a separate saliency map from depth information and fusing it with 2D salient features [11]. The authors reported quantitative results on an arbitrarily chosen view, and the main limitation is the use of depth contrast only. Several similar works using depth information can be found in the literature [12, 13, 14]. However, they assume the availability of depth information or the possibility of computing it through disparity.
As stated before, 3D perception relies heavily on monocular cues. One can assume that if it is possible to extract depth from only one view, this information will be of great importance for the prediction of 3D saliency. Following this idea, a few attempts at extracting depth from 2D images can be found in the field. Among them, the method suggested by Saxena et al. is based on an a priori learning process, which unfortunately lacks genericity [15]. A monocular depth estimation based on low-level vision features, without any a priori information about the scene, was proposed by Palou et al. It is based on the detection of T-junction points, which identify occlusions. This approach outputs segmented areas ordered by depth.
In this paper, we propose to exploit monocular depth in addition to 2D salient features in order to develop a 3D saliency model. The idea lies in the fact that there are similarities between 2D and 3D attention behavior. Moreover, depth can be accurately predicted from a single 2D view. Therefore, our model performs a fusion between 2D salient features and depth obtained from monocular cues.
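Purely as an illustration of such a fusion, and not the rule derived in the remainder of the paper, one could reweight a 2D saliency map with a monocular depth map, assuming larger depth values denote closer regions, which tend to be fixated earlier; the weight alpha below is a hypothetical parameter.

import numpy as np

def fuse_saliency_depth(saliency_2d, depth, alpha=0.6):
    # Illustrative weighted fusion: the depth map modulates the 2D saliency
    # so that closer (larger depth value) regions are emphasized.
    s = (saliency_2d - saliency_2d.min()) / (np.ptp(saliency_2d) + 1e-12)
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-12)
    return alpha * s + (1 - alpha) * s * d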