An Image Retrieval Based Solution for
Correspondence Problem in Binocular Vision
Alberto Amato*, Vincenzo Piuri
DTI - Università degli Studi di Milano
Via Bramante 65, 26013 Crema (CR)
Italy
{alberto.amato, vincenzo.piuri}@unimi.it
Vincenzo Di Lecce
DIASS - Politecnico di Bari
Via Alcide De Gasperi, 74100 Taranto (TA)
Italy
v.dilecce@aeflab.net
Abstract— The aim of this paper is to propose a solution to the
correspondence problem in multi-camera systems. In these
systems, two or more cameras record the same
scene from different viewpoints, which makes it possible to
deal with occlusions in crowded scenes.
In this work an object-level motion detection algorithm is
applied to the videos captured by two cameras.
The proposed approach does not require a calibration stage
and does not impose any constraints on the camera
positions. Once the moving objects are detected, they are
characterized using image retrieval techniques.
The system was tested using two cameras. Object detection
and tracking are primary tasks in automatic video stream
analysis. The results obtained, in terms of correct classification
rate, appear encouraging because they highlight the ability
of the system to work even in the presence of crowded scenes.
Keywords: correspondence problem; visual feature based
method; binocular vision; video surveillance system.
I. INTRODUCTION
Due to the rapid growth of ICT, digital devices able to
capture good quality video are now widespread. This
allows the design and
implementation of video acquisition systems that use many
cameras, often in video surveillance applications.
One of the biggest challenges in this research field is to
develop a system that is able to automatically understand the
captured videos and raise an alarm in real time when a dangerous
situation is detected. To address this problem, various
studies on automatic video stream analysis have appeared in
the literature [1, 2, 3, 4]. These systems propose
different approaches to semantic video analysis, but they
share the characteristic of working in a narrow domain.
For example, the method proposed in [4] works on football
match videos, while that proposed in [3] works on traffic
monitoring.
Another characteristic shared by these systems is that
they work by analyzing the trajectories of some
relevant points (typically the barycentres) representing the
whole moving objects. For example, [5] proposes a system able to
recognize 47 actions performed by different individuals
by analyzing the position of the hand barycentres.
All these systems perform well, but they fail to
detect motion when the view direction is parallel to the
plane in which the action is performed. In other words,
since they rely on a single camera viewpoint, they have a
two-dimensional view of the world and lose depth information.
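The loss of depth with a single viewpoint can be illustrated with an ideal pinhole camera, a model assumed here purely for illustration (the function name `project` and the sample coordinates are hypothetical and not taken from the systems cited above). Two points lying on the same viewing ray project onto the same image position, so a displacement along that ray produces no visible motion in the image.

```python
def project(point, focal_length=1.0):
    """Project a 3-D point (X, Y, Z), with Z > 0, onto the image plane
    of an ideal pinhole camera placed at the origin and looking along Z."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Hypothetical object position before and after moving along the viewing ray.
near = (0.5, 0.25, 2.0)
far = (1.0, 0.5, 4.0)  # same ray through the camera centre, twice as far

# Both positions map to the same image coordinates: the motion is not observable.
same_image_point = project(near) == project(far)
print(project(near), project(far), same_image_point)
```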
To address this problem, binocular
and multi-camera approaches have been proposed in the literature [6, 7, 8].
These systems use two (or more) cameras viewing
the same scene from different viewpoints. With this
approach it is possible to recover depth information, solving
the previous problem. Furthermore, it is possible to solve, or
at least reduce, the problem of occlusions in crowded scenes.
On the other hand, this approach introduces the
correspondence problem: given a pixel or an object
in the image captured by one camera, where is it in the
frame captured by the other camera?
Various authors in the literature propose solutions to this
problem that introduce some constraints. For example, a widely
applied constraint is that the acquisition system produces
stereo images [9, 10]. In this case, the images are taken by
two cameras with parallel optical axes, displaced
perpendicularly to those axes.
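For this parallel-axis configuration, the standard pinhole stereo relation links the depth of a point to the horizontal shift between its two projections. The symbols used below are notation assumed here for illustration and do not appear in the original text: focal length $f$ (in pixels), baseline $B$ (distance between the camera centres), horizontal image coordinates $x_l$ and $x_r$ of the same scene point in the left and right images, disparity $d$, and depth $Z$.

```latex
d = x_l - x_r, \qquad Z = \frac{f\,B}{d}
```

Under these assumptions, nearby points show a large disparity and distant points a small one, which is why locating the matching point in the second image is the key step in recovering depth.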
Using stereo pairs of images it is possible to assume that:
the stereo pairs are epipolar and the epipolar lines are
horizontally aligned, i.e., the corresponding points in the
two images lie along the same scan lines; the objects have
continuity in depth; there is a one-to-one mapping of an
image element between the two images (uniqueness); and
there is an ordering of the matchable points [11].
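The scan-line assumption reduces the search for a corresponding point to a one-dimensional problem. The following is a minimal sketch of that idea, not the method proposed in this paper: it assumes grey-level images stored as lists of rows, a simple window cost (sum of absolute differences), and hypothetical names (`sad`, `match_on_scanline`, `half`, `max_disparity`). The uniqueness and ordering assumptions are not enforced here.

```python
def sad(left, right, row, col_l, col_r, half):
    """Sum of absolute differences between two square windows of side
    2 * half + 1, centred at (row, col_l) in left and (row, col_r) in right."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[row + dy][col_l + dx] - right[row + dy][col_r + dx])
    return total


def match_on_scanline(left, right, row, col, half=1, max_disparity=4):
    """Return the disparity d >= 0 with the lowest window cost for the left
    pixel (row, col). Candidates are the right-image pixels (row, col - d),
    i.e. only positions on the same scan line are examined."""
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        col_r = col - d
        if col_r - half < 0:
            break  # the window would fall outside the right image
        cost = sad(left, right, row, col, col_r, half)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic pair: the left image is the right image shifted by 2 columns.
right_img = [[c * c + r for c in range(10)] for r in range(5)]
left_img = [[right_img[r][c - 2] if c >= 2 else 0 for c in range(10)]
            for r in range(5)]

disparity = match_on_scanline(left_img, right_img, row=2, col=6)
print(disparity)  # 2, the shift used to build the synthetic pair
```

On real images the minimum is often ambiguous or wrong (noise, uniform regions, occlusions), so a sketch like this only illustrates how the constraint narrows the search.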
From a geometric point of view, the correspondence
problem can be successfully solved, and the methods
proposed in the literature achieve excellent performance when
applied to synthetic images. When these methods are applied
to real-world images, the main issues to be solved are: noise
and illumination changes, as a result of which the feature
values for the corresponding points in the two images can
differ; lack of unique match features in large regions;
occlusions and half occlusions. The most widely used methods to
solve this problem are: area based [10], feature based [12],
Bayesian networks [13], neural networks [11, 14], etc.
The video surveillance systems using binocular (or multi-
camera) vision use two different approaches to solve this
978-1-4244-8075-3/10/$26.00 ©2010 IEEE