An Image Retrieval Based Solution for Correspondence Problem in Binocular Vision

Alberto Amato*, Vincenzo Piuri
DTI - Università degli Studi di Milano
Via Bramante 65, 26013 Crema (CR) Italy
{alberto.amato, vincenzo.piuri}@unimi.it

Vincenzo Di Lecce
DIASS - Politecnico di Bari
Via Alcide De Gasperi, 74100 Taranto (TA) Italy
v.dilecce@aeflab.net

Abstract— The aim of this paper is to propose a solution to the correspondence problem in multi-camera systems. In these systems, two or more cameras record the same scene from different viewpoints, which makes it possible to address the problem of occlusions in crowded scenes. In this work, an object-level motion detection algorithm is applied to the videos sampled by two cameras. The proposed approach does not require a calibration stage and does not introduce any constraint on the camera positions. Once the moving objects are detected, they are characterized using image retrieval techniques. The system was tested using two cameras. Object detection and tracking are primary tasks in automatic video streaming analysis. The results obtained, in terms of correct classification rate, are encouraging because they highlight the ability of the system to work even in the presence of crowded scenes.

Keywords: correspondence problem; visual feature based method; binocular vision; video surveillance system.

I. INTRODUCTION

Due to the rapid growth of ICT, digital devices able to capture good quality video are now widespread. This allows the design and implementation of video acquisition systems that use many cameras, which are often employed in video surveillance applications. One of the biggest challenges in this research field is to develop a system that is able to automatically understand the sampled videos and raise an alarm in real time when a dangerous situation is detected.
To address this problem, the literature contains various studies on automatic video streaming analysis [1, 2, 3, 4]. These systems propose different approaches to semantic video analysis, but they share the characteristic of working in a narrow domain. For example, the method proposed in [4] works on football match videos, while that proposed in [3] works on traffic monitoring. Another characteristic shared by these systems is that they analyze the trajectories of a few relevant points (typically the barycentres) that represent the whole moving objects. For example, [5] proposes a system able to recognize 47 actions performed by different individuals by analyzing the positions of the hand barycentres. All these systems perform well, but they fail to detect motion when the view direction is parallel to the plane in which the action is performed. In other words, since they rely on a single camera viewpoint, they have a two-dimensional view of the world and lose depth information.

To address this problem, binocular and multi-camera approaches have been proposed in the literature [6, 7, 8]. These systems use two (or more) cameras viewing the same scene from different viewpoints. With this approach it is possible to perceive depth, which solves the previous problem. Furthermore, it is possible to solve, or at least reduce, the problem of occlusions in crowded scenes. On the other hand, this approach introduces the correspondence problem: given a pixel or an object in the image sampled by one camera, where is it in the frame sampled by the other camera? Various authors propose solutions to this problem that introduce some constraints. For example, a widely applied constraint is that the acquisition system produces stereo images [9, 10]. In this case, the images are taken by two cameras with parallel optic axes that are displaced perpendicular to those axes.
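As an illustration of the correspondence problem under the stereo constraint just described, the following minimal sketch searches along a single image row for the best match of a small window, using the sum of absolute differences (SAD) as the matching cost. It is not the method proposed in this paper. The function name `match_along_scanline`, the window half-size `half`, the search range `max_disp`, and the sign convention (a point at column `col` in the left image appears at column `col - d` in the right image, with `d >= 0`) are all assumptions made for the example.

```python
import numpy as np

def match_along_scanline(left, right, row, col, half=2, max_disp=16):
    """Return the disparity d >= 0 that minimises the SAD cost between the
    window centred at (row, col) in `left` and the window centred at
    (row, col - d) in `right`. Hypothetical helper, for illustration only."""
    ref = left[row - half:row + half + 1,
               col - half:col + half + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:  # candidate window would leave the image
            break
        cand = right[row - half:row + half + 1,
                     c - half:c + half + 1].astype(np.float64)
        cost = np.abs(ref - cand).sum()
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic test pair: the right image is the left image shifted
# horizontally by a known amount (assumed value, for the example only).
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(20, 40)).astype(np.uint8)
true_disp = 5
right = np.roll(left, -true_disp, axis=1)

estimated = match_along_scanline(left, right, row=10, col=20)
print(estimated)
```

On this synthetic pair the search should recover the known shift, because the matching window is an exact copy. Real image pairs are harder: noise, illumination changes, and textureless regions can all produce ambiguous or wrong minima.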
Using stereo pairs of images it is possible to assume that: stereo pairs are epipolar and the epipolar lines are horizontally aligned, i.e., the corresponding points in the two images lie along the same scan lines; the objects have continuity in depth; there is a one-to-one mapping of an image element between the two images (uniqueness); and there is an ordering of the matchable points [11]. From a geometric point of view, the correspondence problem can be successfully solved, and the methods proposed in the literature achieve excellent performance when applied to synthetic images. When these methods are applied to real-world images, the main issues to be solved are: noise and illumination changes, as a result of which the feature values of corresponding points in the two images can differ; the lack of unique matching features in large regions; and occlusions and half occlusions. The most widely used methods to solve this problem are: area based [10], feature based [12], Bayesian networks [13], neural networks [11, 14], etc. The video surveillance systems using binocular (or multi-camera) vision use two different approaches to solve this
978-1-4244-8075-3/10/$26.00 ©2010 IEEE