Unsupervised Semantic Object Segmentation of Stereoscopic Video Sequences

Anastasios D. Doulamis, Nikolaos D. Doulamis, Klimis S. Ntalianis and Stefanos D. Kollias
Electrical and Computer Engineering Department
National Technical University of Athens
Zografou 15773, Athens, Greece
E-mail: adoulam@image.ntua.gr

Abstract

In this paper, we present an efficient technique for unsupervised segmentation of semantically meaningful objects in stereoscopic video sequences. The technique extracts semantic objects by exploiting the additional information that a stereoscopic pair of frames provides. Each pair is analyzed, and the disparity field, occluded areas and depth map are estimated. The core of the segmentation is a powerful, low-complexity multiresolution implementation of the Recursive Shortest Spanning Tree (RSST) algorithm, applied to the stereo pair of images. Color segments are then fused, with the depth segments acting as constraints. Finally, experimental results are presented that demonstrate the high quality of semantic object segmentation this technique achieves.

1. Introduction

Work carried out in the literature has led to the well-known standards MPEG-1, MPEG-2, H.261 and H.263, which take block-based approaches to video analysis and coding. However, as multimedia applications and content-based interactivity have grown greatly in popularity over the last decade, new standards such as MPEG-4 and MPEG-7 have recently emerged. The MPEG-4 standard revolutionizes video and image coding and representation by introducing the concept of video object planes (VOPs). A VOP describes a semantically meaningful object in a frame, and many recent efforts have been directed at extracting such semantic visual information. Although humans can identify semantic entities effortlessly, this remains a fundamental research problem in the image analysis community.
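The RSST-based merging at the heart of the pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the paper's multiresolution, depth-constrained implementation: the function name, the use of scalar intensities instead of color, and the simple greedy merge rule are illustrative assumptions. The core idea — start from one region per pixel and repeatedly merge the pair of adjacent regions joined by the cheapest link — is what the sketch shows.

```python
# Hypothetical RSST-style region merging sketch (assumed details, not the
# authors' exact algorithm): each pixel starts as its own region; the pair of
# 4-adjacent regions with the smallest mean-intensity difference is merged
# repeatedly until a target region count is reached.
import heapq

def rsst_segment(image, n_regions):
    """image: 2-D list of scalar intensities; returns a per-pixel label map."""
    h, w = len(image), len(image[0])

    # Union-find over pixel indices, with path halving.
    parent = list(range(h * w))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Region statistics: intensity sum and pixel count per root.
    total = [float(image[i // w][i % w]) for i in range(h * w)]
    count = [1] * (h * w)

    # Heap of links keyed by |mean difference| between adjacent regions.
    heap = []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:
                heapq.heappush(heap, (abs(image[y][x] - image[y][x + 1]), i, i + 1))
            if y + 1 < h:
                heapq.heappush(heap, (abs(image[y][x] - image[y + 1][x]), i, i + w))

    regions = h * w
    while regions > n_regions and heap:
        cost, a, b = heapq.heappop(heap)
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # link is internal to a region after earlier merges
        # Lazy re-evaluation: if the stored cost is stale, re-push the current one.
        cur = abs(total[ra] / count[ra] - total[rb] / count[rb])
        if cur > cost + 1e-9:
            heapq.heappush(heap, (cur, ra, rb))
            continue
        parent[rb] = ra
        total[ra] += total[rb]
        count[ra] += count[rb]
        regions -= 1

    return [[find(y * w + x) for x in range(w)] for y in range(h)]
```

In the paper's setting, the same merge loop would run on color distances, with depth segments constraining which merges are allowed; this sketch omits both refinements for brevity.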
On the other hand, the use of three-dimensional (3-D) video provides a very efficient visual representation and a spectacular viewing experience. Virtual reality applications will attract much more attention in the coming decades, and the objects combined to produce virtual worlds will be real-life objects rather than animations or graphics. Additionally, as stereoscopic video introduces new methods of handling and manipulating video objects, it may well prevail over conventional 2-D representations. These methods basically exploit depth information, which is provided by the perspective projection of 3-D points onto two 2-D image planes. These two image planes are defined by the positions of the two cameras of the stereoscopic system.

Several techniques and algorithms have been proposed for video segmentation in the past [1], [2], [3], each with its particular features and applications. In general, segmentation can be categorized into three main classes: segmentation based on color, on motion, and on depth. The methods of the third class are especially easy to apply to 3-D sequences. The technique in [4] uses a 3-D watershed transformation to segment and track objects. The method in [5] is based on motion segmentation, where a dense motion field is estimated and the scene is segmented according to motion information. Other techniques use gradient information from parts of an image to locate deformable object boundaries [6]. In addition to these methods, homogeneity is the cornerstone of region-based approaches [7]. Although the aforementioned techniques work well in situations where the data are simple and the models fit well, they lack generality and the results obtained are far from perfect. The problem of semantic segmentation depends on several factors: a VOP can contain multiple colors, sometimes very similar