Unsupervised Tracking of Stereoscopic Video Objects Employing Neural Networks Retraining

Anastasios D. Doulamis, Klimis S. Ntalianis, Nikolaos D. Doulamis, Kostas Karpouzis and Stefanos D. Kollias
National Technical University of Athens
Electrical and Computer Engineering Department
9, Heroon Polytechniou str. Zografou 15773, Athens, Greece
E-mail: (adoulam, kntal, ndoulam)@image.ntua.gr

Abstract
A novel approach is presented in this paper for improving the performance of neural network classifiers in video object tracking applications, based on a retraining procedure at the user level. The procedure includes (a) a retraining algorithm for adapting the network weights to the current conditions, (b) extraction of semantically meaningful objects, which play the role of the retraining set, and (c) a decision mechanism for determining when network retraining should be activated. The retraining algorithm takes into consideration both the former and the current network knowledge in order to achieve good generalization and reduce retraining time. Object extraction is accomplished by utilizing depth information, provided by stereoscopic video, and by incorporating a multiresolution implementation of the Recursive Shortest Spanning Tree (RSST) segmentation algorithm. Finally, the decision mechanism in this framework depends on a scene change detection algorithm. Results are presented which illustrate the performance of the proposed approach in real-life experiments.

Keywords: neural network, tracking, stereoscopic video sequences

1 Introduction
The success of emerging multimedia applications, such as video editing, content-based image retrieval, video summarization, object-dependent transmission and video surveillance, depends on the development of new sophisticated algorithms for efficient description, segmentation and representation of the visual content [1].
Such a content-based approach offers a new range of capabilities in terms of access, identification and manipulation of the visual information [2]. In particular, (a) it provides high compression ratios by allowing the encoder to place more emphasis on objects of interest [2],[3]; (b) it offers multimedia capabilities and interactivity, since an object can be handled independently; and (c) it facilitates sophisticated video queries and content-based retrieval operations on image/video databases [4].

The MPEG-4 standard introduced the concept of Video Objects (VOs) for content-oriented description and coding of video sequences. Each VO consists of arbitrarily shaped regions with different color, texture or motion, so content-based segmentation remains a challenging task for many applications, except perhaps for video sequences produced in a studio environment using chroma-key technology. In stereoscopic video, however, the problem of content-based segmentation can be addressed more effectively. This is because depth information can be estimated more reliably and provides an efficient content description, since a video object is usually located on a specific depth plane [5].

Neural networks have not played a significant role in the development of video coding standards such as MPEG-1 and MPEG-2. Nevertheless, their superior non-linear classification abilities can make neural networks a major analysis tool in the forthcoming multimedia-oriented standards (MPEG-4 and MPEG-7).

Several techniques and algorithms have been proposed in the literature for image segmentation and tracking. Color-oriented methods have recently been proposed, based on the morphological watershed [6] or on split-and-merge techniques [7]. However, an intrinsic property of video objects is that they usually consist of regions of totally different color characteristics and consequently the main