A GROUND TRUTH FOR MOTION-BASED VIDEO-OBJECT SEGMENTATION *

Fabrizio Tiburzi, Marcos Escudero, Jesús Bescós and José M. Martínez

Grupo de Tratamiento de Imágenes, Escuela Politécnica Superior
Universidad Autónoma de Madrid, E-28049 Madrid, Spain
e-mail: {fabricio.tiburzi, marcos.escudero, j.bescos, josem.martinez}@uam.es

* Work supported by the Spanish Government (TEC2007-65400 - SemanticVideo), the European Commission (IST-FP6-027685 - Mesh) and the Comunidad de Madrid (S-0505/TIC-0223 - ProMultiDis-CM).

ABSTRACT

This paper describes the design procedure followed to generate a ground truth for the evaluation of motion-based algorithms for video-object segmentation. A thorough review and classification of the critical factors that affect the behavior of segmentation algorithms results in a set of video scripts, which have then been filmed. Foreground objects have been recorded in a chroma studio in order to automatically obtain high-quality, pixel-level segmentation masks for each generated sequence. The resulting corpus (segmentation ground-truth plus filmed sequences mounted over different backgrounds) is available for research purposes under a license agreement.

Index Terms: Segmentation ground-truth, video segmentation, object segmentation, video corpus.

1. INTRODUCTION

Video object segmentation has gradually drawn the attention of many researchers in recent years. In parallel, interest in reliable strategies to assess the quality of this segmentation has also grown. Available techniques for this purpose can be broadly divided into subjective [1] and objective [2][3] approaches. Subjective evaluation relies on human assessment of the segmentation quality, which prevents this approach from being applied when the number of evaluations needed is large. Regarding objective evaluation, the most common approach relies on an ideal segmentation reference, the so-called ground-truth, which is used for comparison.
This comparison can be done either by means of pixel error classification ratios or using the many additional metrics proposed in the literature [2][3]. These metrics take into account object properties such as shape, color or texture, thereby assessing mask similarity from a more semantically meaningful point of view.

The precise location of frame objects required for ground-truth relative evaluation must be extracted either manually or via some reliable procedure. In either case, it normally requires a considerable amount of human effort. Furthermore, the objects of interest (faces, abandoned entities, moving cars...) are highly dependent on the addressed application, which makes a single generic ground-truth, valid for the evaluation of any segmentation algorithm, unfeasible.

Given the large number of works adopting motion as a feature to discriminate relevant objects from background, we have focused our work on the evaluation of algorithms based on this criterion. However, instead of elaborating on convenient metrics under the assumption that ground-truth information is available, as most works do, in this paper we address, from a rigorous perspective, the ground-truth generation process itself.

The structure of this paper is as follows. Section 2 presents a number of design considerations necessary to achieve a representative set of video sequences from a motion segmentation point of view. The definition of the sequences and the recording procedure are discussed in sections 3 and 4, and the download procedure and some examples are provided in section 5.

2. GROUND-TRUTH DESIGN: CRITICAL FACTORS IN MOTION-BASED SEGMENTATION

In order for a ground-truth to yield meaningful evaluation results, it should include a set of representative video sequences, ranging from low to high complexity situations. The term “complexity” will be used hereinafter to express the degree of difficulty for a particular segmentation algorithm to yield accurate results.
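
As an illustration of the pixel error classification ratios mentioned in the introduction, the following minimal sketch compares a binary segmentation mask with a ground-truth mask of the same size. It is not taken from [2][3]: the function name `pixel_error_ratios`, the representation of masks as nested lists of 0 (background) and 1 (foreground), and the choice of normalizing false positives by the number of background pixels and false negatives by the number of foreground pixels are all assumptions made for this example.

```python
from typing import Dict, Sequence

# Hypothetical mask type: rows of 0 (background) / 1 (foreground) values.
Mask = Sequence[Sequence[int]]


def pixel_error_ratios(result: Mask, truth: Mask) -> Dict[str, float]:
    """Compare a binary segmentation mask against a ground-truth mask."""
    same_shape = len(result) == len(truth) and all(
        len(r) == len(t) for r, t in zip(result, truth)
    )
    if not same_shape:
        raise ValueError("masks must have identical dimensions")

    false_pos = false_neg = foreground = background = 0
    for res_row, gt_row in zip(result, truth):
        for res_px, gt_px in zip(res_row, gt_row):
            if gt_px:
                foreground += 1
                false_neg += int(not res_px)   # object pixel missed
            else:
                background += 1
                false_pos += int(bool(res_px))  # background marked as object

    total = foreground + background
    return {
        # Assumed normalization: each error type over its own class size.
        "false_positive_ratio": false_pos / background if background else 0.0,
        "false_negative_ratio": false_neg / foreground if foreground else 0.0,
        "error_ratio": (false_pos + false_neg) / total if total else 0.0,
    }


# Toy 3x4 frame: one false positive and one false negative.
truth = [[0, 0, 1, 1],
         [0, 1, 1, 1],
         [0, 0, 1, 0]]
result = [[0, 1, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 1, 0]]
ratios = pixel_error_ratios(result, truth)
```

In this toy frame, 2 of the 12 pixels are misclassified. A purely pixel-based ratio of this kind ignores where the errors fall, which is the motivation for the shape-, color- and texture-aware metrics cited above.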
Motion-based segmentation algorithms are generally based either on pixel-intensity models or, especially when dealing with moving cameras, on an optical-flow estimate of the scene. Starting from these ideas, global sequence complexity has been found to depend strongly on a series of specific properties of objects, on background complexity, on camera motion and on some relationships among these elements. These dependencies have been designated as “critical factors” (emphasizing their influence on the algorithms’ results). Since specific settings

978-1-4244-1764-3/08/$25.00 ©2008 IEEE. ICIP 2008