A GROUND TRUTH FOR MOTION-BASED VIDEO-OBJECT SEGMENTATION*
Fabrizio Tiburzi, Marcos Escudero, Jesús Bescós and José M. Martínez
Grupo de Tratamiento de Imágenes, Escuela Politécnica Superior
Universidad Autónoma de Madrid, E-28049 Madrid, Spain
e-mail: {fabricio.tiburzi, marcos.escudero, j.bescos, josem.martinez}@uam.es
* Work supported by the Spanish Government (TEC2007-65400 - SemanticVideo), the European Commission (IST-FP6-027685 - Mesh)
and the Comunidad de Madrid (S-0505/TIC-0223 - ProMultiDis-CM).
ABSTRACT
This paper describes the design procedure followed to
generate a ground truth for the evaluation of motion-based
algorithms for video-object segmentation. A thorough
review and classification of the critical factors that affect the
behavior of segmentation algorithms has resulted in a set of video
scripts, which have then been filmed. Foreground objects
have been recorded in a chroma studio in order to
automatically obtain high-quality pixel-level segmentation
masks for each generated sequence. The resulting corpus
(segmentation ground-truth plus filmed sequences mounted
over different backgrounds) is available for research
purposes under a license agreement.
Index Terms— Segmentation ground-truth, video
segmentation, object segmentation, video corpus.
1. INTRODUCTION
Video object segmentation has gradually drawn the
attention of many researchers in recent years. In parallel,
interest in reliable strategies to assess the quality of this
segmentation has also grown. Available techniques for this
purpose can be broadly divided into
subjective [1] and objective [2][3] approaches. Subjective
evaluation relies on human judgment of
the segmentation quality, which makes this approach
impractical when the number of evaluations needed
is large.
Regarding objective evaluation, the most common
approach relies on an ideal segmentation reference, the
so-called ground-truth, which is used for comparison. This
comparison can be done either by means of pixel
misclassification ratios or using the many additional metrics
proposed in the literature [2][3]. These additional metrics take into
account object properties such as shape, color or texture,
thereby assessing mask similarity from a more semantically
meaningful point of view.
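As an illustration of the simplest kind of comparison mentioned above, the following minimal sketch counts misclassified pixels between a binary segmentation mask and its ground-truth mask. The function name, the mask representation (nested lists of 0/1 values) and the normalization by the total number of pixels are assumptions made for this example; they are not taken from the metrics in [2][3].

```python
from typing import Dict, List

# Hypothetical mask representation: 1 = object pixel, 0 = background pixel.
Mask = List[List[int]]


def pixel_error_ratios(result: Mask, ground_truth: Mask) -> Dict[str, float]:
    """Return pixel misclassification ratios of `result` w.r.t. `ground_truth`."""
    if len(result) != len(ground_truth) or any(
        len(r) != len(g) for r, g in zip(result, ground_truth)
    ):
        raise ValueError("masks must have identical dimensions")

    false_pos = false_neg = total = 0
    for row_r, row_g in zip(result, ground_truth):
        for r, g in zip(row_r, row_g):
            total += 1
            if r and not g:
                false_pos += 1  # background pixel labeled as object
            elif g and not r:
                false_neg += 1  # object pixel labeled as background

    if total == 0:
        raise ValueError("masks must contain at least one pixel")

    return {
        "false_positive_ratio": false_pos / total,
        "false_negative_ratio": false_neg / total,
        "error_ratio": (false_pos + false_neg) / total,
    }


# Toy 2x4 example: one missed object pixel and one spurious object pixel.
ground_truth = [[0, 1, 1, 0],
                [0, 1, 1, 0]]
result = [[0, 1, 0, 0],
          [1, 1, 1, 0]]
print(pixel_error_ratios(result, ground_truth))
```

In the toy example, two of the eight pixels are misclassified, so the overall error ratio is 0.25. A ratio of this kind treats all pixels equally, which is why metrics that also consider shape, color or texture can give a more meaningful assessment.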
The precise location of the objects in each frame, required for
ground-truth-based evaluation, must be extracted either
manually or via some reliable procedure. Either way, it
normally requires a considerable amount of human effort.
Furthermore, the objects of interest (faces, abandoned entities, moving
cars...) are highly dependent on the addressed application,
which makes a single generic ground-truth for the
evaluation of every segmentation algorithm unfeasible. Given the
large number of works adopting motion as a feature to
discriminate relevant objects from background, we have
focused our work on the evaluation of algorithms based on
this criterion. However, instead of elaborating on suitable
metrics assuming ground-truth information is available, as
most works do, in this paper we address, from a rigorous
perspective, the ground-truth generation process itself.
The structure of this paper is as follows. Section 2
presents a number of design considerations necessary to
achieve a representative set of video sequences from a
motion segmentation point of view. The definition of the
sequences and the recording procedure are discussed in Sections
3 and 4, and the download procedure and some examples are
provided in Section 5.
2. GROUND-TRUTH DESIGN: CRITICAL FACTORS
IN MOTION-BASED SEGMENTATION
In order for a ground-truth to yield meaningful evaluation
results, it should include a set of representative video
sequences, ranging from low to high complexity situations.
The term “complexity” will be used hereinafter to express
the degree of difficulty for a particular segmentation
algorithm to yield accurate results.
Motion-based segmentation algorithms are generally
based either on some pixel intensity model or, especially
when dealing with moving cameras, on some optical flow
estimation of the scene. Starting from these ideas, global
sequence complexity has been found to be strongly
dependent on a series of specific properties of objects, on
background complexity, on camera motion and on some
relationships among these elements. These dependencies
have been designated as “critical factors” (emphasizing their
influence on the algorithms’ results). Since specific settings
17 978-1-4244-1764-3/08/$25.00 ©2008 IEEE ICIP 2008