A semi-automatic video annotation tool with MPEG-7 content collections

Roberto Vezzani, Costantino Grana, Daniele Bulgarelli, Rita Cucchiara
DII - Università degli Studi di Modena e Reggio Emilia
{surname.name}@unimore.it

Abstract

In this work, we present a general purpose system for hierarchical structural segmentation and automatic annotation of video clips by means of standardized low-level features. We propose to automatically extract prototypes for each class with a context-based intra-class clustering. Clips are annotated following the MPEG-7 standard directives to provide easier portability. Results of automatic annotation and semi-automatic metadata creation are provided.

1. Introduction

The increasing spread of Video Digital Libraries calls for the design of efficient Video Data Management Systems to manage video access, provide summarization and similarity search, and support queries according to the available annotations. Video summaries provide compressed representations of videos without losing crucial content, allowing efficient browsing and a fast overview of the original material while dropping the time spent on tedious operations such as fast-forwarding and rewinding. Examples of automatic semantic annotation systems have been presented recently, most of them in the application domain of news and sports video. Most of the proposals deal with a specific context, making use of ad-hoc features. In [1] the playfield area, the number and placement of players on the playfield, and motion cues are used to distinguish soccer highlights into subclasses. In contrast, a first approach applying general features is described in [2]: employing color, texture, motion, and shape, visual queries by sketches are provided, supporting automatic object-based indexing and spatiotemporal queries.
We propose a general framework which allows video clips to be automatically annotated by comparing their similarity to a domain-specific set of prototypes. In particular, we focus on providing a flexible system directly applicable to different contexts and producing a standardized MPEG-7 output. To this aim, the clip-characterizing features, the final video annotation, and the storage of the reference video objects and classes are all realized using this standard. Starting from a large set of clips manually annotated according to a classification scheme, the system exploits the potential perceptual regularity and generates a set of prototypes, or visual concepts, by means of an intra-class clustering procedure. Then, only the prototypes are stored as suitable specialization concepts of the defined classes. Thanks to the massive use of the MPEG-7 standard, a remote system could then perform its own annotation of videos using these context classifiers.

2. Similarity of video clips

The problem of clip similarity can be seen as a generalization of the problem of image similarity: as for images, each clip may be described by a set of visual features, such as color, shape, and motion. These are grouped in a feature vector V_i = [F_i^1, F_i^2, ..., F_i^N], where i is the frame number, N is the number of features, and F_i^j is the j-th feature computed at frame i. However, extracting a feature vector at each frame can lead to problems during the similarity computation between clips, since clips may have different lengths; at the same time, keeping a single feature vector for the whole clip may not be representative enough, because it does not take into account the temporal variability of the features. Here, a fixed number M of feature vectors is used for each clip, computed on M frames sampled at uniform intervals within the clip. In our experiments, a good tradeoff between efficacy and computational load suggests the use of M = 5 for clips averaging 100 frames.
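The uniform-sampling scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the per-frame features are assumed to be already computed as an array of shape (num_frames, N), and the Euclidean clip distance is an assumed metric, since the text does not fix one.

```python
import numpy as np

def sample_indices(num_frames, m=5):
    """Indices of m frames sampled at uniform intervals within a clip."""
    # np.linspace yields m evenly spaced positions in [0, num_frames - 1]
    return np.linspace(0, num_frames - 1, m).round().astype(int)

def clip_descriptor(frame_features, m=5):
    """Stack the feature vectors V_i of the m uniformly sampled frames.

    frame_features: array of shape (num_frames, N), one low-level feature
    vector per frame. Returns an (m, N) matrix describing the whole clip,
    which keeps some of the temporal variability a single vector would lose.
    """
    idx = sample_indices(len(frame_features), m)
    return frame_features[idx]

def clip_distance(desc_a, desc_b):
    """Assumed clip dissimilarity: mean Euclidean distance between
    corresponding sampled feature vectors (one possible choice)."""
    return float(np.mean(np.linalg.norm(desc_a - desc_b, axis=1)))
```

With M = 5 and a 100-frame clip, the first and last frames are always among the samples, so the descriptor covers the whole clip regardless of its length.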
To provide a general purpose system, we avoid selecting context-dependent features, relying instead on broad-range properties of the clips. To allow easier interoperability