LATENT SEMANTIC INDEXING FOR SEMANTIC CONTENT DETECTION OF VIDEO SHOTS

Fabrice Souvannavong, Bernard Merialdo and Benoît Huet
Département Communications Multimédias
Institut Eurecom
2229 route des Crêtes
06904 Sophia-Antipolis - France
e-mail: {souvanna, merialdo, huet}@eurecom.fr

ABSTRACT

Low-level features are becoming insufficient to build efficient content-based retrieval systems. Users no longer want merely to retrieve visually similar content; they expect retrieval systems to find documents with similar semantic content. Bridging the gap between low-level features and semantic content is a challenging task that future retrieval systems must address. Latent Semantic Indexing (LSI) was successfully introduced to efficiently index text documents. In this paper we propose to adapt this technique to efficiently represent the visual content of video shots for semantic content detection. Although we restrict our approach to visual features, it can be extended with minor changes to audio and motion features to build a multi-modal system. The semantic content is then detected with two classifiers: a k-nearest-neighbor classifier and a neural network classifier. Finally, in the experimental section we report the performance of each classifier and the performance gain obtained with LSI features compared to traditional features.

1. INTRODUCTION

Because of the growth of digital storage facilities, many documents are now archived in huge databases or extensively shared over the Internet. The advantage of such mass storage is undeniable. However, the challenging tasks of multimedia content indexing and retrieval remain unsolved without expensive human intervention to archive and annotate content. Many researchers are currently investigating methods to automatically analyze, organize, index and retrieve video information [1, 2].
This effort is further stimulated by the emerging MPEG-7 standard, which provides a rich and common description tool for multimedia content. It is also encouraged by Video-TREC, which aims at developing video content analysis and retrieval.

Currently, one of the main challenges in the field of image and video retrieval is to automatically bridge the gap from low-level visual features to semantic content. For the past three years, TREC (1) has been running a track to encourage research and development in the domain of video content analysis, indexing and retrieval. In particular, one of the proposed tasks is the extraction of semantic features, such as people, indoors, news subject, etc., in video shots.

We propose a system to efficiently index visual features in order to extract the semantic content of video shots. The first step is conducted with an adaptation of Latent Semantic Indexing (LSI) to image and video content. LSI has proven effective for text document analysis, indexing and retrieval [3]; extensions to audio and image features were proposed later [4, 5]. The adaptation we present models video shots in a similar way to text documents: key frames of shots are described by the occurrences of a set of predefined region types. The underlying idea is that each region of an image carries semantic information that influences the semantic content of the whole shot. In [6], the authors propose a statistical model that maps image regions to keywords in order to annotate the complete image. In this paper, we study the occurrence of regions across many shots to build efficient signatures of shots.

This research was supported by the EU project GMF4iTV under the IST programme (IST-2001-34861).
(1) Text REtrieval Conference. Its purpose is to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation. http://trec.nist.gov
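As an illustration, the region-occurrence signatures described above can be sketched with a toy "region type by shot" matrix and a truncated singular value decomposition, exactly as in text LSI. The region counts, matrix sizes, rank k and the query shot below are illustrative assumptions, not values or data from the paper.

```python
import numpy as np

# Toy occurrence matrix, analogous to the term-document matrix of
# text LSI. Rows: predefined region types; columns: key frames of
# shots. All counts are made up for the sketch.
A = np.array([
    [2, 0, 1, 0],   # region type 0
    [1, 1, 0, 0],   # region type 1
    [0, 3, 1, 0],   # region type 2
    [0, 1, 0, 2],   # region type 3
    [0, 0, 2, 1],   # region type 4
], dtype=float)

# LSI: truncated SVD, A ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent dimensions (assumption)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each shot's signature is its column expressed in the reduced
# latent space; these are the "efficient signatures" of shots.
signatures = (np.diag(s_k) @ Vt_k).T    # shape: (num_shots, k)

# A new shot is folded in by projecting its region counts.
q = np.array([1, 0, 2, 0, 1], dtype=float)
q_sig = q @ U_k                         # k-dimensional signature

# Cosine similarity in the latent space ranks shots against the query.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cosine(q_sig, sig) for sig in signatures]
best = int(np.argmax(sims))
print("most similar shot:", best)
```

The same latent-space signatures can then be fed to the classifiers of the second step; the point of the truncation is that shots sharing co-occurring region types end up close in the k-dimensional space even when their raw counts differ.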
The obtained signatures retain the most informative part of each shot, which is then used to detect its semantic content. The second step, i.e. the semantic analysis, is achieved with the well-known k-nearest-neighbor and neural network classifiers. The advantage of k-nearest-neighbor classifiers lies in their independence with respect to the data distribution, while neural network classifiers take advantage of label correlation in the context of multi-label classification.

The next section presents our adaptation of Latent Semantic Indexing to video shots. Next we present the k-nearest-neighbor and neural network classifiers. Then we set up the experimental framework to discuss results and compare LSI to traditional features. Finally we conclude with a brief summary and future work.

2. LATENT SEMANTIC INDEXING

In the field of text document analysis, Latent Semantic Indexing (LSI) is a theory and method for extracting and representing the contextual meaning of words through statistical computations applied to a large corpus of text. The underlying idea is that the aggregate of all the contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSI's reflection of human knowledge has been established in a variety of ways [7]. For example, its scores overlap those of humans on standard vocabulary and subject-matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and it accurately estimates passage coherence, learnability of passages