TIME INTERVAL MAXIMUM ENTROPY BASED EVENT INDEXING IN SOCCER VIDEO

Cees G.M. Snoek and Marcel Worring
Intelligent Sensory Information Systems, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{cgmsnoek, worring}@science.uva.nl

ABSTRACT

Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles the aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably.

1. INTRODUCTION

Effective and efficient extraction of semantic indexes from video documents requires simultaneous analysis of visual, auditory, and textual information sources. Several such methods have been proposed in the literature, addressing different types of semantic indexes; see [12] for an extensive overview. Multimodal methods for the detection of semantic events are still rare; notable exceptions are [3, 7, 8, 10]. For the integration of the heterogeneous data sources, a statistical classifier gives better results [12] than heuristic methods, e.g. [3]. In particular, instances of the Dynamic Bayesian Network (DBN) framework have been applied, e.g. [8, 10]. A first drawback of the DBN framework is that the model works with fixed common units, e.g. image frames, thereby ignoring differences in the layout schemes of the modalities, and thus proper synchronization. Secondly, it is difficult to model several asynchronous temporal context relations simultaneously. Finally, it lacks satisfactory inclusion of the textual modality. Some of these limitations can be overcome by a maximum entropy framework.
The maximum entropy framework has been successfully applied in diverse research disciplines, including statistical natural language processing, where it achieves state-of-the-art performance [4]. More recently it has also been reported in the video indexing literature [7], with promising results for highlight classification in baseball. However, that method lacks synchronization of the multimodal information sources. We propose the Time Interval Maximum Entropy (TIME) framework, which extends the standard framework with time interval relations to allow proper inclusion of multimodal data, synchronization, and context relations. To demonstrate the viability of TIME for the detection of semantic events in multimodal video documents, we evaluated the method on the domain of soccer broadcasts. Other methods using this domain exist, e.g. [2, 14]. We improve on this existing work by exploiting multimodal, instead of unimodal, information sources, and by using a classifier based on statistics instead of heuristics.

The rest of this paper is organized as follows. We first introduce event representation in the TIME framework. Then we proceed with the basics of the maximum entropy classifier in section 3. In section 4 we discuss the classification of events in soccer video and the features used. Experiments are presented in section 5.

(This research is sponsored by the ICES/KIS MIA project and TNO.)

2. VIDEO EVENT REPRESENTATION

We view the problem of event detection in video as a pattern recognition problem, where the task is to assign to a pattern x an event or category ω, based on a set of n features (f1, f2, ..., fn) derived from x. We now consider how to represent a pattern.

A multimodal video document is composed of different modalities, each with their own layout and content elements. Therefore, features have to be defined on layout-specific segments. Hence, synchronization is required. To illustrate, consider figure 1.
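The core idea of relating asynchronously timed features to an event window can be sketched in a few lines of code. This is an illustrative sketch only: the interval representation and the three relation names below are our own simplification, not the formalism or implementation used in the paper.

```python
# Hypothetical sketch: classify how a feature's time interval relates
# to an event's time window ("before", "within", or "after"). TIME-style
# context clues are exactly those features occurring before or after
# the event window, in addition to those overlapping it.

def relation(feature, event):
    """Return the temporal relation of a feature interval (start, end)
    to an event interval (start, end), in seconds."""
    f_start, f_end = feature
    e_start, e_end = event
    if f_end < e_start:
        return "before"   # contextual clue preceding the event
    if f_start > e_end:
        return "after"    # contextual clue following the event
    return "within"       # clue overlapping the event window

# Example: a goal event with a preceding camera pan and overlapping speech.
event_window = (120.0, 125.0)   # hypothetical goal, seconds into the match
camera_pan = (117.5, 119.0)     # swift pan towards the goal area
excited_speech = (121.0, 127.0) # excited commentator

assert relation(camera_pan, event_window) == "before"
assert relation(excited_speech, event_window) == "within"
```

Representing clues this way keeps each modality on its own time scale; only the interval endpoints are compared, so no resampling to a fixed common unit such as image frames is needed.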
In this example a video document is represented by five time-dependent features defined on different asynchronous time scales. At a certain moment an event occurs. Clues for the occurrence of this event are found in the features that have a value within the time window of the event, but also in contextual features that have a value before or after the actual occurrence of the event. As an example, consider a goal in a soccer match. Clues that indicate this event are a swift camera pan towards the goal area before the goal, an excited commentator dur-