1-4244-0983-7/07/$25.00 ©2007 IEEE, ICICS 2007

SVM-Based Decision Fusion Model for Detecting Concepts in Films

P. Muneesawang
Department of Electrical and Computer Engineering
Naresuan University
Phisanulok, Thailand 65000
paisarnmu@nu.ac.th

L. Guan
Department of Electrical and Computer Engineering
Ryerson University
Toronto, Canada M5B 2E8
lguan@ee.ryerson.ca

Abstract: This paper studies the use of a support vector machine (SVM) as a decision fusion algorithm for detecting semantic concepts in videos, and its application to a film database. Given a movie clip, its spatio-temporal information is captured by audiovisual features. These are then fed independently to the corresponding matching experts, whose outputs are fused at the decision stage by the SVM classifier. Based on our simulation results, this fusion method can attain very high recognition accuracy for the detection of various concepts in a collection of Hollywood movies. It requires only a very small set of training samples from a large database.

Keywords: SVM decision fusion, semantic concept detection, movie retrieval, audiovisual indexing

I. INTRODUCTION

The application of multimodal signal processing has been shown to help solve challenging problems in multimedia databases, such as story segmentation, concept detection, retrieval, and topic clustering. Most previous work has focused on news videos and sports domain applications. A key challenge is to integrate the different sources of evidence from many multimedia features into an index that helps the system effectively find what the user seeks. Owing to the nature of news videos, the text information from speech transcription and closed captions can be exploited. Fusing this text with audio-visual features has shown promising performance in the broadcast news video domain. Some recent research on this topic is discussed in [1].
For sports videos, although they are all of the same type, different sports represent different genres and require specific approaches for detecting useful attributes. These include techniques that detect events (e.g., goals and penalties) in baseball [2], cricket [3], and tennis [4], which typically yield successful results within the targeted domain. Other work pursues a more ‘generic approach’ that is common to all genres of sports (e.g., [5]-[7]). The methods discussed so far are specific to either news or sports videos, and relatively little prior work has addressed the related problems for movie domain applications. Central to all these works are complex algorithms that perform standalone modeling of specific events, based on intrinsically critical characteristic features, which tend to be particular to each video type. Their effectiveness is somewhat diluted by their inherent inapplicability to other video genres. A more generic, genre-independent methodology is a more challenging and difficult task: for a given event detection task, it is unrealistic to expect that a single solution will operate successfully across all genres of video.

In this paper we address the movie domain, where only a few previous works exist. Rassheed et al. [8] worked on the classification of movies into broad categories: Comedies, Action, Dramas, or Horror films. Inspired by cinematic principles, they applied computable features such as motion content and lighting key to map a movie into semantic classes. Combining different modalities helps alleviate problems intrinsic to single modalities, and the fusion algorithm that combines them is a critical part of the recognition system [9]. In the current work, the audio and visual features are fused by a learning module to characterize concepts.
The learning module is implemented by a passive learning process using a support vector machine to detect concepts according to pre-defined classes. SVM-based decision fusion has been demonstrated in other application domains, including cartridge identification [10] and person identity verification [11]. We propose adopting an SVM as the fusion algorithm at the decision stage for the characterization of concepts. In our experiments, the proposed system, which deploys perception features extracted from audiovisual data together with the SVM-based decision fusion model, offered very high recognition accuracy and required only a small set of training data when applied to a large database of Hollywood movies.

II. SVM-BASED DECISION FUSION

A. Fusion model

Figure 1 shows a diagram of the proposed fusion model using an SVM as the classifier. The extracted data (audio and visual) are processed by different matching experts: an audio similarity matching expert and a visual similarity matching expert. Each expert, given the extracted data, delivers a matching score in the range between zero (reject) and one (accept). These scores are not binary decisions. The SVM fusion module combines the opinions of the different experts and gives a binary decision. When combining two modules, the fusion algorithm processes 2-dimensional vectors, each component of which is a matching score in [0,1] delivered by the corresponding modality expert. This paper addresses the fusion model using two types of features.
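
To make the fusion step concrete, the following is a minimal sketch of decision-level fusion on 2-dimensional score vectors. It is not the authors' implementation: the linear kernel, the batch subgradient-descent training, the hyperparameters, the function names (`train_linear_svm`, `fuse`), and all score values are assumptions chosen for illustration only.

```python
# Toy sketch of SVM-based decision fusion on 2-D score vectors.
# Assumptions (not from the paper): linear kernel, subgradient-descent
# training, and the made-up expert scores below.

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, iters=2000):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent.

    samples: list of (audio_score, visual_score) pairs, each in [0, 1].
    labels:  +1 (concept present) or -1 (concept absent).
    """
    n = len(samples)
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(iters):
        # Gradient of the regulariser; the bias is left unregularised.
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for (s_audio, s_visual), y in zip(samples, labels):
            margin = y * (w[0] * s_audio + w[1] * s_visual + b)
            if margin < 1.0:  # sample violates the margin: hinge term is active
                grad_w[0] -= y * s_audio / n
                grad_w[1] -= y * s_visual / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def fuse(w, b, audio_score, visual_score):
    """Return the binary fused decision: True = accept, False = reject."""
    return w[0] * audio_score + w[1] * visual_score + b > 0.0


# Hypothetical outputs of the audio and visual experts for labelled clips.
train_scores = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.85), (0.85, 0.7),
                (0.1, 0.2), (0.2, 0.1), (0.3, 0.15), (0.15, 0.3)]
train_labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(train_scores, train_labels)
print(fuse(w, b, 0.9, 0.85))  # both experts report a strong match
print(fuse(w, b, 0.1, 0.15))  # both experts report a weak match
```

The point of the sketch is the interface: each expert contributes one soft score, and only the fused output is a hard accept/reject decision. A nonlinear kernel or an off-the-shelf SVM library could replace the hand-written trainer without changing that interface.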