1-4244-0983-7/07/$25.00 ©2007 IEEE ICICS 2007
SVM-Based Decision Fusion Model for Detecting
Concepts in Films
P. Muneesawang
Department of Electrical and Computer Engineering
Naresuan University
Phisanulok, Thailand 65000
paisarnmu@nu.ac.th
L. Guan
Department of Electrical and Computer Engineering
Ryerson University
Toronto, Canada M5B 2E8
lguan@ee.ryerson.ca
Abstract—This paper studies the use of a support vector machine (SVM)
as a decision fusion algorithm for detecting semantic
concepts in videos, and applies it to a film database. Given a
movie clip, its spatio-temporal information is captured by
audiovisual features. These features are fed independently to
the corresponding matching experts, whose outputs are fused at
the decision stage by the SVM classifier. In our simulations,
this fusion method attains very high recognition
accuracy in detecting various concepts in a collection of
Hollywood movies, while requiring only a very small set of training
samples drawn from a large database.
Keywords—SVM decision fusion, semantic concept detection,
movie retrieval, audiovisual indexing
I. INTRODUCTION
The application of multimodal signal processing has
been shown to help solve challenging problems in
multimedia databases, such as story segmentation, concept
detection, retrieval, and topic clustering. Most previous
work has focused on news videos and sports
applications. A key challenge is to integrate the
different sources of evidence from many multimedia features
into an index that helps the system effectively find what the user
seeks. For news videos, the text information
from speech transcription and closed captions can be exploited.
Fusing this text with audio-visual features has shown
promising performance on broadcast news
video. Some recent research on this topic is
discussed in [1]. Sports videos, although nominally of the
same type, span different genres and require
genre-specific approaches for detecting useful attributes. These
include techniques that detect events (e.g., goals
and penalties) in baseball [2], cricket [3], and tennis [4], which
typically yield successful results within the targeted domain. In
addition, some detection tasks call for a more ‘generic approach’
that is common to all genres of sports (e.g., [5]-[7]).
The methods discussed so far are specific to news
and sports videos, and relatively little prior work has addressed the
related problems in the movie domain. Central to all
these works are complex algorithms that perform standalone
modeling of specific events, based on intrinsically critical
characteristic features that tend to be particular to each video
type. Their effectiveness is limited by their inherent
inapplicability to other video genres. A more
generic, genre-independent methodology is a more
challenging and difficult task: for a given event detection task,
it is unrealistic to assume that a unique solution exists
that will operate successfully across all genres of video. In this
paper, we address these problems for the movie domain,
where only a few previous works exist.
Rassheed et al. [8] worked on the classification of movies
into broad categories: Comedies, Action, Dramas, or Horror
films. Inspired by cinematic principles, they applied computable features
such as motion content and lighting key to map a
movie into semantic classes.
Combining different modalities helps alleviate
problems intrinsic to single modalities, and the fusion algorithm
that combines the modalities is a critical part
of the recognition system [9]. In the current work, the audio
and visual features are fused by a learning module to
characterize concepts. The learning module is implemented as
a passive learning process that uses a support vector machine to
detect concepts according to pre-defined classes.
SVM-based decision fusion has been demonstrated in other
application domains, including cartridge identification [10] and
person identity verification [11]. We propose
adopting an SVM as the fusion algorithm at the decision
stage for characterizing concepts. In our
experiments, the proposed system, which combines perception
features extracted from audiovisual data with the
SVM-based decision fusion model, offered very high
recognition accuracy and required only a small set of training data
when applied to a large database of Hollywood movies.
II. SVM-BASED DECISION FUSION
A. Fusion model
Figure 1 shows a diagram of the proposed fusion model,
which uses an SVM as the classifier. The extracted data (audio and
visual) are processed by different matching experts: an audio
similarity matching expert and a visual similarity matching
expert. Each expert, given the extracted data, delivers a
matching score in the range between zero (reject) and one
(accept). These scores are not binary decisions. The SVM
fusion module combines the opinions of the different
experts and gives a binary decision. When combining two
modules, the fusion algorithm processes 2-dimensional vectors,
each component of which is a matching score in [0,1] delivered
by the corresponding modality expert. This paper addresses
the fusion model using two types of features.
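As an illustration of the data flow described above, the following minimal sketch fuses two expert scores with a linear SVM. It is not the implementation used in this paper: the linear kernel, the subgradient-descent training routine, the hyperparameters (lam, lr, epochs), the function names, and the score values are all assumptions made for the example. Each training sample is a 2-dimensional vector (audio score, visual score) in [0,1], labelled +1 (concept present) or -1 (concept absent).

```python
# Hypothetical matching scores (audio, visual) in [0, 1]; illustrative only.
SCORES = [
    (0.90, 0.80), (0.80, 0.90), (0.70, 0.85), (0.85, 0.70),  # concept present
    (0.10, 0.20), (0.20, 0.10), (0.30, 0.15), (0.15, 0.30),  # concept absent
]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]


def train_linear_svm(samples, labels, lam=0.01, lr=0.05, epochs=5000):
    """Fit a linear SVM by full-batch subgradient descent on the
    L2-regularised hinge loss. Returns (w, b) with w a 2-element list."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        # The regulariser acts on the weights only, not on the bias.
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for (audio, visual), y in zip(samples, labels):
            margin = y * (w[0] * audio + w[1] * visual + b)
            if margin < 1.0:  # sample is inside the margin or misclassified
                grad_w[0] -= y * audio / n
                grad_w[1] -= y * visual / n
                grad_b -= y / n
        w[0] -= lr * grad_w[0]
        w[1] -= lr * grad_w[1]
        b -= lr * grad_b
    return w, b


def fuse(model, audio_score, visual_score):
    """Combine two non-binary expert scores into a binary decision:
    +1 (accept) or -1 (reject)."""
    w, b = model
    value = w[0] * audio_score + w[1] * visual_score + b
    return 1 if value >= 0.0 else -1


model = train_linear_svm(SCORES, LABELS)
decision_high = fuse(model, 0.90, 0.80)  # both experts give high scores
decision_low = fuse(model, 0.10, 0.20)   # both experts give low scores
```

The point of the sketch is the interface: the experts supply continuous scores, and only the fusion stage produces a binary decision. A nonlinear kernel or a library SVM could be substituted without changing that interface.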