An Efficient Gradient Computation Approach to Discriminative Fusion
Optimization in Semantic Concept Detection
Chengyuan Ma and Chin-Hui Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332, USA
{cyma, chl}@ece.gatech.edu
Abstract
In this paper, we propose an efficient gradient computation approach for discriminative fusion optimization in TRECVID high-level feature extraction. Numerical approximation was exploited in the gradient calculation and model parameter update. The gradient of the performance measure was approximated by a sum of instance point-wise gradients instead of the instance pair-wise gradients used in maximum figure-of-merit (MFoM) learning, so that performance metrics such as average precision can be optimized directly and efficiently on large training sets.
Experiments on the TRECVID 2005 high-level feature
extraction test set showed that the proposed algorithm
can improve the mean average precision from 0.254 of
a state-of-the-art baseline system to 0.285.
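For reference, the target metric named above, non-interpolated average precision, averages the precision at each rank where a relevant item is retrieved. A minimal sketch of the computation (the function name and binary-relevance input format are illustrative, not taken from the paper):

```python
def average_precision(ranked_relevance):
    """Non-interpolated average precision of a ranked list.

    ranked_relevance: 0/1 relevance labels in ranked (retrieval) order.
    AP = mean of precision@k over the ranks k where a relevant item appears.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

# Relevant items at ranks 1, 3, 5: AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(average_precision([1, 0, 1, 0, 1]))
```

Because AP is a function of the ranking rather than of individual scores, it is not directly differentiable, which motivates the gradient approximation discussed in this paper.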
1. Introduction
Detectors are the basic units of a detection-based automatic video analysis system. They can be implemented at different levels, such as face detectors, anchor detectors, and speech/music detectors. Among
them, semantic concept detectors are of great impor-
tance. They help us to bridge the semantic gap between
the low-level visual features and the high-level seman-
tics. They also facilitate more intuitive indexing, re-
trieval and navigation of broadcast news video at the
semantic level. In addition, the dynamic patterns of se-
mantic concepts reveal the structure of a broadcast news
video story. Semantic concept detection, also known as high-level feature extraction in TRECVID, remains challenging due to small, imbalanced training data and large variation among examples. Much research effort and progress have been driven by the TRECVID benchmark evaluations [1] [2].
Two fundamental parts of semantic concept detec-
tion are suitable feature representations and optimal fu-
sion strategies. Various visual features such as color,
texture and edges are widely used. In addition, the
bag-of-visual-words (BoW) representation based on lo-
cal appearance features [3] and latent semantic index-
ing (LSI) feature representation have been explored [4].
Both early and late fusion strategies have been investigated. To properly weight each low-level visual feature, late fusion (evidence fusion) strategies were shown to perform favorably when compared to early fusion
[2] [5]. Some other post-processing frameworks have
been proposed to exploit the contextual relationship and
temporal dependency [6]. For instance, a discrimina-
tive fusion strategy based on maximum figure-of-merit
(MFoM) was proposed to combine multi-class and bi-
nary concept classifiers [4].
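The computational contrast between pair-wise MFoM-style objectives and the point-wise approximation proposed in this paper can be sketched in terms of gradient cost: a pair-wise ranking loss must visit every positive-negative score pair, O(|P|·|N|) work per iteration, while a point-wise sum touches each instance once, O(n). The sketch below is illustrative only; the sigmoid surrogate loss and function names are assumptions for exposition, not the paper's exact formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_grad(scores_pos, scores_neg):
    """Gradient of a smoothed pair-wise ranking loss
    L = sum_{i,j} sigmoid(s_neg[j] - s_pos[i]).
    Visits every positive-negative pair: O(|P|*|N|) per iteration."""
    g_pos = [0.0] * len(scores_pos)
    g_neg = [0.0] * len(scores_neg)
    for i, sp in enumerate(scores_pos):
        for j, sn in enumerate(scores_neg):
            d = sigmoid(sn - sp)
            w = d * (1.0 - d)   # sigmoid derivative at (sn - sp)
            g_pos[i] -= w       # pushing positive scores up lowers the loss
            g_neg[j] += w       # pushing negative scores down lowers the loss
    return g_pos, g_neg

def pointwise_grad(scores, labels):
    """Point-wise approximation: one gradient term per instance, O(n),
    here the gradient of a per-instance logistic loss."""
    return [sigmoid(s) - y for s, y in zip(scores, labels)]
```

On a training set with thousands of positive and negative instances per concept, the pair-wise loop above becomes the per-iteration bottleneck, which is exactly the cost the point-wise approximation avoids.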
One problem with many conventional fusion strategies is that the optimization criteria used in the fusion step are inconsistent with the metrics used for performance evaluation. For instance, in the TRECVID high-level feature extraction evaluation, average precision (AP) and mean average precision (MAP) are widely used, yet many fusion strategies aim at minimizing the probability of error or maximizing the likelihood, and are therefore likely to produce suboptimal systems. This problem becomes more severe when dealing with small amounts of imbalanced training data. Several discriminative fusion strategies have been investigated to optimize the recall, precision, accuracy, and F measure on the training set [7] [8]. In addition, some
approaches like an ensemble learning framework and
support vector method were proposed to optimize the
area under the receiver operating characteristic (ROC)
curve [9] [10] [11]. However, when working on large
training sets, optimizing AP or the area under the ROC curve requires heavy computation at each iteration. In this paper, we propose an efficient gradient
computation approach for discriminative fusion opti-
mization such that the model parameters can be esti-
978-1-4244-2175-6/08/$25.00 ©2008 IEEE