An Efficient Gradient Computation Approach to Discriminative Fusion Optimization in Semantic Concept Detection

Chengyuan Ma and Chin-Hui Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332, USA
{cyma, chl}@ece.gatech.edu

Abstract

In this paper, we propose an efficient gradient computation approach for discriminative fusion optimization in TRECVID high-level feature extraction. Numerical approximation is exploited in gradient calculation and model parameter update. The gradient of the performance measure is approximated by a sum of instance point-wise gradients instead of the instance pair-wise gradients used in maximum figure-of-merit learning, so that performance metrics such as average precision can be optimized directly and efficiently on a large training set. Experiments on the TRECVID 2005 high-level feature extraction test set show that the proposed algorithm improves the mean average precision of a state-of-the-art baseline system from 0.254 to 0.285.

1. Introduction

Detectors are the basic units of a detection-based automatic video analysis system. They can be implemented at different levels, such as face detectors, anchor detectors, speech/music detectors, and so on. Among them, semantic concept detectors are of great importance. They help bridge the semantic gap between low-level visual features and high-level semantics. They also facilitate more intuitive indexing, retrieval, and navigation of broadcast news video at the semantic level. In addition, the dynamic patterns of semantic concepts reveal the structure of a broadcast news video story. Semantic concept detection, also known as high-level feature extraction in TRECVID, becomes more challenging due to small, imbalanced training data and large variation among examples. Much research effort and progress have been driven by the TRECVID benchmark evaluations [1] [2].
Two fundamental parts of semantic concept detection are suitable feature representations and optimal fusion strategies. Various visual features, such as color, texture, and edges, are widely used. In addition, the bag-of-visual-words (BoW) representation based on local appearance features [3] and latent semantic indexing (LSI) feature representations have been explored [4]. Both early and late fusion strategies have been investigated. To properly weigh each low-level visual feature, late fusion (evidence fusion) strategies were shown to perform favorably compared to early fusion [2] [5]. Other post-processing frameworks have been proposed to exploit contextual relationships and temporal dependencies [6]. For instance, a discriminative fusion strategy based on maximum figure-of-merit (MFoM) learning was proposed to combine multi-class and binary concept classifiers [4].

One problem with many conventional fusion strategies is that the optimization criterion used in the fusion step differs from, or is inconsistent with, the metric used for performance evaluation. For instance, the TRECVID high-level feature extraction evaluation uses average precision (AP) and mean average precision (MAP), whereas many fusion strategies aim at minimizing the probability of error or maximizing the likelihood; such criteria are therefore likely to produce suboptimal systems. This problem becomes more severe when dealing with small amounts of imbalanced training data. Several discriminative fusion strategies have been investigated to optimize recall, precision, accuracy, and the F measure on the training set [7] [8]. In addition, approaches such as an ensemble learning framework and a support vector method have been proposed to optimize the area under the receiver operating characteristic (ROC) curve [9] [10] [11]. However, on a large training set, optimizing AP or the area under the ROC curve requires heavy computation at each iteration.
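Since AP and MAP are the metrics at issue, it may help to recall how they are computed from a ranked list. The following minimal Python sketch (function names are illustrative, not from the paper) computes non-interpolated AP as the mean of precision@k over the ranks at which a positive item appears, normalizing by the number of positives in the list; the official TRECVID scoring additionally normalizes by the total number of relevant shots, including any not retrieved.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k over the ranks k
    at which a relevant (positive) item appears in the ranking."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: the per-concept AP values averaged over all concepts.
    `runs` is a list of (scores, labels) pairs, one per concept."""
    aps = [average_precision(scores, labels) for scores, labels in runs]
    return sum(aps) / len(aps)
```

Note that AP depends only on the ranking induced by the scores, which is why its gradient with respect to the model parameters is not directly available and must be approximated, as discussed next.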
In this paper, we propose an efficient gradient computation approach for discriminative fusion optimization such that the model parameters can be esti-

978-1-4244-2175-6/08/$25.00 ©2008 IEEE
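The cost contrast between pair-wise and point-wise gradients can be illustrated with a generic linear fusion model. The sketch below is not the paper's MFoM formulation; it uses two assumed surrogate losses, a sigmoid-smoothed pair-wise ranking surrogate (whose gradient sums over all positive-negative pairs, O(P·N) terms per iteration) and a per-instance logistic surrogate (O(n) terms), purely to show where the computational savings of a point-wise sum come from.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_grad(w, X, y):
    """Gradient of an assumed pair-wise ranking surrogate
    L = sum over (pos, neg) pairs of sigmoid(s_neg - s_pos),
    where s_i = w . x_i is the fused score.
    Cost: O(P * N) gradient terms per iteration."""
    s = X @ w
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    g = np.zeros_like(w)
    for p in pos:
        for n in neg:
            d = sigmoid(s[n] - s[p])       # sigmoid'(z) = d * (1 - d)
            g += d * (1.0 - d) * (X[n] - X[p])
    return g

def pointwise_grad(w, X, y):
    """Gradient of an assumed per-instance logistic surrogate
    L = sum_i log(1 + exp(-t_i * s_i)), with t_i in {-1, +1}.
    Cost: O(n) gradient terms -- one per instance."""
    t = 2.0 * y - 1.0
    s = X @ w
    return -(t * sigmoid(-t * s)) @ X
```

Both gradients drive the same kind of iterative parameter update; the point-wise form simply replaces the sum over instance pairs with a sum over instances, which is what makes direct metric optimization tractable on a large training set.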