MULTIMODAL VIDEO SEARCH TECHNIQUES: LATE FUSION OF SPEECH-BASED
RETRIEVAL AND VISUAL CONTENT-BASED RETRIEVAL
A. Amir†, G. Iyengar‡, C-Y. Lin§, M. Naphade§, A. Natsev§, C. Neti‡, H. J. Nock‡, J. R. Smith§, B. Tseng§
‡IBM TJ Watson Research Center, Yorktown Heights, NY, USA
§IBM TJ Watson Research Center, Hawthorne, NY, USA
†IBM Almaden Research Center, San Jose, CA, USA
ABSTRACT
There has been extensive research into systems for content-based
or text-based (e.g. closed captioning, speech transcript) search,
some of which has been applied to video. However, the 2001
and 2002 NIST TRECVID benchmarks of broadcast video search
systems showed that designing multimodal video search systems
which integrate both speech and image (or image sequence) cues,
and thereby improve performance beyond that achievable by sys-
tems using only speech or image cues, remains a challenging prob-
lem. This paper describes multimodal systems for ad-hoc search
constructed by IBM for the TRECVID 2003 benchmark of search
systems for broadcast video. These multimodal ad-hoc search sys-
tems all use a late fusion of independently developed speech-based
and visual content-based retrieval systems and outperform our in-
dividual speech-based and content-based retrieval systems on both
manual and interactive search tasks. For the manual task, our best
system used a query-dependent linear weighting between speech-
based and image-based retrieval systems. This system has Mean
Average Precision (MAP) performance 20% above our best uni-
modal system for manual search. For the interactive task, where
the user has full knowledge of the query topic and the performance
of the individual search systems, our best system used an interlacing
approach: the user determines (subjectively) optimal counts A and B
for the speech-based and image-based systems, and the multimodal
result set is formed by taking the top A documents from the
speech-based system, then the top B documents from the image-based
system, repeating this process until the desired result set size is
reached. This multimodal interactive search has MAP
40% above our best unimodal interactive search system.
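The two fusion strategies described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the score-normalization assumptions, and the handling of duplicates are our own choices.

```python
def weighted_fusion(speech_scores, image_scores, lam):
    """Query-dependent linear weighting of two retrieval systems.

    speech_scores, image_scores: dicts mapping shot id -> retrieval score
    (assumed comparably normalized); lam: per-query weight in [0, 1] for
    the speech-based system. Returns shot ids ranked by fused score.
    """
    shots = set(speech_scores) | set(image_scores)
    fused = {s: lam * speech_scores.get(s, 0.0)
                + (1.0 - lam) * image_scores.get(s, 0.0)
             for s in shots}
    return sorted(fused, key=fused.get, reverse=True)


def interlace(speech_ranked, image_ranked, a, b, size):
    """Interlaced fusion: alternately take the next a unseen results
    from the speech-based ranked list and the next b unseen results
    from the image-based ranked list until size results are collected.
    """
    assert a >= 1 and b >= 1
    result, seen = [], set()
    streams = [(iter(speech_ranked), a), (iter(image_ranked), b)]
    while len(result) < size:
        progressed = False
        for it, take in streams:
            taken = 0
            for doc in it:
                if doc not in seen:       # skip duplicates across lists
                    seen.add(doc)
                    result.append(doc)
                    taken += 1
                    progressed = True
                if taken == take or len(result) == size:
                    break
            if len(result) == size:
                break
        if not progressed:                # both lists exhausted
            break
    return result
```

For example, with A = 2 and B = 1, the interlaced list alternates two speech-based results with one image-based result, dropping any shot already emitted by the other system.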
1. INTRODUCTION
Multimedia information retrieval (including video search) has tra-
ditionally been approached independently by the text and spoken
document processing community (which uses only degraded text,
e.g. closed captioning, spoken words, visual text) and by the video
processing community (which uses only visual information). It
seems reasonable to hypothesize that robust solutions for multi-
media retrieval might be more easily obtained by multimodal video
search techniques which utilize all information in the component
modalities of multimedia data (including images, speech and non-
speech audio) rather than individual modalities alone. While some
queries can be answered respectably using speech transcripts alone
(e.g. “find stories on topic X”) and others answered acceptably us-
ing images alone (e.g. “find basketball games”, “baseball games”
or “nuclear mushroom clouds”), it seems plausible that perfor-
mance could be improved by search techniques exploiting cues
in both text and image(s). An extreme illustration is the subset
of queries which can only be answered by systems making use
of multiple modalities (e.g. “find shots in which Yasser Arafat is
speaking in front of the Wailing Wall”). This hypothesis about the
potential gains achievable through multimodal search techniques
represents our belief that the individual modalities carry highly
complementary information and should therefore be exploited in
tandem to improve search performance.¹ Despite this, and despite
the extensive research which has been expended on systems for in-
dependent content-based or text-based retrieval, the 2001 and 2002
NIST benchmark tests of systems for searching broadcast video
showed that the problem of integrating information from multi-
ple modalities within a single multimodal video retrieval system
remained a challenge: for example, for manual search, unimodal
(typically speech-based) systems were amongst the top results (see
e.g. [2], [3]).
This paper describes multimodal systems constructed by IBM for
the TRECVID 2003 benchmark of search systems for broadcast
video. These systems all use a late combination or late fusion of
independently developed speech-based and visual content-based
retrieval systems, as will be described. In contrast to the trend
seen by multiple benchmark groups in 2001 and 2002, in which
multimodal systems often performed less well than unimodal (e.g.
speech-only) systems, these multimodal systems outperform our
individual speech-based and content-based retrieval systems on
both manual and interactive search tasks.² Paper organization is
as follows. Sections 2 and 3 discuss the (independently developed
and tuned) unimodal content-based and speech-based retrieval sys-
tems. Sections 4 and 5 discuss techniques for late integration of
these unimodal systems into multimodal systems for manual and
interactive search. Section 6 presents experimental results. The
paper ends with conclusions and future work.
¹Evidence of the usefulness of complementary information sources in
developing robust solutions can be drawn from areas such as audio-visual
speech recognition, in which the complementarity of multiple information
sources has been exploited to obtain more robust solutions [1].
²In manual search, as specified by NIST guidelines, the user interprets
the statement of information need and formulates a query. The user does
not see the search corpus and gets exactly one attempt to launch the query
on the search system. In interactive search, the user can interact with the
system based on intermediate results: the user can refine the query, select
results, and provide positive and negative feedback to the system. In both
cases, guidelines state that query formulation (and interaction, if applicable)
must take less than 15 minutes.
III - 1048 0-7803-8484-9/04/$20.00 ©2004 IEEE ICASSP 2004