MULTIMODAL VIDEO SEARCH TECHNIQUES: LATE FUSION OF SPEECH-BASED
RETRIEVAL AND VISUAL CONTENT-BASED RETRIEVAL
A. Amir†, G. Iyengar‡, C-Y. Lin§, M. Naphade§, A. Natsev§, C. Neti‡, H. J. Nock‡, J. R. Smith§, B. Tseng§
‡IBM TJ Watson Research Center, Yorktown Heights, NY, USA
§IBM TJ Watson Research Center, Hawthorne, NY, USA
†IBM Almaden Research Center, San Jose, CA, USA
ABSTRACT
There has been extensive research into systems for content-based
or text-based (e.g. closed captioning, speech transcript) search,
some of which has been applied to video. However, the 2001
and 2002 NIST TRECVID benchmarks of broadcast video search
systems showed that designing multimodal video search systems
which integrate both speech and image (or image sequence) cues,
and thereby improve performance beyond that achievable by sys-
tems using only speech or image cues, remains a challenging prob-
lem. This paper describes multimodal systems for ad-hoc search
constructed by IBM for the TRECVID 2003 benchmark of search
systems for broadcast video. These multimodal ad-hoc search sys-
tems all use a late fusion of independently developed speech-based
and visual content-based retrieval systems and outperform our in-
dividual speech-based and content-based retrieval systems on both
manual and interactive search tasks. For the manual task, our best
system used a query-dependent linear weighting between speech-
based and image-based retrieval systems. This system has Mean
Average Precision (MAP) performance 20% above our best uni-
modal system for manual search. For the interactive task, where
the user has full knowledge of the query topic and the performance
of the individual search systems, our best system used an interlacing
approach: the user determines (subjectively) optimal counts A and B
for the speech-based and image-based systems, and the multimodal
result set is formed by taking the top A documents from the
speech-based system, then the top B documents from the image-based
system, repeating this process until the desired result set size is
reached. This multimodal interactive search has MAP
40% above our best unimodal interactive search system.
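The two fusion strategies described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the score-normalization assumptions, and the handling of duplicates are our own choices.

```python
def weighted_fusion(speech_scores, image_scores, lam):
    """Query-dependent linear weighting of two retrieval systems.

    speech_scores, image_scores: dicts mapping shot id -> retrieval score
    (assumed comparably normalized); lam: per-query weight in [0, 1] for
    the speech-based system. Returns shot ids ranked by fused score.
    """
    shots = set(speech_scores) | set(image_scores)
    fused = {s: lam * speech_scores.get(s, 0.0)
                + (1.0 - lam) * image_scores.get(s, 0.0)
             for s in shots}
    return sorted(fused, key=fused.get, reverse=True)


def interlace(speech_ranked, image_ranked, a, b, size):
    """Interlaced fusion: alternately take the next a unseen results
    from the speech-based ranked list and the next b unseen results
    from the image-based ranked list until size results are collected.
    """
    assert a >= 1 and b >= 1
    result, seen = [], set()
    streams = [(iter(speech_ranked), a), (iter(image_ranked), b)]
    while len(result) < size:
        progressed = False
        for it, take in streams:
            taken = 0
            for doc in it:
                if doc not in seen:       # skip duplicates across lists
                    seen.add(doc)
                    result.append(doc)
                    taken += 1
                    progressed = True
                if taken == take or len(result) == size:
                    break
            if len(result) == size:
                break
        if not progressed:                # both lists exhausted
            break
    return result
```

For example, with A = 2 and B = 1, the interlaced list alternates two speech-based results with one image-based result, dropping any shot already emitted by the other system.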
1. INTRODUCTION
Multimedia information retrieval (including video search) has tra-
ditionally been approached independently by the text and spoken
document processing community (which uses only degraded text,
e.g. closed captioning, spoken words, visual text) and by the video
processing community (which uses only visual information). It
seems reasonable to hypothesize that robust solutions for multi-
media retrieval might be more easily obtained by multimodal video
search techniques which utilize all information in the component
modalities of multimedia data (including images, speech and non-
speech audio) rather than individual modalities alone. While some
queries can be answered respectably using speech transcripts alone
(e.g. “find stories on topic X”) and others answered acceptably us-
ing images alone (e.g. “find basketball games”, “baseball games”
or “nuclear mushroom clouds”), it seems plausible that perfor-
mance could be improved by search techniques exploiting cues
in both text and image(s). An extreme illustration is the subset
of queries which can only be answered by systems making use
of multiple modalities (e.g. “find shots in which Yasser Arafat is
speaking in front of the Wailing Wall”). This hypothesis about the
potential gains achievable through multimodal search techniques
represents our belief that the individual modalities carry highly
complementary information and should therefore be exploited in
tandem to improve search performance.¹ Despite this, and despite
the extensive research which has been expended on systems for in-
dependent content-based or text-based retrieval, the 2001 and 2002
NIST benchmark tests of systems for searching broadcast video
showed that the problem of integrating information from multi-
ple modalities within a single multimodal video retrieval system
remained a challenge: for example, for manual search, unimodal
(typically speech-based) systems were amongst the top results (see
e.g. [2], [3]).
This paper describes multimodal systems constructed by IBM for
the TRECVID 2003 benchmark of search systems for broadcast
video. These systems all use a late combination or late fusion of
independently developed speech-based and visual content-based
retrieval systems, as will be described. In contrast to the trend
seen by multiple benchmark groups in 2001 and 2002, in which
multimodal systems often performed less well than unimodal (e.g.
speech-only) systems, these multimodal systems outperform our
individual speech-based and content-based retrieval systems on
both manual and interactive search tasks.² Paper organization is
as follows. Sections 2 and 3 discuss the (independently developed
and tuned) unimodal content-based and speech-based retrieval sys-
tems. Sections 4 and 5 discuss techniques for late integration of
these unimodal systems into multimodal systems for manual and
interactive search. Section 6 presents experimental results. The
paper ends with conclusions and future work.
¹Evidence of the usefulness of complementary information sources in
developing robust solutions can be drawn from areas such as audio-visual
speech recognition, in which the complementarity of multiple information
sources has been exploited to obtain more robust solutions [1].
²In manual search, as specified by NIST guidelines, the user interprets
the statement of information need and formulates a query. The user does
not see the search corpus and gets exactly one attempt to launch the query
on the search system. In interactive search, the user can interact with the
system based on intermediate results: the user can refine the query, select
results, and provide positive and negative feedback to the system. In both
cases, guidelines state that query formulation (and interaction, if applicable)
must take less than 15 minutes.
III - 1048 0-7803-8484-9/04/$20.00 ©2004 IEEE ICASSP 2004