Abstract

Advances in speech recognition technology have shown encouraging results for spoken document retrieval, where the average precision often approaches 70% of that achieved for perfect text transcriptions. Typical applications of spoken document retrieval involve retrieving stories from archived video/audio assets. In the CueVideo project, our application focus is spoken document retrieval from a video database for just-in-time training/distributed learning. Typical content is not pre-segmented, has no predefined structure, is of varying audio quality, and may not have domain-specific data available. For such content, we propose a two-level search, namely, a first-level search across the entire video collection, and a second-level search within a specific video. At both search levels, we perform an experimental evaluation of a combination of new and existing query expansion methods, intended to offset retrieval errors due to misrecognition.

Keywords: Spoken document retrieval, query expansion

1. Introduction

Companies and universities today are faced with the challenge of improving their education and training processes to provide "education-on-demand" or "distributed learning" [11, 19]. High-speed networks, together with converging standards in streaming media servers, provide the web-based infrastructure for such applications. However, there remains a need for automated indexing and retrieval techniques for video. In the CueVideo [1, 18] project, we are working on automated indexing techniques that support simultaneous access to information in multiple media, and the ability to cross-link related information. The objective is to automatically produce World Wide Web presentations that can be treated as "education modules". In this paper, we focus on one particular component generally referred to as spoken document retrieval (SDR), where speech recognition is used to index and retrieve relevant information in audio/video [5, 9].
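As a hedged illustration of the general idea (not the paper's implementation), query expansion adds related terms to a short query before matching it against error-prone speech transcripts, so that a document can still score well when the recognizer misses the exact query word. The expansion table, weights, and transcripts below are hypothetical:

```python
# Illustrative sketch of query expansion over noisy transcripts.
# EXPANSIONS is a hypothetical hand-built table, not from the paper.
from collections import Counter

# Map a query term to related terms (synonyms or plausible alternates),
# each with a down-weighted contribution relative to the original term.
EXPANSIONS = {
    "retrieval": {"search": 0.5, "lookup": 0.5},
    "speech": {"audio": 0.5, "spoken": 0.7},
}

def expand_query(terms, expansions, weight=1.0):
    """Return a weighted bag of query terms, augmented with related terms."""
    weighted = {t: weight for t in terms}
    for t in terms:
        for related, w in expansions.get(t, {}).items():
            weighted[related] = max(weighted.get(related, 0.0), w)
    return weighted

def score(doc_tokens, weighted_query):
    """Weighted term-overlap score between a transcript and the expanded query."""
    counts = Counter(doc_tokens)
    return sum(w * counts[t] for t, w in weighted_query.items())

# Toy "recognized" transcripts: video1 never contains the literal query
# words, but expansion still lets it match.
transcripts = {
    "video1": "spoken audio search over lecture recordings".split(),
    "video2": "slides on network protocols".split(),
}
q = expand_query(["speech", "retrieval"], EXPANSIONS)
ranked = sorted(transcripts, key=lambda d: score(transcripts[d], q), reverse=True)
print(ranked[0])  # video1 ranks first via the expanded terms
```

In this toy case the unexpanded query would score zero against both transcripts; the expanded terms "spoken", "audio", and "search" are what retrieve the relevant video.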
The field of spoken document retrieval has been largely dominated by the NIST-funded SDR track of the TREC series of conferences [2, 8, 14]. These systems frequently use standard speech evaluation data, such as broadcast radio and TV news programs, as their test data set. Document boundaries are manually segmented so as to generate "stories", and the task at hand is to retrieve the correct "story". This is necessary to benchmark speech recognition and information retrieval (IR) systems in competitive evaluations. However, in the context of distributed learning, the problem of retrieval is somewhat different. Video is one of the essential components of the relevant material in an education module. A small collection (on the order of tens) of videos may belong to an education module related to a topic, and retrieval within an education module must span this collection. No document boundary information within videos is available. A typical web-based query is three words or less [6], unlike the 15-word query length in standard evaluations. Therefore, we partition the problem of SDR for distributed learning into two levels: the first is to arrive at the equivalent of Yahoo or AltaVista for the video collection, where the unit of retrieval is a single video; the second is to arrive at the equivalent of Control-F (find), i.e., a search within a specific video.

Query Expansion for Imperfect Speech: Applications in Distributed Learning

Savitha Srinivasan, Dulce Ponceleon, Dragutin Petkovic
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA
savitha, dulce, petkovic@almaden.ibm.com

Mahesh Viswanathan
IBM T. J. Watson Research Center, Route 134, Yorktown Heights, NY 10598, USA
maheshv@watson.ibm.com

0-7695-0695-X/00 $10.00 (c) 2000 IEEE
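The two-level partition described above can be sketched as follows. This is a minimal, hypothetical illustration of the structure (rank whole videos first, then locate hits inside the chosen one), not the paper's system; the data layout with per-word time offsets is an assumption:

```python
# Hypothetical sketch of two-level spoken document retrieval:
# level 1 ranks whole videos in a collection; level 2 finds matching
# time offsets within one video (the "Control-F" step).

def rank_videos(collection, query_terms):
    """Level 1: rank videos by total query-term occurrences in their transcript."""
    def hits(words):
        return sum(words.count(t) for t in query_terms)
    return sorted(collection, key=lambda v: hits(collection[v]["words"]), reverse=True)

def find_within(video, query_terms):
    """Level 2: return time offsets (seconds) where any query term was recognized."""
    return [t for t, w in video["timed_words"] if w in query_terms]

# Toy collection: transcripts plus assumed (time, word) alignments.
collection = {
    "lecture_a": {
        "words": ["query", "expansion", "for", "speech"],
        "timed_words": [(12.0, "query"), (12.5, "expansion"), (40.0, "speech")],
    },
    "lecture_b": {
        "words": ["network", "routing"],
        "timed_words": [(5.0, "network"), (9.0, "routing")],
    },
}

query = ["query", "expansion"]
best = rank_videos(collection, query)[0]        # level 1: pick the best video
offsets = find_within(collection[best], query)  # level 2: seek points within it
print(best, offsets)
```

The design point this mirrors is that the two levels return different units: the first a video identifier, the second playback positions inside that video.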