Rich Speech Retrieval Using Query Word Filter

Christian Wartena
Univ. of Applied Sciences and Arts Hannover∗, Hannover, Germany
christian.wartena@fh-hannover.de

Martha Larson
Delft University of Technology, Delft, the Netherlands
m.a.larson@tudelft.nl

ABSTRACT
Rich Speech Retrieval performance improves when general query-language words are filtered and both speech recognition transcripts and metadata are indexed via BM25F(ields).

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Experimentation

Keywords
Spoken content retrieval, Query word classification

1. INTRODUCTION
Our Rich Speech Retrieval (RSR) approach filters words in the query into two categories and treats each separately. RSR is a known-item task that involves returning a ranked list of jump-in points in response to a user query describing a segment of video in which someone is speaking. The queries are given in two formulations: a long form consisting of a natural language description of what the known item is about (ca. one sentence in length) and a short form consisting of a keyword version of the query as it might be issued to a general-purpose search engine. The video corpus contains Creative Commons content collected from blip.tv, and the spoken channel is a mixture of planned and spontaneous speech. Although visual features might prove helpful for some RSR queries, here we investigate only the use of ASR transcripts and metadata. Note that although the known items targeted in the RSR task correspond to particular speech acts, we did not investigate this aspect here. More details on the RSR task are available in [4].

We conjecture that users' queries are a mixture of two distinct types of language: general query language and primary language.
General query language is language that users routinely use when formulating queries for videos during a search session with a general search engine (e.g., video, episode, show). Our conjecture is based on informal observation of user query behavior. It is supported by a user study of podcast search behavior [1] during which subjects reported adding general words such as 'podcast', 'audio' or 'mp3' to queries when looking for podcasts using a general search engine. Primary language is query language that echoes the words of the person who is speaking in the relevant video segment. We assume that automatic speech recognition (ASR) transcripts will help us match primary language in queries with jump-in points, but that general query language found in ASR transcripts is less likely to be specifically relevant to the user's information need. We describe each of our algorithms, report results and end with a conclusion and outlook.

∗At the time the work presented here was done, the author was affiliated with Novay, Enschede (The Netherlands) and Delft University of Technology.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1–2, 2011, Pisa, Italy.

2. EXPERIMENTAL FRAMEWORK
In this section, we describe our approaches to RSR. For all runs, we produce our ranked list of jump-in points using a standard IR algorithm to retrieve video fragments that have been defined on the basis of the ASR transcripts. We return the start point of each fragment as a jump-in point. Fragments are defined as a sequence of sentences containing about 40 non-stop words. Sentences are derived on the basis of punctuation (full stop = sentence end), which is hypothesized by the recognizer and included in the output of the ASR system. If a sentence is less than 40 words in length, subsequent sentences are added until it approximately meets this target. Mark Hepple's [2] part-of-speech (POS) tagger is used to tag and lemmatize all words.
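The fragment construction described above can be sketched as follows. The sliding-window reading (one candidate fragment starting at every sentence, which is how fragments come to overlap), the function and variable names, and the tiny stopword set (a stand-in for the Lucene English and Dutch lists) are our assumptions for illustration; the paper's pipeline additionally lemmatizes words and removes closed-class words by POS before counting.

```python
# Sketch of fragment construction: one candidate fragment per start
# sentence, extended with subsequent sentences until ~40 non-stop
# words are collected. Consecutive start sentences share text,
# which is why fragments overlap.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}  # tiny stand-in

def build_fragments(sentences, target=40, stopwords=STOPWORDS):
    fragments = []
    for start in range(len(sentences)):
        words, end = [], start
        # add whole sentences until the non-stop-word target is reached
        while end < len(sentences) and len(words) < target:
            words.extend(w for w in sentences[end].lower().split()
                         if w not in stopwords)
            end += 1
        # the start sentence index serves as the jump-in point
        fragments.append({"start": start, "terms": words})
    return fragments
```

Here plain whitespace tokens stand in for the lemmatized, POS-filtered terms produced by the UIMA pipeline described below.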
We remove all closed-class words (i.e., prepositions, articles, auxiliaries, particles, etc.). To compensate for POS tagging errors, we additionally remove English and Dutch stop words (standard Lucene search engine stopword lists). Word and sentence segmentation, POS tagging and term selection are implemented as a UIMA (http://uima.apache.org) analysis pipeline.

We carry out ranking using BM25 [5]. Since fragments may overlap, we calculate idf (Eq. 1) on the basis of the sentence, the basic organizational unit of the speech channel:

idf(t) = \log \frac{N - df_t + 0.5}{df_t + 0.5}.   (1)

Here, N is the total number of sentences, and df_t is the number of sentences in which term t occurs. The weight of each term t in each fragment-document d is given by w(d, t):

w(d, t) = idf(t) \cdot \frac{(k + 1) \cdot f_{d,t}}{f_{d,t} + k \cdot \left(1 - b + b \cdot \frac{l_d}{avgdl}\right)},   (2)

where f_{d,t} is the number of occurrences of term t in document d, l_d is the length of d, and avgdl is the average document length. In our experiments, we set k = 2 and b = 0.75.
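Equations (1) and (2) can be sketched in code as follows; the function names, the term-frequency dictionary, and the final summation over the (filtered) query terms are our illustrative assumptions, not part of the paper's implementation.

```python
import math

def idf(n_sentences, df_t):
    # Eq. (1): idf computed over sentences, since fragments overlap
    return math.log((n_sentences - df_t + 0.5) / (df_t + 0.5))

def bm25_weight(tf, doc_len, avgdl, idf_t, k=2.0, b=0.75):
    # Eq. (2): BM25 weight of one term in one fragment-document,
    # with the paper's settings k = 2 and b = 0.75 as defaults
    return idf_t * (k + 1) * tf / (tf + k * (1.0 - b + b * doc_len / avgdl))

def score(query_terms, doc_tf, doc_len, avgdl, idf_map, k=2.0, b=0.75):
    # rank fragments by the summed BM25 weights of the query terms
    return sum(bm25_weight(doc_tf.get(t, 0), doc_len, avgdl,
                           idf_map.get(t, 0.0), k, b)
               for t in query_terms)
```

With doc_len equal to avgdl, the length-normalization factor reduces to 1 and the weight simplifies to idf(t) · (k + 1) · tf / (tf + k).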