USING N-BEST RECOGNITION OUTPUT FOR EXTRACTIVE SUMMARIZATION AND KEYWORD EXTRACTION IN MEETING SPEECH

Yang Liu, Shasha Xie, Fei Liu
The University of Texas at Dallas, Richardson, TX, USA
{yangl,shasha,feiliu}@hlt.utdallas.edu

ABSTRACT

There has been increasing interest recently in meeting understanding, such as summarization, browsing, action item detection, and topic segmentation. However, very limited effort has gone into using rich recognition output (e.g., recognition confidence measures or additional recognition candidates) for these downstream tasks. This paper presents an initial study using n-best recognition hypotheses for two tasks, extractive summarization and keyword extraction. We extend the approaches used on 1-best output to n-best hypotheses: MMR (maximum marginal relevance) for summarization and TFIDF (term frequency, inverse document frequency) weighting for keyword extraction. Our experiments on the ICSI meeting corpus demonstrate promising improvement using n-best hypotheses over 1-best output. These results suggest that n-best lists or lattices are a worthwhile interface between speech recognition and downstream tasks.

Index Terms— summarization, keyword extraction, n-best hypotheses

1. INTRODUCTION

Meetings take place constantly. If we can record them and apply speech and language technology to the recordings, it will greatly aid efficient information management. Recently there have been many efforts on various meeting understanding tasks, such as automatic summarization, meeting browsing, detection of decisions and action items, topic segmentation, keyword extraction, and dialog act tagging. However, most previous work used reference transcripts. Some studies used speech recognition (ASR) output, but so far they have used only the 1-best ASR hypothesis.
ASR errors are known to degrade performance for many downstream tasks, so research has explored using richer information from speech recognizers (such as n-best lists, lattices or confusion networks, and confidence measures) for various tasks, including speech translation, spoken document retrieval, and named entity recognition, among others. In contrast, there is as yet no prior work of this kind for the meeting understanding tasks mentioned above. In this paper, we investigate using n-best output for two meeting processing tasks: extractive summarization and keyword extraction. We extend the approaches used for 1-best output to n-best hypotheses. Our experiments show that (1) using more hypotheses yields better performance for both tasks, but performance levels off after a certain number of hypotheses, and (2) the best result using n-best lists is still significantly worse than using human transcripts. These findings suggest that further studies are needed to incorporate information from multiple candidates as well as confidence measures from recognizers.

2. RELATED WORK

Many techniques have been proposed for meeting summarization. Some used unsupervised approaches and relied on textual information only, such as maximum marginal relevance (MMR), latent semantic analysis, and integer linear programming [1, 2, 3]. Others were based on supervised methods, such as maximum entropy models, SVMs, and conditional random fields, using lexical, structural, and acoustic features [4, 5, 6]. Some previous work used only human transcripts; other work used ASR output and typically reported performance degradation due to recognition errors. Note that much work on speech summarization has been performed in other domains, such as lectures, broadcast news, and voice mails; above we mentioned only work in the meeting domain.
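For concreteness, MMR-style extractive summarization can be sketched as follows. This is a minimal illustration assuming bag-of-words cosine similarity; the trade-off parameter lambda_ and the toy sentences are illustrative choices, not details taken from the paper or from [1].

```python
# Sketch of MMR (maximum marginal relevance) extractive summarization.
# Assumption: bag-of-words cosine similarity; lambda_ balances relevance
# to the whole document against redundancy with already-chosen sentences.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_summarize(sentences, k, lambda_=0.7):
    """Greedily select k sentences maximizing
    lambda_ * sim(sentence, document) - (1 - lambda_) * max sim(sentence, selected)."""
    bows = [Counter(s.lower().split()) for s in sentences]
    doc = sum(bows, Counter())  # whole-document bag of words
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(bows[i], doc)
            redundancy = max((cosine(bows[i], bows[j]) for j in selected),
                             default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]
```

With a lower lambda_, the redundancy penalty dominates, so a near-duplicate of an already-selected sentence is skipped in favor of novel content.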
Most previous keyword extraction work has been done in the written text domain, often based on information such as frequency, word association, sentence/document structure or position, and linguistic knowledge, modeled using either unsupervised or supervised methods. Compared to the text domain, there have been very limited studies on speech data. [7] evaluated the performance of the tool “Extractor” on broadcast news transcripts of varying quality. [8] compared two lexical resources, WordNet and the EDR electronic dictionary, to extract simple noun keywords from a multiparty meeting corpus. [9, 10] investigated unsupervised and supervised approaches for keyword extraction in the meeting domain, and showed that using ASR output hurts system performance compared to human transcripts.

Various studies have been conducted on coupling ASR and language processing tasks. Loose coupling is the most widely used interface: a one-way pipeline where ASR output is used as input to subsequent language processing components. The interface can be 1-best, n-best, lattices, or confusion networks. For machine translation, [11] used n-best lists for reranking by optimizing interpolation weights for ASR and translation, and [12] used confusion networks, but without much improvement over n-best lists. Lattices have been studied intensively for spoken document retrieval and indexing [13, 14], with better reported performance than using only the 1-best ASR output. For named entity recognition on speech data, [15] used n-best lists and obtained a small improvement. There have also been studies attempting to couple ASR and language processing components more tightly; for example, [16] proposed a joint decoding approach for speech translation. However, such systems are often complex and hard to optimize, and do not always outperform those using loose coupling.
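The TFIDF weighting used for keyword extraction can be sketched as below. This is a minimal illustration with the standard tf * log(N/df) formulation; the toy documents stand in for meeting transcripts and are not from the paper's data.

```python
# Sketch of TFIDF keyword ranking over a document collection.
# Assumption: plain tf * log(N / df) weighting with whitespace tokenization.
import math
from collections import Counter

def tfidf_keywords(documents, doc_index, top_k=3):
    """Rank the words of documents[doc_index] by tf(w) * log(N / df(w)),
    where df(w) is the number of documents containing w."""
    tokenized = [d.lower().split() for d in documents]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # each document counts once per word
    tf = Counter(tokenized[doc_index])
    scores = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_k]]
```

Words that occur in every document (e.g., function words like “the”) get an idf of zero and so never rank as keywords, while frequent document-specific terms rank highest.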
There are a few prior studies related to speech summarization and keyword extraction using information beyond the 1-best ASR output, but none of this work is in the meeting domain. [7] reported some improved results using additional hypotheses in the n-best list for keyword extraction. [17] performed topic clustering using confidence scores, which resulted in better clusters and indirectly helped summarization. Very recently, [18] used confusion networks and expected word counts for speech summarization of Mandarin broadcast news, and achieved better performance than with 1-best ASR output. In this paper, our goal is to leverage more ASR hypotheses for summarization and keyword extraction on meeting data.
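One simple way to leverage an n-best list in term-based methods like MMR or TFIDF is to pool term counts over all hypotheses before computing weights. The sketch below uses a rank-based weight of 1/(r+1) per hypothesis; this particular weighting is an illustrative assumption, not necessarily the scheme used in the paper (posterior-based weights would be another option).

```python
# Sketch: pooling fractional term counts over an n-best list, so that words
# appearing in many hypotheses contribute more than words in low-rank ones.
# Assumption: rank-based weights 1/(r+1), normalized to sum to one.
from collections import Counter

def nbest_term_counts(nbest):
    """nbest: list of hypothesis strings, best first.
    Returns fractional term counts pooled over all hypotheses."""
    counts = Counter()
    total_weight = sum(1.0 / (r + 1) for r in range(len(nbest)))
    for r, hyp in enumerate(nbest):
        w = (1.0 / (r + 1)) / total_weight  # normalized rank weight
        for tok in hyp.lower().split():
            counts[tok] += w
    return counts
```

These pooled counts can then replace the 1-best term frequencies in a summarizer or keyword extractor, letting words confirmed by multiple hypotheses outweigh likely misrecognitions that appear only in a single low-rank candidate.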