Enhanced Language Modeling for Extractive Speech Summarization with Sentence Relatedness Information

Shih-Hung Liu 1,2, Kuan-Yu Chen 1,2, Yu-Lun Hsieh 1, Berlin Chen 3, Hsin-Min Wang 1, Hsu-Chun Yen 2, Wen-Lian Hsu 1

1 Institute of Information Science, Academia Sinica, Taiwan
2 National Taiwan University, Taiwan
3 National Taiwan Normal University, Taiwan

E-mail: 1 {journey, kychen, morphe, whm, hsu}@iis.sinica.edu.tw, 2 yen@cc.ee.ntu.edu.tw, 3 berlin@ntnu.edu.tw

Abstract

Extractive summarization is intended to automatically select a set of representative sentences from a text or spoken document that can concisely express the most important topics of the document. Language modeling (LM) has been proven to be a promising framework for performing extractive summarization in an unsupervised manner. However, there remain two fundamental challenges facing existing LM-based methods. One is how to construct the sentence models involved in the LM framework more accurately without resorting to external information sources. The other is how to additionally take into account the sentence-level structural relationships embedded in a document for important sentence selection. To address these two challenges, in this paper we explore a novel approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized, which can be used not only to enhance the estimation of various sentence models but also to exploit the sentence-level structural relationships for better summarization performance. Further, the utilities of our proposed methods and several state-of-the-art unsupervised methods are analyzed and compared extensively. A series of experiments conducted on a Mandarin broadcast news summarization task demonstrates the effectiveness and viability of our method.

Index Terms: speech summarization, language modeling, clustering, relevance, sentence relatedness

1. Introduction

Due in large part to the advances in automatic speech recognition and the popularity as well as ubiquity of multimedia associated with spoken documents [1, 2], research on speech summarization has attracted increasing interest in the speech processing community over the past decade [3-6]. Extractive speech summarization aims at producing a concise summary by selecting salient sentences or paragraphs from an original spoken document according to a predefined target summarization ratio. By doing so, it can indicate the important speech segments, with their corresponding transcripts, in the document for users to listen to and digest. The various extractive summarization methods that have been developed so far roughly fall into three major categories [5-7]: 1) methods based simply on sentence structural or positional cues, 2) methods based on unsupervised statistical measures, and 3) methods based on supervised sentence classification. For the first category, the important sentences are selected from salient parts of a spoken document [8]; for example, sentences can be drawn from the introductory and/or concluding parts. However, such methods can only be applied to limited domains or document structures. On the other hand, extractive text or speech summarization using unsupervised statistical measures attempts to select salient sentences on the basis of statistical features of the sentences, or of the words in the sentences, in an unsupervised manner. The statistical features derived for each sentence can include word frequency, linguistic scores calculated from lexical and/or semantic cues, recognition confidence, similarity (proximity) measures, prosodic information, and so forth.
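To make the selection process just described concrete, the sketch below ranks sentences by a simple word-frequency score (one of the statistical features mentioned above; practical systems combine several) and keeps the top-scoring sentences until a target summarization ratio is met. The function names and the scoring choice are illustrative assumptions, not a method from this paper.

```python
from collections import Counter

def extract_summary(sentences, ratio=0.3):
    """Rank sentences by an average word-frequency score and keep the
    top-scoring ones until the target summarization ratio is met.
    The summary preserves the original sentence order."""
    # Document-level word frequencies serve as the statistical feature.
    freq = Counter(w for s in sentences for w in s.split())
    # Score each sentence by the average document frequency of its words.
    scores = {i: sum(freq[w] for w in s.split()) / max(len(s.split()), 1)
              for i, s in enumerate(sentences)}
    # The ratio fixes how many sentences the summary may contain.
    budget = max(1, int(round(ratio * len(sentences))))
    picked = sorted(sorted(scores, key=scores.get, reverse=True)[:budget])
    return [sentences[i] for i in picked]
```

For instance, with three sentences and a ratio of about one third, only the single highest-scoring sentence survives.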
The unsupervised summarization methods based on these features have garnered much research attention and may be further grouped into three subcategories: 1) vector-based methods, which include, but are not limited to, the vector space model (VSM) [9], latent semantic analysis (LSA) [9], and the maximum marginal relevance (MMR) method [10]; 2) graph-based methods, which include, among others, the Markov random walk (MRW) [11], LexRank [12], and the minimum dominating set algorithm [13]; and 3) combinatorial optimization-based methods, which include the submodularity-based method (Submodularity) [14] and the integer linear programming (ILP) method [15], to name just a few. Aside from these, a number of supervised classification-based methods using various kinds of indicative features have also been developed, such as Gaussian mixture models (GMM) [16], the Bayesian classifier (BC) [17], the support vector machine (SVM) [18], and conditional random fields (CRFs) [19]. In these methods, important sentence selection is usually formulated as a binary classification problem: a sentence is either included in a summary or not. Such classification-based methods need a set of training documents along with their corresponding handcrafted summaries (or labeled data) for training the classifiers (or summarizers). However, manual annotation is both time-consuming and labor-intensive. Although the performance of unsupervised summarizers is not always comparable to that of supervised summarizers, their ease of implementation and portability still make them attractive. Interested readers may also refer to [5-7, 20] for comprehensive and enjoyable discussions of major methods that have been successfully developed and applied to a wide variety of text and speech summarization tasks.
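As a concrete instance of one of these unsupervised measures, the following is a minimal sketch of greedy MMR-style selection: at each step it picks the candidate sentence that best trades off relevance to the whole document against redundancy with the sentences already selected. The cosine-over-term-counts similarity, the parameter `lam`, and all names are illustrative assumptions, not the exact formulation of [10].

```python
import math
from collections import Counter

def _cosine(a, b):
    """Cosine similarity between two strings over raw term counts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    num = sum(ca[w] * cb[w] for w in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def mmr_select(sentences, document, k=2, lam=0.7):
    """Greedy MMR: lam balances relevance to the document against
    redundancy with already-selected sentences."""
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            rel = _cosine(sentences[i], document)
            red = max((_cosine(sentences[i], sentences[j])
                       for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

The redundancy term is what distinguishes MMR from plain relevance ranking: a sentence that duplicates an already-selected one is penalized even if it is highly relevant on its own.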
Orthogonal to the aforementioned summarization methods, a more recent line of research is devoted to capitalizing on the statistical language modeling (LM) framework in an unsupervised manner for important sentence selection [21-24], which has been applied to extractive speech summarization with preliminary success. However, there remain two fundamental challenges facing the existing LM-based methods. One is how to estimate the sentence models involved in the LM framework more accurately without resorting to external information sources. The other is how to additionally take into account the sentence-level structural relationships embedded in the document to be summarized for important sentence selection. To address these two challenges, in this paper we explore a novel approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized.

Copyright 2014 ISCA. INTERSPEECH 2014, 14-18 September 2014, Singapore.
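A minimal sketch of the kind of LM-based ranking this line of work builds on: each sentence induces a unigram language model, and sentences are scored by the log-likelihood of generating the document's words under that model, smoothed against a document-level model. The Jelinek-Mercer smoothing weight `alpha` and the function names are assumptions for illustration, not the paper's exact estimator.

```python
import math
from collections import Counter

def lm_score(sentence, document, alpha=0.5):
    """Log-likelihood of the document's words under the sentence's
    unigram model, linearly interpolated (Jelinek-Mercer) with the
    document model. Higher means the sentence is more representative."""
    s_counts, s_len = Counter(sentence.split()), len(sentence.split())
    d_words = document.split()
    d_counts, d_len = Counter(d_words), len(d_words)
    logp = 0.0
    for w in d_words:
        p_s = s_counts[w] / s_len if s_len else 0.0  # sentence model
        p_d = d_counts[w] / d_len                    # smoothing model
        logp += math.log(alpha * p_s + (1 - alpha) * p_d)
    return logp

def rank_by_lm(sentences, document):
    """Rank candidate sentences by their generative score."""
    return sorted(sentences, key=lambda s: lm_score(s, document),
                  reverse=True)
```

The first challenge named above shows up directly in `p_s`: a short sentence gives sparse, unreliable maximum-likelihood estimates, which is why better sentence-model estimation (here, only crude interpolation with the document model) matters.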