AUTOMATIC KEYWORD EXTRACTION FOR THE MEETING CORPUS USING SUPERVISED APPROACH AND BIGRAM EXPANSION

Fei Liu, Feifan Liu, Yang Liu
Department of Computer Science
The University of Texas at Dallas
{feiliu, ffliu, yangl}@hlt.utdallas.edu

ABSTRACT

In this paper, we tackle the problem of automatic keyword extraction in the meeting domain, a genre significantly different from written text. For the supervised framework, we propose a rich set of features beyond the typical TFIDF measures, such as sentence salience weight, lexical features, summary sentences, and speaker information. We also evaluate different candidate sampling approaches for better model training and testing. In addition, we introduce a bigram expansion module that aims at extracting "entity bigrams" using Web resources. Using the ICSI meeting corpus, we demonstrate the effectiveness of the features and show that the supervised method and the bigram expansion module outperform the unsupervised TFIDF selection with POS (part-of-speech) filtering. Finally, we show that the approaches introduced in this paper also perform well on speech recognition output.

Index Terms— keyword extraction, meeting transcripts, TFIDF, feature selection

1. INTRODUCTION

Keywords can provide important information about the content of documents. However, pre-annotated keywords are often not available for spoken documents, such as meeting transcripts. Recent research has focused on a few meeting understanding tasks (such as summarization, topic segmentation, and browsing), but not much on automatic keyword extraction.

There has been various previous work on keyword extraction, primarily in different text domains. TFIDF-based selection has been widely used [1, 2]. It is computationally efficient and performs reasonably well. Keyword extraction has also been treated as a supervised learning problem [1, 2, 3], where a classifier is used to classify candidate words into positive or negative instances using a set of features.
Other research on keyword extraction has also taken advantage of semantic resources [4], Web-based metrics such as the PMI (point-wise mutual information) score [3], or graph-based algorithms (e.g., [5] used a reinforcement approach to perform keyword extraction and summarization simultaneously).

Meeting speech is intrinsically different from written text. For example, there are typically multiple participants in a meeting, the discussion is not well organized, and the speech is spontaneous and contains disfluencies and ill-formed sentences. It is an open question whether existing approaches can be successfully applied to this domain. In this paper, building on previous keyword extraction work, we propose a supervised approach to automatic extraction of keywords from meeting transcripts. Features that have been found useful for the text domain are not necessarily useful, or may even be unavailable, for the meeting genre, such as the title or structural information like paragraphs. We utilize a rich set of well-motivated features for this task, such as lexical features representing whether a sentence is a decision-making sentence, and the relationship between keywords and summary sentences. We perform feature selection to evaluate the effectiveness of the various features, as well as sampling to select word candidates. In addition, we introduce a bigram expansion module that uses Google to extract "entity bigrams". Our method significantly outperforms the TFIDF baseline with POS filtering. The same improvement is also observed when using the recognition output.

2. KEYWORD EXTRACTION APPROACHES

Our task is to extract keywords for each of the topic segments in a meeting transcript. Therefore, by "document" we mean a topic segment in the remainder of this paper.

2.1. Supervised Framework

In the supervised approach, a maximum entropy (MaxEnt) classifier is used to determine whether a unigram word is a keyword (binary classification).
Each candidate word is represented by a variety of features, explained below.

(A) Features Used

- TFIDF. These include TF, IDF, and TFIDF. The term frequency (TF) for a word w_i in a document is the number of times the word occurs in the document. The inverse document frequency (IDF) value is log(N/N_i), where N_i denotes the number of documents containing word w_i, and N is the total number of documents in the collection.

- Position features. These features represent where a word first appears, defined as its position normalized by the total number of words or sentences in the document, referred to as 'dis-word' and 'dis-sent' respectively.

- Stopword features. We generate a stopword list by sorting all the words in increasing order of their IDF values. Three binary features are defined: 'sw-200', 'sw-300', and 'sw-500', denoting whether a candidate word is among the top 200, 300, or 500 words of the list. A word with a low IDF occurs in many documents and is thus not topic indicative.

- Sentence features. These are extracted from the sentences containing the word. Feature 'sent-score' is the salience score of a sentence, calculated based on its cosine similarity to the entire meeting under the vector space model. Feature 'sent-len' is the length of the sentence. If a candidate word appears in several sentences, we empirically use the maximum length and the highest salience score among those sentences.

- Lexical features. These include three feature classes: lex-prp, lex-jj, lex-context. We notice that keywords often appear in

978-1-4244-3472-5/08/$25.00 ©2008 IEEE SLT 2008
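To make the feature definitions above concrete, the following is a minimal Python sketch of the TF/IDF features, the 'dis-word' position feature, and the cosine similarity underlying 'sent-score'. The function names and data layout are our own illustration under the paper's definitions (a "document" is a topic segment, represented here as a list of word tokens); the paper does not specify an implementation.

```python
import math
from collections import Counter

def tfidf_features(documents):
    """TF, IDF, and TFIDF for each word in each document.
    IDF = log(N / N_i), with N the collection size and N_i the
    number of documents containing the word."""
    n_docs = len(documents)
    df = Counter()                     # document frequency N_i
    for doc in documents:
        df.update(set(doc))
    features = []
    for doc in documents:
        tf = Counter(doc)              # raw term frequency
        feats = {}
        for word, count in tf.items():
            idf = math.log(n_docs / df[word])
            feats[word] = {"tf": count, "idf": idf, "tfidf": count * idf}
        features.append(feats)
    return features

def position_features(doc_words):
    """'dis-word': position of a word's first occurrence,
    normalized by the total number of words in the document."""
    total = len(doc_words)
    first_pos = {}
    for i, word in enumerate(doc_words):
        first_pos.setdefault(word, i / total)
    return first_pos

def cosine(u, v):
    """Cosine similarity between two bag-of-words vectors (dicts),
    as used to score sentence salience against the whole meeting."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that a word on the stopword list (low IDF, hence near-zero TFIDF) contributes little under this scheme, which is consistent with the binary stopword features above.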