AUTOMATIC KEYWORD EXTRACTION FOR THE MEETING CORPUS USING
SUPERVISED APPROACH AND BIGRAM EXPANSION
Fei Liu, Feifan Liu, Yang Liu
Department of Computer Science
The University of Texas at Dallas
{feiliu, ffliu, yangl}@hlt.utdallas.edu
ABSTRACT
In this paper, we tackle the problem of automatic keyword extraction
in the meeting domain, a genre significantly different from written
text. For the supervised framework, we propose a rich set of features beyond the typical TFIDF measures, such as sentence salience weight, lexical features, summary sentences, and speaker information. We also evaluate different candidate sampling approaches for better model training and testing. In addition, we introduce a bigram expansion module that extracts “entity bigrams”
using Web resources. Using the ICSI meeting corpus, we demon-
strate the effectiveness of the features and show that the supervised
method and the bigram expansion module outperform the unsuper-
vised TFIDF selection with POS (part-of-speech) filtering. Finally, we show that the approaches introduced in this paper also perform well on speech recognition output.
Index Terms— keyword extraction, meeting transcripts, TFIDF,
feature selection
1. INTRODUCTION
Keywords can provide important information about the content of
documents. However, pre-annotated keywords are often not avail-
able for spoken documents, such as meeting transcripts. Recent re-
search has focused on a few meeting understanding tasks (such as
summarization, topic segmentation, browsing), but not much on au-
tomatic keyword extraction.
There is a large body of previous work on keyword extraction, primarily in text domains. TFIDF-based selection has been
widely used [1, 2]. It is computationally efficient and performs rea-
sonably well. Keyword extraction has also been treated as a super-
vised learning problem [1, 2, 3], where a classifier is used to classify
candidate words into positive or negative instances using a set of fea-
tures. Other research for keyword extraction has also taken advan-
tage of semantic resources [4], Web-based metric, such as PMI score
(point-wise mutual information) [3], or graph-based algorithms (e.g.,
[5] that attempted to use a reinforcement approach to do keyword ex-
traction and summarization simultaneously).
Meeting speech is intrinsically different from written text: a meeting typically has multiple participants, the discussion is loosely organized, and the speech is spontaneous, containing disfluencies and ill-formed sentences. Whether existing approaches can be successfully applied to this domain is an open question.
In this paper, based on the previous keyword extraction work, we
propose a supervised approach to automatic extraction of keywords
for meeting transcripts. Features that have been found useful in the text domain are not necessarily useful for the meeting genre, and some, such as the title or structural information like paragraphs, may be unavailable altogether. We utilize a rich set of well-motivated features for this task, such as lexical features indicating whether a sentence is a decision-making sentence, and the relationship between the keywords and the summary sentences. We perform feature selection to evaluate the
effectiveness of various features, as well as sampling to select word
candidates. In addition, we introduce a bigram expansion module
that uses Google to extract “entity bigrams”. Our method signifi-
cantly outperforms the TFIDF baseline with POS filtering. The same
improvement is also observed when using the recognition output.
2. KEYWORD EXTRACTION APPROACHES
Our task is to extract keywords for each of the topic segments in a meeting transcript. Therefore, in the rest of this paper, “document” refers to a topic segment.
2.1. Supervised Framework
In the supervised approach, a maximum entropy (MaxEnt) classifier
is used to determine whether a unigram word is a keyword (binary
classification). Each candidate word is represented by a variety of
features, explained below.
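The paper does not specify the MaxEnt toolkit or training regime used. As an illustrative sketch only, a binary MaxEnt model is equivalent to logistic regression and can be trained by stochastic gradient ascent on the log-likelihood; the function names and hyperparameters below are hypothetical:

```python
import math

def train_maxent(examples, labels, epochs=200, lr=0.1):
    """Binary MaxEnt (logistic regression) trained by gradient ascent.
    `examples` is a list of feature vectors; `labels` are 0/1
    keyword indicators. Hyperparameters are illustrative."""
    dim = len(examples[0])
    w = [0.0] * (dim + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # gradient of the log-likelihood is (y - p) * x
            for i, xi in enumerate(x):
                w[i] += lr * (y - p) * xi
            w[-1] += lr * (y - p)
    return w

def predict(w, x):
    """Probability that the candidate word is a keyword."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

In practice any off-the-shelf MaxEnt or logistic-regression package would serve; the point is only that each candidate word is mapped to a feature vector and scored for keyword-ness.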
(A) Features Used
• TFIDF. These include: TF, IDF, and TFIDF. The term fre-
quency (TF) for a word wi in a document is the number of
times the word occurs in the document. The inverse docu-
ment frequency (IDF) value is log(N/Ni ), where Ni denotes
the number of the documents containing word wi , and N is
the total number of the documents in the collection.
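Under these definitions, the three scores can be computed directly over a collection of topic segments. A minimal sketch, assuming raw term counts and the natural logarithm (the function name is illustrative):

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Compute (TF, IDF, TFIDF) for each word of each document,
    where a 'document' is a topic segment given as a token list."""
    n_docs = len(documents)
    df = Counter()                  # N_i: number of documents containing w_i
    for doc in documents:
        df.update(set(doc))
    all_scores = []
    for doc in documents:
        tf = Counter(doc)           # raw term frequency within the document
        all_scores.append({
            w: (tf[w],
                math.log(n_docs / df[w]),
                tf[w] * math.log(n_docs / df[w]))
            for w in tf
        })
    return all_scores
```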
• Position features. These features represent where a word first
appears, defined as its position normalized by the total num-
ber of words or sentences in the document, referred as ‘dis-
word’ and ‘dis-sent’ respectively.
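These two position features could be computed as follows; this sketch assumes one-based counting of the first occurrence, which the paper does not specify:

```python
def position_features(sentences, word):
    """'dis-word' / 'dis-sent': position of a word's first occurrence,
    normalized by the word and sentence counts of the document.
    `sentences` is a list of token lists for one topic segment."""
    words = [w for sent in sentences for w in sent]
    # Candidate words are drawn from the document, so `word` occurs in it.
    dis_word = (words.index(word) + 1) / len(words)
    for i, sent in enumerate(sentences):
        if word in sent:
            dis_sent = (i + 1) / len(sentences)
            break
    return dis_word, dis_sent
```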
• Stopword features. We generate a stopword list by sorting all the words in increasing order of their IDF values. Three binary features are defined: ‘sw-200’, ‘sw-300’, and ‘sw-500’, denoting whether a candidate word is among the top 200, 300, or 500 words of the list. A word with a low IDF occurs in many documents and is therefore not topic indicative.
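A sketch of these binary features; the `cutoffs` parameter is added here for illustration, with the paper's values as defaults:

```python
def stopword_features(word, idf, cutoffs=(200, 300, 500)):
    """Binary 'sw-k' features: whether `word` is among the k words
    with the lowest IDF (i.e., the most common words in the
    collection). `idf` maps each word to its IDF value."""
    ranked = sorted(idf, key=idf.get)   # increasing IDF: common words first
    return {f"sw-{k}": word in ranked[:k] for k in cutoffs}
```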
• Sentence features. These are extracted from the sentences
containing the word. Feature ‘sent-score’ is the salience
score of a sentence, calculated as its cosine similarity
to the entire meeting under the vector space model. Fea-
ture ‘sent-len’ is the length of the sentence. If a candidate
word appears in several sentences, we empirically use the
maximum length and the highest salience score among those
sentences.
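The sentence features could be computed as in the sketch below. The paper's vector space model is not fully specified here, so raw bag-of-words counts are assumed for the cosine similarity:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_features(sentences, word):
    """'sent-score' / 'sent-len' for a candidate word: the highest
    salience score and the maximum length over the sentences that
    contain it, where salience is the cosine similarity between
    the sentence and the whole meeting."""
    meeting = Counter(w for s in sentences for w in s)
    score, length = 0.0, 0
    for sent in sentences:
        if word in sent:
            score = max(score, cosine(Counter(sent), meeting))
            length = max(length, len(sent))
    return score, length
```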
• Lexical features. These include three feature classes: lex-prp,
lex-jj, lex-context. We notice that keywords often appear in
181 978-1-4244-3472-5/08/$25.00 ©2008 IEEE SLT 2008