Automated Multi-document Summarization in NeATS

Chin-Yew Lin and Eduard Hovy
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292-6695
Tel: +1-310-448-8711/8731
{cyl,hovy}@isi.edu

ABSTRACT
This paper describes the multi-document text summarization system NeATS. Using a simple algorithm, NeATS was among the top two performers of the DUC-01 evaluation.

Keywords
Multi-document text summarization; NeATS

1. OVERVIEW
In this paper we describe work in our NeATS (Next Generation Automated Text Summarization) project on multi-document summarization. To select important content, we used techniques that proved effective in single-document summarization, such as sentence position [1], term frequency [13], topic signature [11,7], and term clustering. To remove redundancy, we used MMR [3]. To improve cohesion and coherence, we used stigma word filters [4] and time stamps.

Although most of the individual techniques are not new, assembling them and applying them to multi-document summarization is new. Including lead sentences to ensure coherence is also new, and turned out to be important. For much of the system we re-used modules built in prior work, notably the SUMMARIST single-document summarizer [7] and the Webclopedia question answering system [8,9].

NeATS was evaluated in the Document Understanding Conference DUC-01 [15]. It was consistently among the top performers in the multi-document summarization track. In the aggregating peer-to-peer comparison suggested by [14], NeATS scored first in precision and second in recall among 12 participants. The system also achieved the best F1-measure score across all summary sizes.

2. NeATS
NeATS attempts to extract relevant or interesting portions from a set of documents about some topic and to present them in coherent order. It is tailored to the genre of newspaper news articles, and it works for English, but can be made multilingual without a great deal of effort.
At present NeATS produces generic (author’s point of view) summaries, but it could be made sensitive to desired focus topics input by a user. Given a collection of sets of newspaper articles as input, NeATS applies the following 6 steps.

3. ALGORITHM

3.0 Input
The input is a set of topic groups. Each topic group is a set of approximately 10 newspaper articles selected by the evaluation organizers. A topic group may focus on a single natural disaster (earthquake, hurricane, etc.), a single event (election, car race, etc.), multiple instances of a type of event (many earthquakes, elections, etc.), or a single person (in which case the summary would be a biography).

3.1 Extract and Rank Passages
Given the input documents, form a query, extract sentences, and rank them, using modules of Webclopedia:
1.a identify key words for each topic group: compute unigram, bigram, and trigram topic signatures [11] for each group, using the likelihood ratio λ [2]. A topic signature is a list of words/phrases, each with a strength, that characterizes the group and differentiates it from others [7]
1.b to facilitate fallback (query generalization), remove from the signatures all words or phrases that occur in fewer than half the texts of the topic group
1.c save the signatures in a tree, organized by signature overlap. We use the format of Webclopedia’s parser CONTEX [5,6]; see Figure 1
1.d use Webclopedia’s ranking algorithm to rank sentences [8].

3.2 Filter for Content
Given the ranked list of sentences, re-rank or remove sentences according to the following conditions:
2.a remove all sentences with sentence position > 10.
This is a simple version of SUMMARIST’s Optimum Position Policy (OPP) [10], which records the relative importance of sentence positions.
2.b decrease the ranking score of all sentences containing stigma words [4] (day names; time expressions; sentences starting with conjunctions such as “but”, “although”; sentences containing quotation marks; sentences containing the verb “say”).
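
The content filter in steps 2.a and 2.b can be illustrated with a minimal Python sketch. This is not the NeATS implementation. The `Sentence` record, the word lists, and the `STIGMA_PENALTY` multiplier are all assumptions made for illustration: the paper names the stigma categories but gives neither the exact lists nor the size of the score decrease. Time expressions are covered only through day names here.

```python
import re
from dataclasses import dataclass

MAX_POSITION = 10      # cutoff from step 2.a
STIGMA_PENALTY = 0.5   # assumed multiplier; the paper gives no value

# Hypothetical cue lists: the paper lists categories, not exact words.
DAY_NAMES = {"monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"}
CONJUNCTIONS = {"but", "although"}
SAY_FORMS = {"say", "says", "said", "saying"}
QUOTE_CHARS = '"\u201c\u201d'  # straight and curly double quotes


@dataclass
class Sentence:
    text: str
    position: int   # 1-based position within its source document
    score: float    # score assigned by the ranking step (1.d)


def has_stigma(text: str) -> bool:
    """Return True if the sentence matches any assumed stigma cue."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    return (
        words[0] in CONJUNCTIONS                 # starts with a conjunction
        or any(q in text for q in QUOTE_CHARS)   # contains quotation marks
        or any(w in DAY_NAMES for w in words)    # mentions a day name
        or any(w in SAY_FORMS for w in words)    # contains the verb "say"
    )


def filter_for_content(sentences):
    """Apply 2.a (drop late sentences) and 2.b (penalize stigma), then re-rank."""
    kept = []
    for s in sentences:
        if s.position > MAX_POSITION:
            continue  # step 2.a: remove sentences past position 10
        score = s.score * STIGMA_PENALTY if has_stigma(s.text) else s.score
        kept.append(Sentence(s.text, s.position, score))
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(kept, key=lambda s: s.score, reverse=True)
```

A multiplicative penalty is only one way to "decrease" a score. A fixed subtraction, or a separate penalty per stigma category, would fit the description in step 2.b equally well.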