Selecting Text Spans for Document Summaries: Heuristics and Metrics

Vibhu Mittal and Mark Kantrowitz
Just Research
4616 Henry Street
Pittsburgh, PA 15213, U.S.A.

Jade Goldstein and Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, U.S.A.

Abstract

Human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style, and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. This paper presents an analysis of news-article summaries generated by sentence extraction. Sentences are ranked for potential inclusion in the summary using a weighted combination of linguistic features – derived from an analysis of news-wire summaries. This paper evaluates the relative effectiveness of these features. In order to do so, we discuss the construction of a large corpus of extraction-based summaries, and characterize the underlying degree of difficulty of summarization at different compression levels on articles in this corpus. Results on our feature set are presented after normalization by this degree of difficulty.

Introduction

Summarization is a particularly difficult task for computers because it requires natural language understanding, abstraction and generation. Effective summarization, like effective writing, is neither easy nor innate; rather, it is a skill that is developed through instruction and practice. Writing a summary requires the summarizer to select, evaluate, order and aggregate items of information according to their relevance to a particular subject or purpose.

Most of the previous work in summarization has focused on a related, but simpler, problem: text-span deletion.
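The ranking scheme described in the abstract – a weighted combination of linguistic features – can be sketched as follows. The feature names, weights, and feature values below are hypothetical illustrations, not the feature set actually used in the paper:

```python
# Sketch of extraction-based sentence ranking: each sentence gets a
# salience score from a weighted linear combination of feature values.
# All feature names, weights, and values here are hypothetical.

def score_sentence(features, weights):
    """Weighted linear combination of a sentence's feature values."""
    return sum(weights[name] * value for name, value in features.items())

def rank_sentences(doc_features, weights):
    """Return sentence indices sorted by descending salience score."""
    scores = [score_sentence(f, weights) for f in doc_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical features for a three-sentence document.
weights = {"position": 0.4, "tf": 0.4, "cue_phrase": 0.2}
doc = [
    {"position": 1.0, "tf": 0.3, "cue_phrase": 0.0},  # lead sentence
    {"position": 0.5, "tf": 0.9, "cue_phrase": 1.0},
    {"position": 0.0, "tf": 0.2, "cue_phrase": 0.0},
]
ranking = rank_sentences(doc, weights)  # most salient sentence first
```

In a real system the weights would be tuned against a reference corpus, which is the role the 25,000-story news corpus plays in this paper.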
In text-span deletion – also referred to as text-span extraction – the system attempts to delete “less important” spans of text from the original document; the text that remains can be deemed a summary of the original document. Most of the previous work on extraction-based summarization is based on the use of statistical techniques such as frequency or variance analysis applied to linguistic units such as tokens, names, and anaphoric or co-reference information (e.g., (Baldwin & Morton 1998; Boguraev & Kennedy 1997; Aone et al. 1997; Carbonell & Goldstein 1998; Hovy & Lin 1997; Mitra, Singhal, & Buckley 1997)). More involved approaches have attempted to use discourse structure (Marcu 1997), combinations of information extraction and language generation (Klavans & Shaw 1995; McKeown, Robin, & Kukich 1995), and the use of machine learning to find patterns in text (Teufel & Moens 1997; Barzilay & Elhadad 1997; Strzalkowski, Wang, & Wise 1998). However, it is difficult to compare the relative merits of these various approaches because most of the evaluations reported were conducted on different corpora, of varying sizes at varying levels of compression, and were often informal and subjective.

This paper discusses summarization by sentence extraction and makes the following contributions: (1) based on a corpus of approximately 25,000 news stories, we identified several syntactic and linguistic features for ranking sentences; (2) we evaluated these features – on a held-out test set that was not used for the analysis – at different levels of compression; and (3) we discuss the degree of difficulty inherent in the corpus being used for the task evaluation, in an effort to normalize scores obtained across different corpora.

Copyright © 1999, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
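Contribution (2) evaluates features at different levels of compression, i.e., the fraction of the original document retained in the extract. A minimal sketch of producing an extract at a given compression level, assuming a salience ranking has already been computed, might look like this (the example sentences and ranking are hypothetical):

```python
# Sketch of text-span extraction at a fixed compression level: keep the
# top-ranked fraction of sentences, then restore original document order
# so the extract reads coherently. Inputs below are hypothetical.
import math

def extract_summary(sentences, ranking, compression):
    """Keep the top `compression` fraction of sentences by salience,
    returned in original document order."""
    k = max(1, math.ceil(len(sentences) * compression))
    chosen = sorted(ranking[:k])          # restore document order
    return [sentences[i] for i in chosen]

sentences = ["A.", "B.", "C.", "D.", "E."]
ranking = [3, 0, 4, 1, 2]                 # hypothetical salience ranking
summary = extract_summary(sentences, ranking, 0.4)  # 40% compression
```

Restoring document order after selection is a common design choice in extraction systems, since reordering sentences by score alone tends to harm readability.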
Ranking Text Spans for Selection

The text-span selection paradigm transforms the problem of summarization, which in the most general case requires the ability to understand, interpret, abstract, and generate a new document, into a different problem: ranking sentences from the original document according to their salience (or likelihood of being part of a summary). This kind of summarization is closely related to the more general problem of information retrieval, where documents from a document set (rather than sentences from a document) are ranked in order to retrieve the most relevant documents.

Ranking text-spans for importance requires defining at least two parameters: (i) the granularity of the text-spans, and (ii) metrics for ranking span salience. While there are several advantages to using paragraphs as the minimal unit of span extraction, we shall conform to the vast majority of previous work and use the sentence as our unit of choice. However, documents to be summarized can be analyzed at varying levels of detail, and in this paper, each sentence can be ranked by considering the following three levels:

Sub-document Level: Different regions in a document often have very different levels of significance for summarization. These sub-documents become especially im-