Applying the Pyramid Method in DUC 2005

Rebecca J. Passonneau, Ani Nenkova, Kathleen McKeown, and Sergey Sigelman
Columbia University Computer Science Department
New York, NY 10027
{becky,ani,kathy,ss1792}@cs.columbia.edu

Abstract

In DUC 2005, the pyramid method for content evaluation was used for the first time in a cross-site evaluation. We discuss the method used in creating pyramid models and performing peer annotation. Analysis of score averages for the peers indicates that the best systems score half as well as humans, and that systems can be grouped into better and worse performers. There were few significant differences among systems. High score correlations between sets from different annotators, and good interannotator agreement, indicate that participants can perform annotation reliably. We found that a modified pyramid score gave good results and would simplify peer annotation in the future.

1 Introduction

Since 2001, the annual Document Understanding Conferences (DUC) have pursued the goal established in a 2000 roadmap to develop and evaluate sophisticated automated techniques for document summarization. However, developing evaluation methods for summarization has been difficult because human summaries vary for many reasons, including the knowledge, biases, goals, and intended audience of the summary writer. The pyramid method for content evaluation (Nenkova and Passonneau, 2004) addresses the variation in content across human summaries of the same source texts.

Designed to handle abstractive summarization, the pyramid method differs from previous evaluation methods primarily in assigning weights to content units, based on a model constructed from multiple human summaries. A new summary is rewarded more for containing information that occurs more often across a sample of human summaries. The research focus is thus to distinguish between more and less relevant information.
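The weighting scheme just described can be made concrete with a minimal sketch, following the scoring scheme of Nenkova and Passonneau (2004): each summary content unit (SCU) is weighted by the number of model summaries that express it, and a peer summary's score is its observed SCU weight divided by the weight of an ideal summary. The representation of SCUs as hashable labels and all function names here are illustrative assumptions, not part of the DUC annotation tool.

```python
from collections import Counter

def pyramid_weights(model_scu_sets):
    """Weight each SCU by how many model summaries express it (illustrative)."""
    counts = Counter()
    for scus in model_scu_sets:
        counts.update(set(scus))  # an SCU counts at most once per summary
    return counts

def ideal_weight(weights, n):
    """Best total weight achievable by a summary containing n SCUs."""
    return sum(sorted(weights.values(), reverse=True)[:n])

def original_score(peer_scus, weights):
    """Observed weight over the ideal weight for the peer's own SCU count."""
    peer = set(peer_scus)
    observed = sum(weights.get(s, 0) for s in peer)
    return observed / ideal_weight(weights, len(peer))

def modified_score(peer_scus, weights, model_scu_sets):
    """Recall-like variant: the denominator uses the average model SCU count."""
    observed = sum(weights.get(s, 0) for s in set(peer_scus))
    avg_n = round(sum(len(set(s)) for s in model_scu_sets) / len(model_scu_sets))
    return observed / ideal_weight(weights, avg_n)
```

For example, with four model summaries expressing SCU sets {a, b, c}, {a, b}, {a, c}, and {a}, SCU a gets weight 4 and b, c each get weight 2; a peer expressing {a, b} attains the ideal weight for its size and scores 1.0 on both variants.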
As in previous work (van Halteren and Teufel, 2003), content is identified on the basis of shared meaning, not shared words or word strings (ngrams); thus this evaluation method leaves systems relatively unconstrained with respect to the way in which content is expressed. Here it is applied to systems which are primarily extractive.

To apply the pyramid method, DUC 2005 relied on manual methods for constructing the pyramid models associated with each document cluster for 20 sets, and for annotating the 25 peer summaries produced by systems, plus two by humans for each set. Columbia University constructed the pyramids, and participants in the evaluation did the peer annotations. Scores for the annotated peers were computed automatically as part of the annotation tool distributed by Columbia.

Our results show that pyramid scores group systems into better and worse performers, based on individual comparisons, although no single system can be identified as best across the different metrics used in DUC05 (original and modified pyramid, responsiveness, and ROUGE scores). Our analyses indicate that peer annotation is reliable on two measures: interannotator agreement and consistency of scores. We also discuss results of a modified pyramid score that is analogous to recall; it correlates highly with the original score, but is easier to produce annotations for. Finally, analysis of the pyramids themselves shows that humans produced summaries in 2005 that had more variation than summaries produced in DUC 2003; we suspect that this is due to increased summary and document length as well as larger cluster size.

2 Pyramids

Twenty document clusters, or topics, were prepared by NIST assessors from TREC documents, following instructions provided at http://www-nlpir.nist.gov/projects/duc/duc2005/tasks.html.
Each cluster was to contain between 25 and 50 documents relevant to a request for information created by the assessor; the average cluster size was 30.4 documents of 720 words each. For each topic, nine summaries of approximately 250 words each were written by humans. Of these, seven were used for each of the twenty pyramids. The remaining two were included in the peer evaluation. Based on previous work, the use of seven summaries