Aggregation of Multiple Judgments for Evaluating Ordered Lists

Hyun Duk Kim, ChengXiang Zhai and Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign,
201 N Goodwin Ave, Urbana, IL 61801, USA
hkim277@illinois.edu, czhai@cs.uiuc.edu, hanj@cs.uiuc.edu

Abstract. Many tasks (e.g., search and summarization) result in an ordered list of items. To evaluate such an ordered list, we need to compare it with an ideal ordered list created by a human expert for the same set of items. To reduce bias, multiple human experts are often used to create multiple ideal ordered lists. An interesting challenge in such an evaluation method is thus how to aggregate these different ideal lists to compute a single score for an ordered list to be evaluated. In this paper, we propose three new methods for aggregating multiple order judgments to evaluate ordered lists: weighted correlation aggregation, rank-based aggregation, and frequent sequential pattern-based aggregation. Experiment results on ordering sentences for text summarization show that all three new methods outperform the state-of-the-art average correlation methods in terms of discriminativeness and robustness against noise. Among the three proposed methods, the frequent sequential pattern-based method performs best due to its flexible modeling of agreements and disagreements among human experts at various levels of granularity.

Key words: Evaluation, Sentence ordering, Judgment aggregation, Frequent sequential pattern mining

1 Introduction

How to aggregate different human evaluators' judgments is a difficult problem in evaluation with multiple human annotations. When we evaluate the performance of a system, we often compare the output of the system with a "gold standard output" created by a human evaluator; the more similar the system output is to the human-created gold standard, the better the performance of the system is judged to be.
Unfortunately, when a task is difficult or inherently subjective to judge (as in the case of many information retrieval problems such as search and summarization), human experts may not agree with each other on the gold standard. Thus using only a single human expert to create the gold standard can be biased, and it becomes necessary to have multiple experts create the gold standard, leading naturally to multiple (gold standard) judgments, each created by a different human expert. The research question we study in this paper is how to aggregate these multiple judgments created by multiple experts to evaluate ordered lists of items.
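To make the aggregation problem concrete, the following is a minimal sketch of the average-correlation baseline that the paper compares against: score a candidate ordering by averaging its Kendall's tau rank correlation against each expert's gold-standard ordering. The function names and the choice of Kendall's tau are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same item set:
    (concordant pairs - discordant pairs) / total pairs."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # Same relative order in both lists -> concordant pair.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(order_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def average_correlation(candidate, gold_orderings):
    """Baseline aggregation: average the candidate's correlation
    with each expert's ideal ordering."""
    return sum(kendall_tau(candidate, g) for g in gold_orderings) / len(gold_orderings)
```

For example, a candidate that exactly matches one expert but exactly reverses another receives taus of +1 and -1, so its averaged score is 0; this averaging treats all experts uniformly, which is precisely the behavior the paper's weighted and pattern-based methods are designed to improve on.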