Multi-document summarization exploiting frequent itemsets

Elena Baralis, Politecnico di Torino, Corso Duca degli Abruzzi, Torino, Italy, elena.baralis@polito.it
Luca Cagliero, Politecnico di Torino, Corso Duca degli Abruzzi, Torino, Italy, luca.cagliero@polito.it
Alessandro Fiori, Politecnico di Torino, Corso Duca degli Abruzzi, Torino, Italy, alessandro.fiori@polito.it
Saima Jabeen, Politecnico di Torino, Corso Duca degli Abruzzi, Torino, Italy, saima.jabeen@polito.it

ABSTRACT

A summary is a succinct and informative description of a data collection. In the context of multi-document summarization, selecting the most relevant and non-redundant sentences from a collection of textual documents is a challenging task. Frequent itemset mining is a well-established data mining technique for discovering correlations among data. Although it has been widely used in transactional data analysis, to the best of our knowledge its use in document summarization has not been investigated so far.

This paper presents a novel multi-document summarizer, named ItemSum (Itemset-based Summarizer), that relies on an itemset-based model, i.e., a model composed of frequent itemsets extracted from the document collection. It automatically selects the most representative and non-redundant sentences to include in the summary by considering both sentence coverage, with respect to a concise and highly informative itemset-based model, and a sentence relevance score based on tf-idf statistics. Experimental results, obtained on the DUC'04 document collection by means of the ROUGE toolkit, show that the proposed approach achieves better performance than a large set of competitors.
Categories and Subject Descriptors

I.5.4 [Pattern Recognition]: Applications - Text Processing; H.2.8 [Information Systems]: Database Management - Database Applications - Data Mining

General Terms

Algorithms

Keywords

Multi-document summarization, Text mining, Frequent itemset mining

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'12 March 25-29, 2012, Riva del Garda, Italy.
Copyright 2011 ACM 978-1-4503-0857-1/12/03 ...$10.00.

1. INTRODUCTION

In recent years, the increasing availability of textual documents in electronic form has prompted the need for efficient and effective data mining approaches suitable for textual data analysis. Summarization is a challenging data mining task that focuses on constructing a succinct and informative description of a data collection. In the context of multi-document summarization, a summary is composed of the most representative sentences belonging to a document collection.

A number of different approaches have been proposed to select the most relevant sentences (e.g., [11, 14, 17]). They commonly evaluate sentences according to cluster-based or graph-based models. For instance, the approach recently proposed in [17] exploits an incremental hierarchical clustering algorithm with the twofold aim of identifying groups of sentences that share the same content and updating summaries over time. In contrast, [11] proposes to represent correlations among sentences by means of a graph-based model. The most relevant sentences are selected according to their eigenvector centrality, computed by means of the well-known PageRank algorithm [4].
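To make the graph-based idea concrete, it can be sketched as follows. This is only a minimal illustration and not the method of [11]: the Jaccard word-overlap similarity used as edge weight, the damping factor of 0.85, the fixed number of iterations, and all function names are assumptions introduced for the example.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank-style power iteration.

    Assumption: the edge weight between two distinct sentences is their
    Jaccard word overlap; any other similarity measure could be used.
    """
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    weights = [[jaccard(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no neighbours spread their score uniformly.
        dangling = sum(scores[i] for i in range(n) if out_sum[i] == 0)
        new_scores = []
        for j in range(n):
            # Score flowing into sentence j from every connected sentence i.
            incoming = sum(scores[i] * weights[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1.0 - damping) / n
                              + damping * (incoming + dangling / n))
        scores = new_scores
    return scores
```

Calling `rank_sentences` on a list of sentences returns one score per sentence; sentences that share vocabulary with many others receive higher scores, while a sentence with no word in common with the rest keeps only the uniform share.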
A parallel research effort has been devoted to formalizing the summarization task as a maximum coverage problem with knapsack constraints based on sentence relevance within each document [14]. However, previous approaches typically focus on the significance of single words and do not effectively capture correlations among multiple words at the same time.

Frequent itemset mining [1] is a widely used exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, to the best of our knowledge the use of frequent itemsets in textual document summarization has not been investigated so far. In recent years, a number of approaches have addressed the discovery and selection of the most informative yet non-redundant set of frequent itemsets mined from transactional data (e.g., [7, 15]). Some of them compared the observed frequency (i.e., the support) of each