Effective and Efficient Itemset Pattern Summarization: Regression-based Approaches

Ruoming Jin, Muad Abu-Ata, Yang Xiang, Ning Ruan
Department of Computer Science, Kent State University
Kent, OH, 44242, USA
{jin,mabuata,yxiang,nruan}@cs.kent.edu

ABSTRACT
In this paper, we propose a set of novel regression-based approaches to effectively and efficiently summarize frequent itemset patterns. Specifically, we show that the problem of minimizing the restoration error for a set of itemsets based on a probabilistic model corresponds to a non-linear regression problem. We show that under certain conditions, we can transform the non-linear regression problem into a linear regression problem. We propose two new methods, k-regression and tree-regression, to partition the entire collection of frequent itemsets in order to minimize the restoration error. The k-regression approach, employing a K-means-type clustering method, guarantees that the total restoration error achieves a local minimum. The tree-regression approach employs a decision-tree-type top-down partition process. In addition, we discuss alternatives for estimating the frequencies of the itemsets covered by the k representative itemsets. The experimental evaluation on both real and synthetic datasets demonstrates that our approaches significantly improve summarization performance in terms of both accuracy (restoration error) and computational cost.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining
General Terms: Algorithms, Performance
Keywords: frequency restoration, pattern summarization, regression

1. INTRODUCTION
Since its introduction in [3], frequent pattern mining has received a great deal of attention and quickly evolved into a major research subject in data mining. The tools offered by frequent pattern mining research span a variety of data types, including itemsets, sequences, trees, and graphs [25, 4, 31, 5].
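To make the abstract's k-regression idea concrete, the following is a minimal, hypothetical sketch (ours, not the paper's exact algorithm): each itemset is encoded as a 0/1 indicator vector over items, its target is the log of its frequency (so an independence-style probabilistic model becomes linear in log space, mirroring the linearization the abstract mentions), and the procedure alternates, K-means style, between assigning each itemset to the linear model that restores its log-frequency best and refitting each model on its own cluster.

```python
import numpy as np

# Hypothetical toy version of a k-regression-style partition; all names
# and modeling choices here are our illustrative assumptions.
def k_regression(X, y, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(y))      # random initial partition
    coefs = np.zeros((k, X.shape[1]))
    for _ in range(iters):
        for j in range(k):                        # refit model j on its cluster
            idx = assign == j
            if idx.any():
                coefs[j], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid = X @ coefs.T - y[:, None]          # (n, k) per-model residuals
        assign = (resid ** 2).argmin(axis=1)      # reassign to the best model
    return assign, coefs

# Toy usage: 6 itemsets over 3 items, with observed frequencies.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1],
              [1, 1, 1], [1, 0, 0], [0, 0, 1]], dtype=float)
y = np.log(np.array([0.10, 0.08, 0.06, 0.04, 0.12, 0.07]))
assign, coefs = k_regression(X, y, k=2)
```

Because each sweep only reassigns an itemset when that lowers its squared residual, the total restoration error is non-increasing, which is the K-means-style local-minimum guarantee the abstract refers to.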
Researchers from many scientific disciplines and business domains have demonstrated the benefits of frequent pattern analysis: insight into their data and knowledge of hidden mechanisms [10]. At the same time, frequent pattern mining serves as a basic tool for many other data mining tasks, including association rule mining, classification, clustering, and change detection [14, 32, 13, 15]. Recently, standard frequent mining tools, like Apriori, have been incorporated into several commercial database systems [18, 30, 23].

The growing popularity of frequent pattern mining, however, does not exempt it from criticism. One of the major issues facing frequent pattern mining is that it can (and often does) produce an unwieldy number of patterns. So-called complete frequent pattern mining algorithms try to identify all the patterns that occur more frequently than a minimal support threshold (θ) in the desired datasets. A typical complete pattern mining tool can easily discover tens of thousands, if not millions, of frequent patterns. Clearly, it is impossible for scientists or any domain experts to manually go over such a large collection of patterns. In some sense, the frequent patterns themselves are becoming the "data" that needs to be mined. Indeed, reducing the number of frequent patterns has been a major theme in frequent pattern mining research. Much of the research has been on itemsets; itemset data can be generalized to many other pattern types.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'08, August 24–27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM 978-1-60558-193-4/08/08 ...$5.00.
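The pattern-explosion problem that complete mining causes can be seen even on toy data. The following is a brute-force sketch of "complete" frequent itemset mining with a minimal support threshold θ (function and variable names are ours); real tools such as Apriori prune the candidate space, but the output-size problem is the same.

```python
from itertools import combinations

# Brute-force complete frequent itemset mining: enumerate every candidate
# itemset and keep those whose support meets the threshold theta.
def frequent_itemsets(transactions, theta):
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= theta:
                frequent[cand] = support
    return frequent

# Even a 4-transaction toy dataset yields 6 frequent itemsets at theta = 0.5;
# on realistic data with a low theta, the count explodes combinatorially.
tx = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
patterns = frequent_itemsets(tx, theta=0.5)
```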
One general approach has been to mine only patterns that satisfy certain constraints; well-known examples include mining maximal frequent patterns [21], closed frequent patterns [19], and non-derivable itemsets [8]. The last two methods are generally referred to as lossless compression, since we can fully recover the exact frequency of any frequent itemset. The first is lossy compression, since we cannot recover the exact frequencies. Recently, Xin et al. [27] generalized closed frequent itemsets to discover a group of frequent itemsets which δ-cover the entire collection of frequent itemsets. If one itemset is a subset of another itemset and its frequency is very close to the frequency of the latter superset, i.e., within a small fraction (δ), then the first one is referred to as being δ-covered by the latter one. However, the patterns produced by all these methods are still too numerous to be very useful. Even the δ-cover method easily generates thousands of itemset patterns. At the same time, methods like top-k frequent patterns [11], top-k redundancy-aware patterns [26], and error-tolerant patterns [29] try to rank the importance of individual patterns, or revise the frequency concept to reduce the number of frequent patterns. However, these methods generally do not provide a good representation of the collection of frequent patterns.

This leads to the central topic of this paper: what are good criteria to concisely represent a large collection of frequent itemsets, and how can one find the optimal representations efficiently? Recently, several approaches have been proposed to tackle this issue [2, 28, 24]. Two key criteria employed for evaluating the concise representation of itemsets are the coverage criterion and the frequency criterion. Generally speaking, the coverage criterion assumes the concise representation is composed of a small number of itemsets, with the entire collection of frequent itemsets
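The δ-cover relation described above admits a one-line test. The following is a hedged sketch (the function name and the exact inequality are our reading of the definition, assuming "within a small fraction δ" means freq(superset) ≥ (1 − δ) · freq(subset)):

```python
# Sketch of the delta-cover test; names and the inequality form are our
# illustrative assumptions based on the definition in the text.
def delta_covered(sub, sup, freq, delta):
    sub, sup = frozenset(sub), frozenset(sup)
    return sub < sup and freq[sup] >= (1 - delta) * freq[sub]

# Toy frequencies: {a} occurs in 50% of transactions, {a, b} in 48%.
freq = {frozenset({'a'}): 0.50, frozenset({'a', 'b'}): 0.48}
```

With these numbers, {a} is δ-covered by {a, b} for δ = 0.05 (since 0.48 ≥ 0.95 · 0.50) but not for δ = 0.01, which illustrates why larger δ values yield smaller covering sets.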