Parallel and Distributed Frequent Itemset Mining on Dynamic Datasets

Adriano Veloso, Matthew Erick Otey, Srinivasan Parthasarathy, and Wagner Meira Jr.

Computer Science Department, Universidade Federal de Minas Gerais, Brazil
{adrianov,meira}@dcc.ufmg.br

Department of Computer and Information Science, The Ohio-State University, USA
{otey,srini}@cis.ohio-state.edu

Abstract

Traditional methods for data mining typically assume that data is centralized and static. This assumption is no longer tenable. Such methods waste computational and I/O resources when data is dynamic, and they impose excessive communication overhead when data is distributed. As a result, the knowledge discovery process is harmed by slow response times. Efficient implementation of incremental data mining ideas in distributed computing environments is thus becoming crucial for ensuring scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper we address this issue in the context of frequent itemset mining, an important data mining task. Frequent itemsets are most often used to generate correlations and association rules, but more recently they have also been used in such far-reaching domains as bio-informatics and e-commerce applications. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates, without examining the entire dataset. We then show how to parallelize the incremental algorithm so that it can asynchronously mine frequent itemsets. Further, we propose a distributed algorithm which imposes low communication overhead for mining distributed datasets. Several experiments confirm that our algorithm results in excellent execution time improvements.
This work was done while the first author was visiting the Ohio-State University.

1 Introduction

The field of knowledge discovery and data mining (KDD), spurred by advances in data collection technology, is concerned with the process of deriving interesting and useful patterns from large datasets. Frequent itemset mining is a core data mining task. Its statement is very simple: find all sets of items that frequently occur together in database transactions. Although the task is simple to state, it is CPU and I/O intensive, mostly because of the large number of itemsets that are typically generated and the large size of the datasets involved in the process.

Now consider the problem of mining frequent itemsets on a dynamic dataset, like those found in e-commerce and web-based domains. The datasets in such domains are constantly updated with fresh data. Let us assume that at some point in time we have computed all frequent itemsets for such a dataset. Now, if the dataset is updated, then