A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, pp. 625–633, 2005. © Springer-Verlag Berlin Heidelberg 2005

Improvements in the Data Partitioning Approach for Frequent Itemsets Mining

Son N. Nguyen and Maria E. Orlowska

School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Australia
{nnson, maria}@itee.uq.edu.au

Abstract. Frequent itemsets mining is well explored for various data types, and its computational complexity is well understood. There are methods that deal effectively with the computational problems. This paper presents another approach to further enhancing the performance of frequent itemsets computation. We have made a series of observations that led us to devise data pre-processing methods such that the final step of the Partition algorithm, where a combination of all local candidate sets must be processed, is executed on substantially smaller input data. The paper reports results from several experiments that confirm our general and formally presented observations.

Keywords: Association rules, Frequent itemset, Partition, Performance.

1 Introduction

Since the introduction of association rules mining by Agrawal et al. [5], many algorithms and subsequent improvements have been proposed to solve association rules mining problems, especially frequent itemsets mining.

In this paper, we review the state of the art in association rules mining with a focus on frequent itemsets mining. There are many well-accepted approaches, such as “Apriori” by Agrawal et al. [1], ECLAT by Zaki [7], and more recently “FP-growth” by Han et al. [8]. Another interesting class of solutions is based on the data partitioning approach. This fundamental concept was originally proposed as the Partition algorithm by Savasere et al. [2], and it was later improved in AS-CPA by Lin et al. [4] and ARMOR by Pudi et al. [11].
A common feature of these results is their target, namely limiting I/O operations by considering data subsets whose size is dictated by the main memory size.

An intriguing question is whether we could improve the overall performance of mining large data sets by a smarter but not too ‘expensive’ design of the data fragments, rather than determining them by a sequential allocation of transactions based on the fragment size only.

The main goal of this paper is to present our observations, generalize them, and specify the corresponding data pre-processing for the Partitioning approach in order to improve its performance. Our study is supported by a series of experiments which indicate a dramatic improvement in the performance of the Partitioning approach with our fragmentation method, in contrast to the traditional one [2].
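To make the discussion concrete, the following is a minimal sketch of the traditional two-phase Partition scheme with sequential, size-based fragmentation, which is the baseline referred to above. It is an illustration under stated assumptions, not the implementation of [2] and not the fragmentation method proposed in this paper: the function names (`local_frequent_itemsets`, `partition_mine`) and parameters are hypothetical, and a brute-force subset enumeration stands in for whatever local miner a real system would use.

```python
from itertools import combinations
from math import ceil


def local_frequent_itemsets(fragment, min_support):
    """Brute-force miner for one in-memory fragment (illustration only)."""
    threshold = ceil(min_support * len(fragment))
    counts = {}
    for transaction in fragment:
        items = sorted(set(transaction))
        # Enumerate every non-empty subset of the transaction.
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {itemset for itemset, count in counts.items() if count >= threshold}


def partition_mine(transactions, min_support, fragment_size):
    """Two-phase Partition scheme with sequential, size-based fragments.

    Returns the globally frequent itemsets with their counts, and the
    number of global candidates examined in the final step.
    """
    # Phase 1: allocate transactions to fragments in file order and
    # collect the union of all locally frequent itemsets.
    fragments = [
        transactions[start:start + fragment_size]
        for start in range(0, len(transactions), fragment_size)
    ]
    candidates = set()
    for fragment in fragments:
        candidates |= local_frequent_itemsets(fragment, min_support)

    # Phase 2 (the final step): count every candidate over the whole data set.
    threshold = ceil(min_support * len(transactions))
    counts = dict.fromkeys(candidates, 0)
    for transaction in transactions:
        items = set(transaction)
        for itemset in candidates:
            if items.issuperset(itemset):
                counts[itemset] += 1
    frequent = {i: c for i, c in counts.items() if c >= threshold}
    return frequent, len(candidates)


if __name__ == "__main__":
    data = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
    frequent, n_candidates = partition_mine(data, min_support=0.5, fragment_size=2)
    print(sorted(frequent.items()), n_candidates)
```

In this sketch, the size of the candidate set returned by phase 1 is what the final step must process, so a fragmentation that yields fewer locally frequent but globally infrequent itemsets would directly reduce that cost.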