A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, pp. 625 – 633, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Improvements in the Data Partitioning Approach for
Frequent Itemsets Mining
Son N. Nguyen and Maria E. Orlowska
School of Information Technology and Electrical Engineering,
The University of Queensland, QLD 4072, Australia
{nnson, maria}@itee.uq.edu.au
Abstract. Frequent itemsets mining is well explored for various data types, and
its computational complexity is well understood. There are methods that deal
effectively with the computational problems. This paper presents another approach
to further enhancing the performance of frequent itemsets computation.
We have made a series of observations that led us to devise data pre-processing
methods such that the final step of the Partition algorithm, where a
combination of all local candidate sets must be processed, is executed on
substantially smaller input data. The paper reports results from several experiments
that confirm our general and formally presented observations.
Keywords: Association rules, Frequent itemset, Partition, Performance.
1 Introduction
Since the introduction of association rules mining by Agrawal et al. [5], many
algorithms and subsequent improvements have been proposed to solve association
rules mining problems, especially frequent itemsets mining.
In this paper, we review the state of the art in association rules mining with a focus
on frequent itemsets mining. There are many well-accepted approaches such as
“Apriori” by Agrawal et al. [1], ECLAT by Zaki [7], and more recently “FP-growth” by
Han et al. [8]. Another interesting class of solutions is based on the data partitioning
approach. This fundamental concept was originally proposed as the Partition algorithm
by Savasere et al. [2], and it was later improved in AS-CPA by Lin et al. [4] and
ARMOR by Pudi et al. [11]. A common feature of these results is their target, namely
limiting I/O operations by considering data subsets whose size is dictated by the main
memory size.
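The two-phase scheme underlying the Partition approach [2] can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of [2] or of this paper: the names local_frequent and partition_mine are hypothetical, local mining is done by brute-force enumeration rather than by the method used in [2], and fragments are formed by plain sequential allocation based on fragment size only.

```python
from itertools import combinations
from math import ceil


def local_frequent(fragment, min_ratio):
    """Return the itemsets frequent within one fragment.

    Assumption: brute-force enumeration of all subsets of each
    transaction, which is only practical for tiny illustrative data.
    """
    threshold = ceil(min_ratio * len(fragment))
    counts = {}
    for transaction in fragment:
        items = sorted(transaction)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {itemset for itemset, c in counts.items() if c >= threshold}


def partition_mine(transactions, min_ratio, fragment_size):
    """Two-phase partition-style mining (hypothetical helper).

    min_ratio is the relative minimum support, applied locally in
    phase 1 and globally in phase 2.
    """
    db = [frozenset(t) for t in transactions]

    # Phase 1: sequential allocation into fragments of bounded size,
    # then collect the union of all locally frequent itemsets.
    candidates = set()
    for start in range(0, len(db), fragment_size):
        fragment = db[start:start + fragment_size]
        candidates |= local_frequent(fragment, min_ratio)

    # Phase 2: count every candidate against the whole database and
    # keep those that reach the global support threshold.
    threshold = ceil(min_ratio * len(db))
    result = {}
    for candidate in candidates:
        support = sum(1 for t in db if candidate <= t)
        if support >= threshold:
            result[candidate] = support
    return result
```

Any globally frequent itemset must be locally frequent in at least one fragment, so the union built in phase 1 is a superset of the final answer. The cost of phase 2 grows with the size of that union, which is the quantity that a different choice of fragments can influence.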
An intriguing question is whether we could improve the overall performance of
mining large data sets by a smarter but not too ‘expensive’ design of the data
fragments, rather than determining them by a sequential transaction allocation based
on the fragment size only.
The main goal of this paper is to present our observations, generalize them, and
specify the corresponding data pre-processing for the Partitioning approach in order to
improve its performance. Our study is supported by a series of experiments which
indicate a dramatic improvement in the performance of the Partitioning approach with
our fragmentation method, in contrast to the traditional one [2].