ParallelCharMax: An Effective Maximal Frequent
Itemset Mining Algorithm Based on MapReduce
Framework
Rania Mkhinini Gahar 1, Olfa Arfaoui 2, Minyar Sassi Hidri 3, Nejib Ben Hadj-Alouane 4
1,2,3,4 University of Tunis El Manar
1,2,3,4 National Engineering School of Tunis
3 Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
1,2,3,4 BP. 37, Le Belvèdère 1002, Tunis, Tunisia
Abstract—Nowadays, the explosive growth in data collection in
business and scientific areas has created a need to analyze
and mine the useful knowledge residing in these data. Recourse
to data mining techniques seems inescapable in order to
extract useful and novel patterns/models from large datasets. In
this context, frequent itemsets (patterns) play an essential role
in many data mining tasks that try to find interesting patterns
from datasets. However, conventional approaches for mining
frequent itemsets in the Big Data era encounter significant challenges
when computing power and memory space are limited. This
paper proposes an efficient distributed frequent itemset mining
algorithm, called ParallelCharMax, that is based on a powerful
sequential algorithm, called Charm, and computes the maximal
frequent itemsets that are considered perfect summaries of the
frequent ones. The proposed algorithm has been implemented
using the MapReduce framework. The experimental component of
the study shows the efficiency and the performance of the
proposed algorithm compared with well-known algorithms such
as MineWithRounds and HMBA.
Keywords—Frequent Itemset Mining, Parallel Mining Algorithm,
MapReduce, Charm.
I. INTRODUCTION
Data Mining and Knowledge Discovery in Datasets (KDD)
is a new interdisciplinary field at the intersection
of statistics, machine learning, databases, and parallel and
distributed computing. It has been driven by the rapid
growth of data in all spheres of human endeavor, and by the
economic and scientific need to extract useful information from
the collected data. The key challenge in data mining is the
extraction of knowledge from massive datasets.
Data mining refers to the overall process of discovering new
patterns or building models from a given dataset. There are
many steps involved in the KDD process which include data
selection, data cleaning and preprocessing, data transformation
and reduction, data-mining task and algorithm selection, and
finally post-processing and interpretation of discovered knowl-
edge [2], [3]. This KDD process tends to be highly iterative
and interactive.
In this context, pattern extraction is one of the most
important techniques of data mining. It has stimulated a great effort
that led to a variety of proposed algorithms over the last
two decades. Several research efforts aim more specifically at
extracting the maximal frequent patterns, because these can be
considered perfect summaries of the frequent sets and can be
far less numerous than the frequent closed patterns. Moreover,
it remains a topical issue today as new challenges arise,
particularly with the emergence of Big Data and the development of
data science.
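To make the notion of a maximal frequent itemset concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm proposed in this paper. The toy transactions, the absolute support threshold, and the function names `frequent_itemsets` and `maximal_itemsets` are all assumptions introduced for illustration. An itemset is taken to be frequent when it is contained in at least `min_support` transactions, and maximal when no proper superset of it is frequent.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: return {itemset: support} for every itemset
    contained in at least `min_support` transactions (illustrative only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: if no itemset of this size is
            # frequent, no larger itemset can be frequent either.
            break
    return result


def maximal_itemsets(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical toy dataset: five transactions over the items a, b, c, d.
transactions = [
    frozenset("abc"),
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ad"),
    frozenset("cd"),
]

frequent = frequent_itemsets(transactions, min_support=2)
maximal = maximal_itemsets(frequent)
print(len(frequent), "frequent itemsets,", len(maximal), "maximal")
```

On this toy input, every frequent itemset is a subset of one of the maximal ones, which is the sense in which the maximal itemsets summarize the frequent ones. The supports of the subsets are not recoverable from the maximal itemsets alone.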
While data mining has its roots in the traditional fields
of machine learning and statistics, the huge volume of data
today poses the most serious problem. For example, many
companies already have data warehouses in the terabyte range
(e.g., FedEx, UPS, Walmart). Similarly, scientific data is
reaching gigantic proportions (e.g., NASA space missions,
Human Genome Project). Traditional methods typically made
the assumption that the data is memory resident. This as-
sumption is no longer tenable. Implementation of data mining
ideas in high-performance parallel and distributed computing
environments is thus becoming crucial for ensuring system
scalability and interactivity as data continues to grow in size
and complexity [4].
Parallel and distributed computing is expected to relieve
current mining methods from the sequential bottleneck, pro-
viding the ability to scale to massive datasets, and improving
the response time. Achieving good performance on today’s
multiprocessor systems is a non-trivial task. The main chal-
lenges include synchronization and communication minimiza-
tion, work-load balancing, finding good data layout and data
decomposition, and disk Input/Output minimization, which is
especially important for data mining.
To address these challenges, we propose
a new parallel algorithm for discovering maximal frequent
itemsets based on the MapReduce framework.
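As a rough illustration of the MapReduce programming model only, and not of ParallelCharMax itself, the sketch below counts the support of single items over a partitioned dataset in plain Python. The partitions, the threshold, and the names `map_phase` and `reduce_phase` are hypothetical. A real deployment would run the mappers on separate nodes of a cluster rather than in a list comprehension.

```python
from collections import Counter
from functools import reduce


def map_phase(partition):
    """Mapper: count, within one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # each item counted once per transaction
    return counts


def reduce_phase(left, right):
    """Reducer: merge two partial count tables by summing per-item counts."""
    return left + right


# Hypothetical dataset split into two partitions (one per worker).
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "d"}, {"c", "d"}, {"a", "b", "c"}],
]

# On a cluster the mappers would run in parallel; here they run sequentially.
local_counts = [map_phase(p) for p in partitions]
global_counts = reduce(reduce_phase, local_counts, Counter())

min_support = 3  # assumed absolute support threshold
frequent_items = {item for item, n in global_counts.items() if n >= min_support}
print(sorted(frequent_items))
```

The point of the split is that each mapper only needs its own partition in memory, while the reducer combines small count tables instead of raw transactions.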
The rest of the paper is organized as follows. Section 2
presents the basic frequent itemset and association rule mining
problems. Section 3 describes related work, namely the main
parallel and distributed techniques used to solve these problems, and
gives a comprehensive survey of the most influential algorithms
that were proposed during the last decade. Section 4 focuses
on a powerful sequential algorithm for searching maximal
frequent itemsets on which we will base in the fourth section in
2017 IEEE/ACS 14th International Conference on Computer Systems and Applications
2161-5330/17 $31.00 © 2017 IEEE
DOI 10.1109/AICCSA.2017.80