ParallelCharMax: An Effective Maximal Frequent Itemset Mining Algorithm Based on MapReduce Framework

Rania Mkhinini Gahar 1, Olfa Arfaoui 2, Minyar Sassi Hidri 3, Nejib Ben Hadj-Alouane 4
1,2,3,4 University of Tunis El Manar
1,2,3,4 National Engineering School of Tunis
3 Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
1,2,3,4 BP. 37, Le Belvèdère 1002, Tunis, Tunisia

Abstract: Nowadays, the explosive growth of data collection in business and scientific areas has created the need to analyze and mine the useful knowledge residing in these data. Data mining techniques have become indispensable for extracting useful and novel patterns and models from large datasets. In this context, frequent itemsets (patterns) play an essential role in many data mining tasks that try to find interesting patterns in datasets. However, conventional approaches for mining frequent itemsets encounter significant challenges in the Big Data era when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm, called ParallelCharMax, which is based on a powerful sequential algorithm, called Charm, and computes the maximal frequent itemsets, which are considered perfect summaries of the frequent ones. The proposed algorithm has been implemented using the MapReduce framework. The experimental study shows the efficiency and performance of the proposed algorithm compared with well-known algorithms such as MineWithRounds and HMBA.

Keywords: Frequent Itemset Mining, Parallel Mining Algorithm, MapReduce, Charm.

I. INTRODUCTION

Data Mining and Knowledge Discovery in Datasets (KDD) is a new interdisciplinary field at the intersection of statistics, machine learning, databases, and parallel and distributed computing.
It has been driven by the rapid growth of data in all spheres of human activity, and by the economic and scientific need to extract useful information from the collected data. The key challenge in data mining is the extraction of knowledge from massive datasets.

Data mining refers to the overall process of discovering new patterns or building models from a given dataset. The KDD process involves many steps: data selection, data cleaning and preprocessing, data transformation and reduction, data mining task and algorithm selection, and finally post-processing and interpretation of the discovered knowledge [2], [3]. This process tends to be highly iterative and interactive.

In this context, pattern extraction is one of the most important data mining techniques. It has stimulated a great deal of effort, leading to a variety of algorithms proposed over the last two decades. Several research efforts focus more specifically on extracting maximal frequent patterns, because these can be considered perfect summaries of the frequent itemsets and can be far less numerous than the frequent closed patterns. Moreover, pattern extraction remains a topical issue today as new challenges arise, particularly with the emergence of Big Data and the development of data science.

While data mining has its roots in the traditional fields of machine learning and statistics, the sheer volume of data today poses the most serious problem. For example, many companies already have data warehouses in the terabyte range (e.g., FedEx, UPS, Walmart). Similarly, scientific data is reaching gigantic proportions (e.g., NASA space missions, Human Genome Project). Traditional methods typically assumed that the data is memory resident. This assumption is no longer tenable.
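To make the earlier remark about maximal frequent patterns concrete, the following minimal sketch enumerates the frequent, closed, and maximal itemsets of a tiny dataset by brute force. It is not the Charm or ParallelCharMax algorithm; the transaction database, the support threshold, and all function names are assumptions made up for this illustration.

```python
from itertools import combinations

# Hypothetical toy transaction database (not taken from the paper).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
MIN_SUPPORT = 2  # absolute support threshold, chosen for illustration


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: map each frequent itemset to its support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
    return result


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s for s, c in frequent.items()
        if not any(s < t and c == frequent[t] for t in frequent)
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < t for t in frequent)}


FREQUENT = frequent_itemsets(TRANSACTIONS, MIN_SUPPORT)
CLOSED = closed_itemsets(FREQUENT)
MAXIMAL = maximal_itemsets(FREQUENT)
```

On this toy input, all 7 non-empty subsets of {a, b, c} are frequent and all 7 are closed, while {a, b, c} is the only maximal frequent itemset. Every frequent itemset is a subset of some maximal one, which is the sense in which the maximal itemsets summarize the frequent ones. Brute-force enumeration is exponential in the number of items and is used here only to keep the definitions visible.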
Implementing data mining ideas in high-performance parallel and distributed computing environments is thus becoming crucial for ensuring system scalability and interactivity as data continues to grow in size and complexity [4].

Parallel and distributed computing is expected to relieve current mining methods of the sequential bottleneck, providing the ability to scale to massive datasets and improving response time. Achieving good performance on today's multiprocessor systems is a non-trivial task. The main challenges include minimizing synchronization and communication, balancing the workload, finding a good data layout and data decomposition, and minimizing disk input/output, which is especially important for data mining.

To address these challenges, we propose a new parallel algorithm for discovering maximal frequent itemsets based on the MapReduce framework.

The rest of the paper is organized as follows. Section 2 presents the basic frequent itemset and association rule mining problems. Section 3 describes related work: the main parallel and distributed techniques used to solve these problems, together with a comprehensive survey of the most influential algorithms proposed during the last decade. Section 4 focuses on a powerful sequential algorithm for searching maximal frequent itemsets on which we will base in the fourth section in

2017 IEEE/ACS 14th International Conference on Computer Systems and Applications. 2161-5330/17 $31.00 © 2017 IEEE. DOI 10.1109/AICCSA.2017.80