Citation: Al-Bana, M.R.; Farhan, M.S.; Othman, N.A. An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data. Data 2022, 7, 11. https://doi.org/10.3390/data7010011

Academic Editor: Giuseppe Ciaburro

Received: 28 November 2021
Accepted: 7 January 2022
Published: 14 January 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
An Efficient Spark-Based Hybrid Frequent Itemset Mining
Algorithm for Big Data
Mohamed Reda Al-Bana 1,*, Marwa Salah Farhan 1,2,* and Nermin Abdelhakim Othman 1,2

1 Department of Information Systems, Faculty of Computers and Artificial Intelligence, Helwan University, Cairo 11795, Egypt; drnermin@fci.helwan.edu.eg or Nermin.Othman@bue.edu.eg
2 Faculty of Informatics and Computer Science, British University in Egypt, Cairo 11837, Egypt
* Correspondence: Mohammed.Bana@fci.helwan.edu.eg (M.R.A.-B.); Marwa.salah@fci.helwan.edu.eg (M.S.F.)
Abstract: Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns in transactional databases, with applications in prediction, association rule mining, classification, and more. Apriori is an elementary, iterative FIM algorithm for finding frequent itemsets. It scans the dataset multiple times to generate frequent itemsets of different cardinalities, so its performance degrades as the data grow larger. Eclat is a scalable variant of Apriori that utilizes a vertical layout. The vertical layout has several advantages: it avoids repeated dataset scans, and it carries the information needed to compute the support of each itemset. In a vertical layout, itemset support is obtained by intersecting sets of transaction ids (tidsets) and pruning irrelevant itemsets. However, when tidsets grow too large for memory, algorithm efficiency suffers. In this paper, we introduce SHFIM (Spark-based hybrid frequent itemset mining), a three-phase algorithm that utilizes both the horizontal and vertical layouts and uses diffsets instead of tidsets, keeping track of the differences between transaction-id sets rather than their intersections. Moreover, improvements are introduced to reduce the number of candidate itemsets. SHFIM is implemented and evaluated on the Spark framework, which exploits resilient distributed datasets (RDDs) and in-memory processing to overcome the shortcomings of the MapReduce framework. We compared the performance of SHFIM with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results show that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
Keywords: big data; frequent pattern mining; horizontal layout; vertical layout; diffset; Spark
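The tidset and diffset bookkeeping described in the abstract can be sketched with a small toy example (illustrative only, not the paper's Spark implementation; the transactions and item names are hypothetical). It shows support computed by tidset intersection, and the equivalent diffset-based computation used by dEclat-style algorithms, where diffset(PX) = tidset(P) − tidset(PX) and support(PX) = support(P) − |diffset(PX)|:

```python
# Horizontal layout: one transaction per id.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"b", "c"},
}

# Vertical layout: item -> tidset (ids of the transactions containing it).
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Support of itemset {a, b} via tidset intersection (Eclat-style).
tids_ab = tidsets["a"] & tidsets["b"]
support_ab = len(tids_ab)  # transactions 1 and 2 contain both items

# Diffset alternative (dEclat-style): store only the ids lost when
# extending {a} to {a, b}; diffsets stay small on dense data.
diff_ab = tidsets["a"] - tidsets["b"]
support_ab_diff = len(tidsets["a"]) - len(diff_ab)
```

Both computations yield the same support; the diffset form only stores the transaction ids that drop out when an itemset is extended, which is the memory saving SHFIM relies on.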
1. Introduction
We are currently living in the big data age. Data appear everywhere in a variety of
formats and types ranging from structured to unstructured data and are produced by a huge
number of sources across a wide range of disciplines and types, including transactional
systems, user interactions, social networks, the Internet of Things, the World Wide Web,
and many others. Companies and individuals gather and store all these generated data
to analyze them for insight, knowledge, and decision making. Therefore, we have been
swamped with big data, not just because we already have large amounts of data that need
to be processed but also because the amount of data is rapidly growing every moment. The
concept of big data has some properties that are collectively known as the “3Vs” model.
Volume is defined as the amount of data; enormous amounts of data are generated and
gathered. Velocity refers to the high rate at which data are created, gathered, and processed
(streams, batch, near-real-time, and real-time). Variety indicates the different types of data:
audio, images/video, and text; conventional structured data; and mixed data. In addition, two more characteristics extend the "3Vs" into the "5Vs" model. Veracity refers to how accurate and trustworthy the data are when they come from various