I.J. Intelligent Systems and Applications, 2017, 1, 75-84
Published Online January 2017 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2017.01.08
Copyright © 2017 MECS
High Performance Computation of Big Data:
Performance Optimization Approach towards a
Parallel Frequent Item Set Mining Algorithm for
Transaction Data based on Hadoop MapReduce
Framework
Guru Prasad M S
SDMIT/CSE, Ujire, 577240, India
E-mail: guru0927@gmail.com
Nagesh H R and Swathi Prabhu
MITE/CSE, Moodbidri, 574227, India
SMVITM/CSE, Udupi, 576115, India
E-mail: nageshhrcs@reddifmail.com, prabhuswathi2@gmail.com
Abstract—Huge amounts of Big Data arrive constantly with the rapid development of business organizations, which are interested in extracting useful knowledge from the data they collect. Frequent item set mining of Big Data supports business decisions and helps provide high-quality service. Running a traditional frequent item set mining algorithm on Big Data is not effective and leads to high computation time. Apache Hadoop MapReduce is the most popular data-intensive distributed computing framework for large scale data applications such as data mining. In this paper, the author identifies the factors affecting the performance of frequent item set mining algorithms based on Hadoop MapReduce technology and proposes an approach for optimizing the performance of large scale frequent item set mining. Experimental results show the potential of the proposed approach: performance is significantly improved for large scale data mining with the MapReduce technique. The author believes this work is a valuable contribution to the high performance computing of Big Data.
Index Terms—Big Data, Hadoop, MapReduce, Hadoop
Distributed File System (HDFS), Apriori MapReduce,
FP-growth MapReduce.
I. INTRODUCTION
We live in the Big Data era. Big Data is a broad term that describes massive volumes of structured, semi-structured and unstructured data. Due to the advent of new technologies, the digital world of data has expanded to around 10 zettabytes (1 zettabyte = 10^21 bytes). Huge amounts of data are generated from social networking sites, e-commerce, online banking, weather stations, market transactions, etc.
Big Data is mainly characterized by the 3 V's: extreme volume, extreme variety and extreme velocity. Volume can grow beyond zettabytes; velocity is the speed at which data is generated; and variety reflects the many forms the data can take. Big Data is critical to business enterprises, and it is emerging as one of the most important technologies in the modern world. Many business enterprises accumulate large quantities of data from customer transactions, handling more than one billion customer transactions every day. For example, eBay holds 50 petabytes of data and captures 50 terabytes more every day; US retailers hold around 500 petabytes of data; and Amazon, the world's biggest retail store, has data on billions of active customers.
Huge amounts of data are continuously collected and stored in data warehouses, and business organizations are now interested in extracting useful knowledge from the stored data. The information contained in a transaction database is large, so it is very difficult to understand and to extract useful knowledge from such a huge dataset. To solve this problem, the technique called frequent item set mining is used. This technique finds sets of items that are frequently purchased together. It is useful for extracting hidden predictive information from large data sets, and it is a powerful technology with great potential to help organizations focus on the most important information in their data warehouses. Frequent item set mining predicts future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Apriori and FP-growth are the most famous algorithms for discovering frequent patterns in large data sets. However, existing data mining tools based on the sequential Apriori and FP-growth algorithms are not efficient enough to mine huge transaction data.
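The support-counting idea underlying these algorithms can be sketched as follows. This is a minimal single-machine illustration of one Apriori-style pass over candidate 2-item sets, not the paper's implementation; the transaction data and the support threshold are made-up examples.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (each is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

min_support = 2  # an item set is "frequent" if it occurs in at least 2 transactions

# Count the support of every 2-item combination (the core counting step
# that Apriori repeats for candidate sets of growing size).
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only the pairs meeting the minimum support threshold.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # each of the three pairs occurs in 2 of the 4 transactions
```

Run sequentially, this counting pass must scan the entire database once per candidate size, which is exactly the cost that becomes prohibitive on huge transaction data and motivates distributing the scan with MapReduce.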
We would therefore require a robust distributed computing infrastructure that can store, manage and process huge amounts of data in a short time. It can protect data security