Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 2, Issue. 4, April 2013, pg.513 – 516
RESEARCH ARTICLE
© 2013, IJCSMC All Rights Reserved 513
IMPLEMENTATION OF PARALLEL APRIORI
ALGORITHM ON HADOOP CLUSTER
A. Ezhilvathani¹, Dr. K. Raja²
¹ P.G Student, M.E CSE, Alpha College of Engg, Chennai, India
² Dean (Academics), Alpha College of Engg, Chennai, India
Abstract— Owing to the rapid growth of data in organizations, large-scale data processing has become a focal
point of information technology. To keep pace with advances in data collection and storage technologies,
the design and implementation of large-scale parallel algorithms for data mining is attracting growing interest.
In data mining, association rule learning is a popular and well-researched method for discovering interesting
relations between variables in large databases. This paper aims to extract frequent patterns among sets of
items in transaction databases or other repositories. The Apriori algorithm is highly influential for finding
frequent item sets using candidate generation. The Apache Hadoop software framework is used to build the
cluster. It is based on the MapReduce programming model, which improves the processing of large-scale data
on a high-performance cluster by processing vast amounts of data in parallel across a large cluster of
computer nodes. Hadoop provides reliable, scalable, distributed computing.
Key Terms: Hadoop; MapReduce; Apriori
I. INTRODUCTION
Data mining can be defined as the process of discovering hidden patterns in databases. Its main aim is to
turn data into knowledge. Association rule mining is one kind of data mining process; it extracts interesting
correlations, patterns, and associations among items in a transaction database or other data repositories.
Association rules are widely used in areas such as telecommunication networks, marketing, risk management,
and inventory control. In this paper, the Apriori algorithm is used to find the frequent item sets in a
database. This method finds the set of all possible combinations of items and then counts the support for
each of them. Parallel association rule mining can be categorized into two approaches [5,9]. The first is data
parallelism, in which the input data set is divided among the participating nodes to generate the rules. The
second divides the task among the nodes, so that each node accesses the whole input data set to generate
the rules.
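The level-wise search that Apriori performs can be illustrated with a minimal, sequential Python sketch. It is not the implementation used in this paper: the function name `apriori`, the use of an absolute count for `min_support`, and the in-memory list of transactions are all assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    Assumption for this sketch: min_support is an absolute count,
    and the whole database fits in memory.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count the support of every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune
        # any candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: one scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

Calling `apriori(transactions, 3)` on a list of item sets returns a dictionary that maps each frequent item set (as a `frozenset`) to its support count. The database is scanned once per level, which is the step a parallel version would distribute.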
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a
distributed computing environment. It is part of the Apache project sponsored by the Apache Software
Foundation. Hadoop was originally conceived on the basis of Google's MapReduce, in which an application is
broken down into numerous small parts [10]. Hadoop can give a distributed system much-needed robustness
and scalability, since it provides inexpensive and reliable storage. The Apache Hadoop
software library can detect and handle failures at the application layer, so it can deliver a highly available
service on top of a cluster of computers, each of which may be prone to failures.
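The way MapReduce breaks a job into small parts can be combined with the data-parallel scheme from the Introduction to count candidate supports. The Python sketch below simulates this in memory; it does not use the Hadoop API, and the names `map_phase`, `shuffle`, `reduce_phase`, and `count_support` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


def map_phase(partition, candidates):
    """Mapper: emit (candidate, 1) for each candidate found in a transaction."""
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                yield candidate, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, min_support):
    """Reducer: sum the counts per candidate and keep the frequent ones."""
    totals = {key: sum(values) for key, values in grouped.items()}
    return {key: n for key, n in totals.items() if n >= min_support}


def count_support(partitions, candidates, min_support):
    """Simulate one MapReduce job over a partitioned transaction database."""
    emitted = []
    # On a real cluster each partition would be handled by a different node;
    # here the partitions are processed one after another (an assumption).
    for partition in partitions:
        emitted.extend(map_phase(partition, candidates))
    return reduce_phase(shuffle(emitted), min_support)
```

Each partition is processed independently of the others, so only the (candidate, count) pairs need to be exchanged between nodes, not the transactions themselves.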
II. RELATED WORKS AND EXISTING MODEL
Nirali R. Sheth and J. S. Shah implemented an association-rule-based parallel data mining algorithm
that runs on a Hadoop cloud, a parallel storage and computing platform [1].