DOI: 10.4018/IJIRR.2017100103
International Journal of Information Retrieval Research
Volume 7 • Issue 4 • October-December 2017
Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Frequent Itemset Mining in
Large Datasets: A Survey
Amrit Pal, Indian Institute of Information Technology, Allahabad, India
Manish Kumar, Indian Institute of Information Technology, Allahabad, India
ABSTRACT
Frequent itemset mining is a well-known area of data mining. Most of the available techniques for
frequent itemset mining require complete information about the data, from which association rules
can then be generated. The amount of data is increasing day by day and taking the form of Big Data,
which requires changes to the algorithms so that they can work on such large-scale data. Parallel
implementation of the mining techniques can provide a solution to this problem. This paper surveys
frequent itemset mining techniques that can be used in a parallel environment. Programming models
such as Map Reduce provide an efficient architecture for working with Big Data; the paper also
discusses the issues and feasibility of implementing each technique in such an environment.
KEYWORDS
BigData, Count, Frequent Itemset, HDFS, Map Reduce, Mapper, Reducer, Set
INTRODUCTION
The amount of data is increasing day by day, and this growth poses some basic challenges for
frequent itemset mining algorithms. As the size of the data increases, the time required to process
it also increases. Millions of customers visit Walmart daily, resulting in the generation of
millions of transactions. Every hour Walmart generates approximately 2.5 petabytes of data
(DeZyre, 2016). Social network websites generate huge amounts of unstructured data daily.
Managing this huge amount of unstructured data using conventional techniques is a challenging
task. When the amount of data becomes so large that it is difficult to manage with conventional
data management systems, it is called Big Data (Manyika, 2011). Transaction datasets are also
increasing in size and taking the shape of Big Data. Algorithms such as Apriori (Agrawal, 1994)
and FP-Growth are available for mining frequent itemsets from transactional datasets. Frequent
itemsets can be mined from transactional datasets using either sequential or parallel approaches.
Most of the available frequent itemset mining algorithms take the sequential approach.
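To make the contrast between the two approaches concrete, the following is a minimal Python sketch; it is not taken from any of the surveyed algorithms. It counts how many transactions contain each item (the item's support count), first in a single sequential pass and then over partitions whose partial counts are merged, which loosely mirrors the mapper and reducer roles of a Map Reduce job. The toy transactions, the function names, and the threshold `min_support` are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Toy transactions; the items are made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]


def count_sequential(transactions):
    """Sequential approach: one pass over all transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(t)  # each item in t is counted once per transaction
    return counts


def map_partition(partition):
    """Mapper-like step: local support counts for one partition."""
    return count_sequential(partition)


def reduce_counts(a, b):
    """Reducer-like step: merge two partial count tables by summing."""
    return a + b


def count_partitioned(transactions, n_parts=2):
    """Parallel-style approach: count per partition, then merge."""
    # Round-robin split; in a real cluster each part would live on a node.
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    partials = [map_partition(p) for p in parts]  # independent, parallelizable
    return reduce(reduce_counts, partials, Counter())


def frequent_items(counts, min_support):
    """Keep items whose support count reaches the (assumed) threshold."""
    return {item for item, c in counts.items() if c >= min_support}
```

On this toy data both counting routes give the same table, and with `min_support = 3` the frequent items are bread and milk. The sketch only covers single items; counting larger itemsets in partitions raises the candidate-generation issues that the surveyed techniques address.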
There are some basic requirements in processing data for frequent itemsets: counting the total
number of transactions, counting the different items in an itemset, maintaining a list of items,
and completely scanning the dataset. The basic operation in frequent itemset mining is
calculating the support of each itemset. Algorithms are required to scan the