A Bitmap-Based Apriori Algorithm on Graphics Processors

Viraj Sandaruwan, Kaushalya Madhawa, Madawa Jeewananda, Chaminga Malmi
Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka

Abstract
Modern data mining is an extremely challenging task because the amount of information available worldwide grows exponentially. Addressing this requires more powerful and efficient computing solutions. In this paper we present a novel parallel data mining solution based on the Apriori algorithm, which utilizes the massively parallel SIMD (Single Instruction, Multiple Data) architecture of GPUs (Graphics Processing Units). The solution employs both the CPU and the GPU to process the data, and it uses bitmaps to enhance performance. The GPU carries out the bitmap operations, while the CPU is responsible for candidate itemset generation, handling inputs to and outputs from the GPU, and other tasks.

Keywords: association rule mining; apriori; general-purpose GPU; data mining; parallel

I. INTRODUCTION
Association rule learning is a popular data mining technique used in various fields. It focuses on discovering interesting relations between variables in large databases. The Apriori algorithm is the most popular technique used in association rule learning. Given a collection of itemsets (for instance, a set of retail transactions, each listing the individual items purchased), Apriori attempts to find the subsets that occur at least a minimum threshold number of times. Apriori uses a bottom-up approach in which frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The performance of data mining has been improved by using various parallel architectures. In this paper we present a data mining system that utilizes the parallelism of modern GPUs.
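The bottom-up loop described above (extend frequent subsets by one item, test the candidates, stop when nothing survives) can be sketched in a few lines of Python. This is a minimal CPU-only illustration, not the GPU implementation presented in this paper; the function name `apriori`, the parameter `min_count`, and the example data in the usage note are assumptions introduced here.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        # and keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Test the candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1  # the loop ends when no candidate reaches the threshold

    return result
```

For example, calling `apriori([{"bread", "milk"}, {"bread", "milk", "butter"}, {"milk"}], 2)` on this invented data returns the single items `bread` and `milk` and the pair `{bread, milk}`; `butter` occurs only once and is pruned at the first level.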
GPUs are massively multithreaded many-core processors. Unlike multi-core CPUs, the cores of the GPU are virtualized. GPU threads are managed by the hardware and are executed in a SIMD (Single Instruction, Multiple Data) pattern. The GPU is a co-processor to the CPU. Therefore, the system uses a CPU-based storage and buffer manager to handle disk I/O as well as data transfers between GPU and CPU memory. Because the SIMD pattern is best exploited with aligned, sequential data access, we use bitmap data structures to represent transactions in our system. Furthermore, we use a dynamic array to store the count for each itemset, since the number of itemsets varies.

II. RELATED WORK
A. Association rule mining
The problem of association rule mining was first introduced by Agrawal et al. in 1993 [1]. Rule mining in the Apriori algorithm is based on the observation that every subset of a frequent itemset must itself be frequent. The algorithm extends frequent itemsets by one item at a time and tests the candidates against the data. It terminates when no further successful extension is possible. An example of such an association rule is that 90% of transactions that purchase bread and milk also purchase butter. The problem is defined formally as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ binary attributes called items. Let $D = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions called the database. Each transaction $t_i$ in $D$ has a unique transaction ID and contains a subset of the items in $I$. A rule is defined as an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The sets of items (itemsets for short) $X$ and $Y$ are called the antecedent (left-hand side, or LHS) and the consequent (right-hand side, or RHS) of the rule, respectively. A $k$-itemset, which consists of $k$ items from $I$, is frequent if it occurs in $D$ not less than $s$ times, where $s$ is a user-specified minimum support threshold and $s \le n$.
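As an illustration of the bitmap representation mentioned in the introduction and of the frequency test just defined, the following Python sketch stores one bitmap per item, with bit $i$ set when transaction $i$ contains that item. The occurrence count of an itemset is then the number of set bits in the bitwise AND of its items' bitmaps. This is a CPU-side sketch under stated assumptions, not the GPU kernel of this paper; the names `build_bitmaps`, `occurrence_count`, and `is_frequent` are hypothetical, and Python integers stand in for the aligned bitmap arrays a GPU would use.

```python
def build_bitmaps(transactions):
    """Map each item to an int whose bit i is set iff transaction i contains it."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def occurrence_count(itemset, bitmaps, num_transactions):
    """Count the transactions that contain every item of the itemset."""
    acc = (1 << num_transactions) - 1  # start with all transactions selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)  # keep only transactions that have this item
    return bin(acc).count("1")  # population count of the surviving bits


def is_frequent(itemset, bitmaps, num_transactions, s):
    """An itemset is frequent if it occurs not less than s times."""
    return occurrence_count(itemset, bitmaps, num_transactions) >= s
```

The AND-then-count step is the part that maps naturally onto SIMD hardware, because every word of the bitmaps is processed with the same instruction and memory is read sequentially.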
The support $\mathrm{supp}(X)$ of an itemset $X$ is defined as the proportion of transactions in the data set that contain the itemset. The confidence of the rule $X \Rightarrow Y$ is defined as $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$.
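The two definitions can be checked on a small example. The Python sketch below computes support as a proportion of transactions and confidence as the ratio of two supports; the function names `supp` and `conf` mirror the notation above, and the transactions in the usage note are invented for illustration.

```python
def supp(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def conf(x, y, transactions):
    """Confidence of the rule X => Y: supp(X union Y) / supp(X)."""
    both = frozenset(x) | frozenset(y)
    return supp(both, transactions) / supp(x, transactions)
```

With the five transactions `{bread, milk, butter}`, `{bread, milk}`, `{bread, butter}`, `{milk, butter}`, and `{bread, milk, butter}`, the itemset `{bread, milk}` has support 3/5, and the rule `{bread, milk} => {butter}` has confidence (2/5) / (3/5) = 2/3. The confidence is undefined when the antecedent never occurs, in which case this sketch raises `ZeroDivisionError`.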