Using Active Networks in Parallel Mining of Association Rules

Qin Ding and William Perrizo
Computer Science Department, North Dakota State University
Fargo, ND 58105-5164

Abstract

Association rule mining is an important data mining task. To improve efficiency, association rules can be mined in parallel. In parallel association rule mining, processors must communicate frequently to exchange local support counts or local frequent itemsets, depending on which parallel algorithm is used. To reduce this communication cost, we propose applying the active network technique to parallel mining of association rules. In an active network, the routers or switches of the network perform customized computations on the messages flowing through them. We propose two algorithms, AN-CD and AN-DD, which use an active node to calculate global counts and global frequent itemsets, respectively. Performance analysis shows that with these two algorithms the total number of messages is reduced from O(N^2) to O(N), where N is the total number of processors, thus decreasing the communication cost and the overall execution time. Our work is particularly useful for mining association rules over wide-area networks.

Keywords: Association rule mining, active networks, parallel mining

1. Introduction

Association rule mining is one of the important problems in data mining. An association rule expresses an association relationship among a set of objects. Association rule mining has broad applications in decision support and marketing strategy. A typical example of an association rule is that 90% of customers who purchase bread also purchase butter and milk. This example sounds like common-sense knowledge; however, there may be many associations that cannot be deduced from common knowledge [11].
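
To make the 90% figure in the bread example concrete, the following is a minimal sketch of how such a percentage could be computed from a transaction list. The figure corresponds to the confidence of the rule, which is defined formally in Section 2.1 together with support. The function names `support` and `confidence` and the toy transactions are assumptions made for illustration; they do not come from the paper.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Transaction],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    with_x = [t for t in transactions if antecedent <= t]
    if not with_x:
        return 0.0  # rule is undefined on this data; 0.0 is an arbitrary convention
    with_xy = sum(1 for t in with_x if consequent <= t)
    return with_xy / len(with_x)


# Invented toy data: 20 transactions, 10 of which contain bread,
# and 9 of those 10 also contain butter and milk.
toy = (
    [frozenset({"bread", "butter", "milk"})] * 9
    + [frozenset({"bread", "jam"})]
    + [frozenset({"eggs", "milk"})] * 10
)

X = frozenset({"bread"})
Y = frozenset({"butter", "milk"})

rule_confidence = confidence(toy, X, Y)  # 9 of the 10 bread transactions
rule_support = support(toy, X | Y)       # 9 of all 20 transactions
```

On this toy data the rule bread => {butter, milk} has confidence 0.9 and support 0.45. Counting occurrences of X ∪ Y is the step that a parallel algorithm has to distribute across processors.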

As databases grow larger, it becomes preferable to mine association rules in parallel. Several parallel association rule mining algorithms have been proposed [3, 8]; Count Distribution and Data Distribution are two of them. In the Count Distribution algorithm, individual processors first calculate local counts. To calculate the global counts, each processor needs to communicate with every other processor. In the Data Distribution algorithm, each processor generates local frequent itemsets and then exchanges them with all the other processors to obtain the global frequent itemsets. The communication cost increases as the number of processors increases. It also increases when the processors are connected by a wide-area network rather than a local-area network.

To address this problem, we can apply the active network technique to parallel association rule mining. Active networks are a novel approach to network architecture in which the switches of the network perform customized computations on the messages flowing through them [9]. These switches are called active nodes. The networks are active in the sense that nodes can perform computations on, and modify, the packet contents. In addition, this processing can be customized on a per-user or per-application basis. In contrast, the role of computation within traditional packet networks is extremely limited.

The idea of using active networks in parallel association rule mining is that each processor sends its message to the active node instead of to all the other processors. The active node then calculates the global counts or global frequent itemsets and sends the result to all the processors. This reduces the communication cost and the response time, and is particularly useful for mining association rules over wide-area networks.

The rest of the paper is organized as follows. In Section 2, we review some basic concepts and algorithms of association rule mining.
In Section 3, we describe the framework of the active network technique. In Section 4, based on the Count Distribution and Data Distribution algorithms, we propose two new algorithms, AN-CD and AN-DD, which apply the active network fusion technique to reduce communication cost. Performance analysis is given in Section 5. Finally, we conclude in Section 6.

2. Serial algorithm of association rule mining

2.1 Association Rules

The basic problem of finding association rules is introduced in [1]. Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items (called an “itemset”) such that T ⊆ I. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X => Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X => Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Given a set of transactions D, the problem of mining association rules is to generate all association rules that