A Matrix Approach for Association Mining *

Alaaeldin M. Hafez 1 and Vijay V. Raghavan 1

* This research was supported in part by the U.S. Department of Energy, Grant No. DE-FG02-97ER1220.
1 ahafez(raghavan)@cacs.louisiana.edu, The Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA 70504-4330, USA.

Abstract

Association mining, a class of data mining techniques, is one of the most researched fields in data mining, where algorithms are designed to discover rules that reflect dependencies among values of an attribute. Because of the vast amounts of data that businesses store, most association mining algorithms are computationally expensive, since they make many passes over the data. Besides sequential processing environments, implementations of data mining ideas should also consider parallel computing environments. In this paper, a new technique is presented to perform association mining based on the matrix approach. The new technique can be applied in both sequential and parallel environments. In the proposed technique, the data records are scanned only once to construct a frequency vector and a binary association matrix. Two algorithms, one for generating only maximal large item-sets and the other for generating all large item-sets, are presented. The number of disk accesses, CPU time, and memory space needed for generating large item-sets are O(n), O(N^2), and O(N), respectively, where n is the number of input transactions and N is the number of transaction groups.

Keywords: Data Mining, Association Mining, Parallel Association Mining, On-line Query Processing.

1 Introduction

Knowledge discovery in databases is the process of identifying useful and novel information in vast amounts of data. Data mining is considered the main step in the knowledge discovery process; it is concerned with the algorithms used to extract potentially valuable patterns, associations, trends, sequences and dependencies in data.
Association mining is one of the central tasks in data mining. It is the process of producing association rules that express positive connections between attributes in a 0/1 matrix. An example of such a rule is one stating that if a customer buys milk and bread, then with 70% confidence the customer also buys beer. Most association mining algorithms are computationally expensive when handling large data sets. Many algorithms have been proposed to generate association rules that satisfy certain measures. Most of them do not support on-line ad hoc queries and/or parallel processing [1, 2, 3, 4, 5, 6, 9, 11]. Very few techniques [7, 8, 10] are specially designed and implemented for parallel processing, and even those offer no advantage for on-line ad hoc queries.

In this work, our main concern is to develop new scalable techniques that are efficient and portable, whose implementations can be adapted to a variety of hardware platform environments, such as sequential and parallel, and can also be used for handling on-line queries. Specifically, we introduce a new association mining technique that is based on the matrix approach. The proposed technique focuses on the performance of both on-line query processing and batch processing. In our approach, the transaction file is scanned only once. As a result of that scan, transaction groups (a transaction group is defined as all transactions having the same item-set) and their frequencies are identified. Two algorithms are presented: Generate large item-sets, which generates all large item-sets, and Generate maximal large item-sets, which generates only the maximal large item-sets. Both techniques start with the distinct transaction groups as item-sets, and use the matrix approach to calculate their supports.
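The single scan described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and the matrix used here (entry 1 when one group's item-set is contained in another's) is an assumption about what a binary association matrix could look like, since its exact definition is not given at this point in the paper. Under that assumption, the support of a group's item-set is its matrix row multiplied by the frequency vector.

```python
from collections import Counter


def build_transaction_groups(transactions):
    """One pass over the transactions: identical item-sets collapse into one group."""
    counts = Counter(frozenset(t) for t in transactions)
    # Deterministic order: smaller item-sets first, ties broken alphabetically.
    groups = sorted(counts, key=lambda g: (len(g), sorted(g)))
    freq = [counts[g] for g in groups]  # frequency vector, one entry per group
    return groups, freq


def containment_matrix(groups):
    """Assumed binary matrix: entry [i][j] is 1 when group i is a subset of group j."""
    return [[1 if gi <= gj else 0 for gj in groups] for gi in groups]


def group_supports(groups, freq):
    """Support of each group's item-set: matrix row times the frequency vector."""
    matrix = containment_matrix(groups)
    return {
        g: sum(a * f for a, f in zip(row, freq))
        for g, row in zip(groups, matrix)
    }


# Hypothetical toy data: four transactions that form three transaction groups.
transactions = [
    ["milk", "bread"],
    ["bread", "milk"],
    ["milk", "bread", "beer"],
    ["beer"],
]
groups, freq = build_transaction_groups(transactions)
supports = group_supports(groups, freq)
```

In this sketch the transactions are read once, and all later work touches only the N groups, so building the matrix takes on the order of N^2 subset tests. The sketch stores the full matrix for clarity; it does not attempt to reproduce the O(N) memory bound stated in the abstract.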
The only difference between the two techniques is that, in the former (Generate large item-sets), we explore large item-sets more completely by applying intersections among large item-sets and between large and small item-sets.

The rest of this paper is organized as follows. Section 2 defines the problem. The Generate large item-sets and the Generate maximal large item-sets techniques are given in sections 3 and 4, respectively. Section 5 evaluates the matrix approach, and section 6 concludes the paper.

2 Problem Definition

Association mining, which discovers dependencies among values of an attribute, was introduced by Agrawal et al. [1] and has emerged as an important research area. The problem of association mining, also referred to as the market basket problem, is formally defined as follows. Let I = {i_1, i_2, ..., i_m} be a set of items and S = {s_1, s_2, ..., s_n} be a set of transactions, where each transaction s_i ∈ S is a set of items, that is, s_i ⊆ I. An association rule denoted