Efficiently Clustering Transactional Data with Weighted Coverage Density

Hua Yan
Computational Intelligence Laboratory, School of Computer Science and Engineering
University of Electronic Science & Tech. of China, Chengdu, 610054, P.R. China
huayan@uestc.edu.cn

Keke Chen
College of Computing, Georgia Institute of Technology, Atlanta, GA30280, USA
kekechen@cc.gatech.edu

Ling Liu
College of Computing, Georgia Institute of Technology, Atlanta, GA30280, USA
lingliu@cc.gatech.edu

ABSTRACT
In this paper, we propose a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Our approach has three unique features. First, we use the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive and allows the weight of each item in a cluster to change dynamically with the occurrences of items. Second, we develop two evaluation metrics specific to transactional data clustering, based on the concepts of large transactional items and coverage density, respectively. Third, we implement the weighted coverage density clustering algorithm and the two clustering validation metrics in a fully automated transactional clustering framework called SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation). The SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods that automatically tune the two important parameters of the clustering algorithm: (1) the candidates for the best number K of clusters; and (2) the application of the two domain-specific cluster validity measures to select the best result from the set of clustering results.
We have conducted experimental evaluations using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.

Categories and Subject Descriptors
I.5.3 [Computing Methodologies]: Pattern Recognition - Clustering

General Terms
Algorithms

Keywords
Weighted Coverage Density, AMI, LISR, SCALE

CIKM'06, November 5-11, 2006, Arlington, Virginia, USA.
Copyright 2006 ACM 1-59593-433-2/06/0011.

1. INTRODUCTION
Transactional data is a special kind of categorical data that can be transformed into a traditional row-by-column table with Boolean values. Typical examples of transactional data are market basket data, web usage data, customer profiles, patient symptom records, and image features. Transactional data are generated by many applications in areas such as the retail industry, e-commerce, healthcare, and CRM. The volume of transactional data is usually large; therefore, there is great demand for fast yet high-quality algorithms for clustering large-scale transactional datasets.

A transactional dataset consists of N transactions, each of which contains a varying number of items. For example, t1 = {milk, bread, beer} and t2 = {milk, bread} are a three-item transaction and a two-item transaction, respectively. A transactional dataset can be transformed into a traditional categorical dataset (a row-by-column Boolean table) by treating each item as an attribute and each transaction as a row.
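The transformation described above can be sketched in a few lines of Python; the two example transactions follow the t1/t2 example in the text, and the variable names are illustrative, not from the paper.

```python
# Sketch of the transaction-to-Boolean-table transformation described above.
# Transactions t1 and t2 follow the example in the text.

transactions = [
    {"milk", "bread", "beer"},   # t1
    {"milk", "bread"},           # t2
]

# Each distinct item becomes a Boolean attribute (column), sorted for stable order.
items = sorted(set().union(*transactions))

# Each transaction becomes a row of 0/1 values, one column per item.
table = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)   # ['beer', 'bread', 'milk']
print(table)   # [[1, 1, 1], [0, 1, 1]]
```

With millions of transactions and thousands of distinct items, this table is both very wide and very sparse, which is exactly the efficiency problem the paper raises next.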
Although generic categorical clustering algorithms can be applied to the transformed Boolean dataset, two key features of the transformed data, large volume and high dimensionality, make existing algorithms inefficient. For instance, a market basket dataset may contain millions of transactions and thousands of items, while each transaction usually contains only tens of items. The transformation to Boolean data thus increases the dimensionality from tens to thousands, which poses a significant challenge to most existing categorical clustering algorithms in terms of both efficiency and clustering quality.

Recently, a number of algorithms have been developed for clustering transactional data by exploiting its specific features, such as LargeItem [21], CLOPE [23], and CCCD [22]. However, all of the existing proposals suffer from one obvious drawback: they require users to manually tune at least one or two parameters of the clustering algorithm in order to deter-