Mining Rank-Correlated Sets of Numerical Attributes

Toon Calders
Department of Mathematics and Computer Science
University of Antwerp
toon.calders@ua.ac.be

Bart Goethals
Department of Mathematics and Computer Science
University of Antwerp
bart.goethals@ua.ac.be

Szymon Jaroszewicz
Szczecin University of Technology
National Institute of Telecommunications
Szachowa 1, 04-894, Warsaw, Poland
sjaroszewicz@wi.ps.pl

ABSTRACT
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall’s tau, and Spearman’s Footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems; I.2.6 [Artificial Intelligence]: Learning Knowledge Acquisition

General Terms: Algorithms, Experimentation, Theory.

Keywords: Data mining, Numerical, Rank Correlation.

1. INTRODUCTION
The motivation for the research reported upon in this paper is an application where we want to mine frequently occurring patterns in a meteorological dataset containing measurements from various weather stations in Belgium over the past few years. Each record contains a set of measurements (such as temperature or pressure) taken in a given station at a given time point, together with extra information about the stations (such as location or altitude).

The classical association rule framework [1], however, is not adequate to deal with numerical data directly.
Most previous approaches to association rule mining for numerical attributes were based on discretization; see, for example, [16]. Discretization, however, has serious disadvantages. First of all, it always incurs an information loss, since values falling in the same bucket become indistinguishable and small differences in attribute value become unnoticeable. On the other hand, very small changes in values close to a discretization border may cause unjustifiably large changes in the set of active rules. Second, if there are too many discretization intervals, discovered rules are replicated in each interval, making the overall trends hard to spot. It is also possible that rules will fail to meet the minimum support criterion when they are split among many narrow intervals.

In [16] a method for merging narrow intervals into wider ones was combined with a special scheme to prune spurious rules. The method, however, cannot entirely solve the problems related to discretization, since it is impossible to decide with certainty which rules were true associations and which were just artifacts of discretization. Also, information loss and instability at interval borders are inherent to discretization and cannot be eliminated entirely.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’06, August 20–23, 2006, Philadelphia, Pennsylvania, USA. Copyright 2006 ACM 1-59593-339-5/06/0008 ...$5.00.

To tackle the problem of mining the meteorological dataset without relying on discretization methods, we propose a new technique based on well-established statistical studies of rank correlation measures [12, 13].
More specifically, we propose to compare attributes by the rank their values impose on the records in the database. For example, for a given set of attributes, this can be done by counting the number of pairs of records such that all attributes rank the first record higher than the second record. When this number is high, it gives a clear indication that the attributes in the set behave similarly, and hence reveals an interesting pattern. As it turns out, this number is related to the well-known Kendall’s τ [12] rank correlation measure, which will be thoroughly explained in the next section.

Some examples of the types of rules we are able to discover are the following. Given two records t1 and t2:

If the altitude of the sun in t1 is higher than in t2, then temperature is likely to be higher as well.

If t1 comes from a weather station in Antwerp, and t2 from Brussels, and wind speed in t1 is higher than in t2, then it is likely that cloudiness is higher as well.

The main contributions of our paper are as follows:

1. We propose three new support measures for sets of numerical attributes, supp_τ, supp_ρ, and supp_F, which are based on well-known statistical rank correlation measures, namely Kendall’s τ, Spearman’s ρ, and Spearman’s Footrule F, respectively [12, 13].

2. We show how to combine the mining of sets of numerical attributes with ordinal and categorical attributes and how to extend it to association rules.