Discovering Decision Rules from Numerical Data Streams ∗

Francisco Ferrer-Troyano
Dept. of Computer Science, University of Seville, Spain
ferrer@lsi.us.es

Jesús S. Aguilar-Ruiz
Dept. of Computer Science, University of Seville, Spain
aguilar@lsi.us.es

José C. Riquelme
Dept. of Computer Science, University of Seville, Spain
riquelme@lsi.us.es

ABSTRACT

This paper presents a scalable learning algorithm to classify numerical, low-dimensionality, high-cardinality, time-changing data streams. Our approach, named SCALLOP, provides a set of decision rules on demand, which improves its simplicity and helpfulness for the user. SCALLOP updates the knowledge model every time a new example is read, adding interesting rules and removing out-of-date rules. Because the model is dynamic, it follows the current tendency of the data. Experimental results with synthetic data streams show good performance with respect to running time, accuracy, and simplicity of the model.

Keywords

Decision rules, scalable algorithms, data streams

1. INTRODUCTION

Medicine, meteorology, ATM transactions, retail chains, and scientific projects are some examples of fields where gigabytes of numerical data streams are generated daily and mined under the assumption that they hide valuable information. Progress in hardware storage and data-warehouse technologies allows modern organizations to collect vast amounts of data from proprietary and client case histories. The inherent nonstop data traffic among heterogeneous sources gives rise to noisy, missing, and inconsistent attribute values. In addition, when the data distribution is not stationary (for instance, when examples are collected over months), algorithms based on data partitioning techniques (instance/feature sampling) are highly susceptible to both underfitting and overfitting.
Furthermore, memory and time limitations compel such systems to give an approximate answer from a few scans (ideally only one), while ensuring that neither the result nor the performance is adversely affected by the order of the examples. Mining potentially infinite data sequences implies high computational cost and usually results in large, complex, and incomprehensible knowledge models, so interactive, user-controlled systems are increasingly being developed that, depending on the user's priorities, move toward less accurate but more comprehensible answers. For all these reasons, designing new scaling-up and scalable learning algorithms has become an important challenge in recent years [11].

This paper introduces a scalable classification algorithm named SCALLOP (Scalable Classification ALgorithm by Learning decisiOn Patterns) that provides a model on demand according to several user-defined parameters. In the next sections we describe the motivation and the basis of our approach, discussing its major drawbacks. Next we present experimental results on numerical data sets that show its performance when mining numerical, low-dimensionality, high-speed data streams.

∗ The research has been supported by the Spanish Research Agency CICYT under grant TIC2001-1143-C03-02.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'04, March 14-17, 2004, Nicosia, Cyprus.
Copyright 2004 ACM 1-58113-812-1/03/04 ...$5.00.

2. MOTIVATION

Many scalable learning algorithms are based on decision trees, modelling the whole search space hierarchically as disjoint hypercubes.
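To make the hypercube view concrete, the following is a minimal sketch of how an axis-parallel region over numerical attributes can be represented and tested. It is not taken from the paper: the names `Hypercube`, `covers`, `overlaps`, and `classify` are hypothetical, and the use of half-open intervals (so that adjacent regions stay disjoint) is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Hypercube:
    """Axis-parallel region: one half-open interval [low, high) per attribute.

    Hypothetical illustration; not the data structure used by SCALLOP.
    """
    bounds: List[Tuple[float, float]]
    label: str

    def covers(self, x: Sequence[float]) -> bool:
        # An example is covered when every attribute value lies in its interval.
        return len(x) == len(self.bounds) and all(
            low <= value < high for value, (low, high) in zip(x, self.bounds)
        )


def overlaps(a: Hypercube, b: Hypercube) -> bool:
    # Two hypercubes intersect only if their intervals intersect on every attribute.
    return all(
        a_low < b_high and b_low < a_high
        for (a_low, a_high), (b_low, b_high) in zip(a.bounds, b.bounds)
    )


def classify(regions: Sequence[Hypercube], x: Sequence[float]) -> Optional[str]:
    # With disjoint regions at most one hypercube covers x; None means uncovered.
    for region in regions:
        if region.covers(x):
            return region.label
    return None


# Two adjacent, disjoint regions over two numerical attributes.
left = Hypercube(bounds=[(0.0, 5.0), (0.0, 10.0)], label="A")
right = Hypercube(bounds=[(5.0, 10.0), (0.0, 10.0)], label="B")
regions = [left, right]
```

In this sketch a point outside every region yields `None`, which corresponds to a model that does not cover the whole search space; a decision tree, by contrast, assigns every point to some leaf region.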
The highly complex trees produced by these systems cast doubt on their suitability as a knowledge representation, since the user must explore paths several dozen levels deep to find interesting patterns. In addition, mining time-changing data streams may require rebuilding an out-of-date sub-tree, which increases the computational cost even further. Within incremental learning, a common approach to extracting the concepts to be learned consists of repeatedly applying the learner to a sliding window of w examples. An important issue in these approaches is finding the value of w that optimizes performance as a function of the input data [13].

Our proposal obtains a reduced set of updated decision rules sorted in order of relevance according to the user's demand. From several user-defined parameters, SCALLOP models only the regions whose characteristics interest the user, displaying the obtained rules visually (see Figure 1). Contrary to decision-tree-based approaches, the whole search space is not modelled. Using a window of size 1, the examples located inside the most influential regions turn into hypercubes and extend their limits up to the nearest regions with a different label. This approach makes the model initially unstable, since some rules could be wrongly expanded, intersecting regions with a different label whose examples have not been read yet. To stabilize the model, SCALLOP associates growth limits with each rule, which prevent the rule from being extended beyond them. These growth limits also provide an effective way to classify by voting with a reduced set of rules, unlike decision lists.

3. THE SCALLOP ALGORITHM

Classification is generally defined as follows. A finite input data set of training examples is given. Every training example is a pair e = (x, y), where x is a vector of m attribute values (each of which may be numeric or symbolic) and y is a discrete class value