Regression on Evolving Multi-Relational Data Streams

Elena Ikonomovska
Advisor: Sašo Džeroski
Institute Jožef Stefan
Jamova cesta 39
1000 Ljubljana
Slovenia

ABSTRACT
In the last decade, researchers have recognized the need for increased attention to a type of knowledge discovery application where the data analyzed is not finite, but streams into the system continuously and endlessly. Data streams are ubiquitous, entering almost every area of modern life. As a result, processing, managing and learning from multiple data streams have become important and challenging tasks for the data mining, database and machine learning communities. Although a substantial body of algorithms for processing and learning from data streams has been developed, most of the work is focused on one-dimensional numerical data streams (time series) or a single multi-dimensional data stream. Only a few of the existing solutions consider the most realistic scenario, where data can be incomplete, can be correlated with other streams of information, and can arrive from multiple heterogeneous sources.

This paper discusses the requirements and the difficulties of learning from multiple multi-dimensional data streams interlinked according to a pre-defined semantic schema (multi-relational data streams). The main research problem is to develop a time-efficient, resource-aware methodology for linking and exploring information that arrives independently and asynchronously from its respective sources. The resulting framework has to enable, at any time, error-bounded approximate answers to aggregate queries commonly issued in the process of multi-relational data mining. In particular, we focus on the task of learning regression trees and their variants (model trees, option trees, multi-target trees) from multiple correlated streaming sources.
To the best of our knowledge, no other work has previously addressed the problem of learning regression trees from multi-relational data streams.

Keywords
Regression trees, multi-relational data streams, any-time learning, one-pass approximate summaries

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT/ICDT PhD 2011, March 25, 2011, Uppsala, Sweden.
Copyright 2011 ACM 978-1-4503-0696-6/11/03 ...$10.00.

1. INTRODUCTION
Most of the existing algorithms for learning on data streams are designed for data represented in the standard attribute-value (propositional) format, i.e., a stream of tuples continuously populating a single data table. However, in most real-world and real-time knowledge discovery applications the data is structured and is usually represented by multiple correlated data streams. Consider as an example a real-time financial analysis problem, where one needs to take into account a fast stream of transactions and a stream of reclamations, attached to a much slower stream of customers (new and recurring ones) and a table of products which is mostly static over longer periods of time.

It is not difficult to see that, in order to properly analyze real-world data, it is necessary to take into account all the information available, as well as the relations that naturally exist among the data entities (e.g., the information on user purchases, navigation habits, personal data, relations to other users and so on). The enriched structure of the input can greatly improve the ability to extract useful knowledge, but makes the learning task highly challenging.
The increased complexity of the input space results in an explosion of possible hypotheses that a learning algorithm needs to examine.

The data stream setting by itself raises a number of issues which are not easily solved by traditional machine learning and data mining algorithms, such as the one-pass requirement of processing each learning example in constant time and with limited memory. A straightforward adaptation of incremental learning is not an effective solution to the problem. Algorithms must be computationally efficient, resource-aware, adaptive and robust. To successfully deal with the evolving nature of data streams, models need to be continuously monitored and updated in real time.

The biggest challenge, however, is linking the information which is streaming from several different sources. The difficulty of the task is mainly due to the fact that streams have different update speeds and are most often not synchronized. Most of the existing work assumes that all the information required for learning (defined by the structure of the input) arrives simultaneously, or employs computationally heavy methods for propositionalization over windows of the most recent facts. We argue that the existing approaches do not consider all the aspects of the problem or are unable to leverage the power of multi-relational learning on data