Applied Intelligence 20, 21–35, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The United States.

Nearest-Neighbours for Time Series

JUAN MANUEL GIMENO ILLA
Departament d'Informàtica i Enginyeria Industrial, Universitat de Lleida
jmgimeno@eup.udl.es

JAVIER BÉJAR ALONSO AND MIQUEL SÀNCHEZ MARRÉ
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya
bejar@lsi.upc.es
miquel@lsi.upc.es

Abstract. This paper presents an application of lazy learning algorithms in the domain of industrial processes. These processes are described by a set of variables, each corresponding to a time series. Each variable plays a different role in the process, and some mutual influences can be discovered. A methodology for studying the different variables and their roles in the process is described. This methodology gives structure to the study of the time series. The prediction methodology is based on a k-nearest neighbour algorithm. A complete study of the different parameters of this kind of algorithm is carried out, including data preprocessing, neighbour distance, and weighting strategies. An alternative to Euclidean distance, called shape distance, is presented; this distance is insensitive to scaling and translation. Alternative weighting strategies based on time series autocorrelation and partial autocorrelation are also presented. Experiments using autoregressive models, simulated data, and real data obtained from an industrial process (wastewater treatment plants) are presented to show the feasibility of our approach.

Keywords: lazy learning, distance functions, feature weighting

1. Introduction

In recent years, there has been an increasing effort in the machine learning and data mining communities to deal with problems in which the temporality of the data is not only important but unavoidable (learning from stock market data, profiling web-surfer behaviour, etc.).
Among these domains, there are many industrial processes involving human control and interaction, in which time-series forecasting is important. Traditional methods from statistics have been applied to these problems (Box-Jenkins methodology [1], Kalman filtering [2]) and, recently, some methods from statistical learning theory (Support Vector Machines [3]).

The aim of our research is to add predictive abilities to a multi-strategic learning architecture (DAI-DEPUR [4]). This architecture is based mostly on case-based reasoning [5] and, to select the case corresponding to the current situation, it uses an algorithm similar to nearest-neighbours. Moreover, both algorithms work with the raw data generated by direct measurements of the system being monitored. Both methods are lazy learners [6] and use local information in order to predict. They have been applied in many research areas. This kind of learning does not make any hypotheses about the model of the data (it only stores past values without making any assumption about them [7]).

Lazy learning methods are well suited to the prediction of time series because they do not assume a global model for the dynamic process underlying the time series. This allows to apply this kind of