Improving Distributed Data Mining Techniques by Means of a Grid Infrastructure

Alberto Sánchez, José M. Peña, María S. Pérez, Víctor Robles, and Pilar Herrero

Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain

Abstract. Nowadays, the process of data mining is one of the most important topics in scientific and business problems. There is a huge amount of data that can help to solve many of these problems. However, data is geographically distributed in various locations and belongs to several organizations. Furthermore, it is stored in different kinds of systems and represented in many formats. In this paper, different techniques have been studied to make the data mining process easier in a distributed environment. Our approach proposes the use of a grid to improve the data mining process, owing to the features of this kind of system. In addition, we show a flexible architecture that allows data mining applications to be dynamically configured according to their needs. This architecture is made up of generic, data grid and specific data mining grid services.

Keywords: Data Mining, Grid Computing, Data Grid, Distributed Data Mining, Data Mining Grid.

1 Introduction

Data mining is a complex process. Two main characteristics highlight this complexity. First, there are many non-trivial tasks involved in a standard data mining process. These tasks involve different activities such as data preprocessing, rule induction, model validation and result presentation. A second determining factor of data mining problems is the volume of the datasets they deal with.

Modern data mining systems are state-of-the-art applications that use advanced distributed technologies like CORBA, DCOM or Java-oriented platforms (EJB, Jini and RMI) to distribute data mining operations on a cluster of workstations or even all over the Internet.
Distribution is a very important ally in the resolution of data mining problems. There are two main reasons to distribute data mining: (i) the efficient use of multiple processors to speed up the execution of heavy data mining tasks, and (ii) the existence of originally distributed data that cannot be integrated into a single database due to technical or privacy restrictions.

R. Meersman et al. (Eds.): OTM Workshops 2004, LNCS 3292, pp. 111–122, 2004. © Springer-Verlag Berlin Heidelberg 2004

The requirements of high performance data mining have been studied by some researchers. Maniatty, Zaki and others [26,38] collected the most important technological factors, both hardware and software, for data mining. Hardware support for redundant disks (RAID) and processor configurations (SMP computers and NUMA architectures) are mentioned. Among software contributions, parallel/distributed databases, parallel I/O and file systems are identified as appropriate data stores. Additional factors such as communication technologies like MPI (Message Passing Interface), CORBA, RMI or