Parallelism in Knowledge Discovery Techniques

Domenico Talia

DEIS, Università della Calabria, Via P. Bucci, 41c, 87036 Rende, Italy
talia@deis.unical.it

Abstract. Knowledge discovery in databases, or data mining, is the semi-automated analysis of large volumes of data, looking for the relationships and knowledge that are implicit in them and are 'interesting' in the sense of impacting an organization's practice. Data mining and knowledge discovery on large amounts of data can benefit from the use of parallel computers to improve both performance and the quality of data selection. This paper presents and discusses different forms of parallelism that can be exploited in data mining techniques and algorithms. For the main data mining techniques, such as rule induction, clustering algorithms, decision trees, genetic algorithms, and neural networks, the possible ways to exploit parallelism are presented and discussed in detail. Finally, some promising research directions in the parallel data mining research area are outlined.

1 Introduction

Today information overload is as much a problem as the shortage of information. In our daily activities we often deal with flows of data much larger than we can understand and use. Thus we need a way to sift those data to extract what is interesting and relevant for our activities. Knowledge discovery in databases, also called data mining, is the semi-automated analysis of large volumes of data, looking for the relationships and knowledge that are implicit in them and are 'interesting' in the sense of impacting an organization's practice. Research and development work in the area of knowledge discovery and data mining concerns the study and definition of techniques, methods, and tools for the extraction of novel, useful, and implicit patterns from data. Knowledge discovery in large data repositories can find what is interesting in them and represent it in an understandable way [3].

Mining large data sets requires large computational resources. In fact, data mining algorithms working on very large data sets take a very long time on conventional computers to produce results. One approach to reducing response time is sampling. However, in some cases reducing the data might result in inaccurate models, and in other cases sampling is not useful at all (e.g., outlier identification). The other approach is parallel computing. High performance computers and parallel data mining algorithms can offer a very efficient way to mine very large data sets [8] [17] by analyzing them in parallel. It is not uncommon to have sequential data mining applications that require several days or weeks to complete their task. Parallel computing systems can

J. Fagerholm et al. (Eds.): PARA 2002, LNCS 2367, pp. 127–136, 2002. © Springer-Verlag Berlin Heidelberg 2002