Data Mining and Knowledge Discovery 3, 171–195 (1999)
© 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems

R.D. LAWRENCE    lawrence@watson.ibm.com
G.S. ALMASI      almasi@watson.ibm.com
H.E. RUSHMEIER   holly@watson.ibm.com
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598

Editor: Heikki Mannila

Abstract. We describe a scalable parallel implementation of the self-organizing map (SOM) suitable for data-mining applications involving clustering or segmentation of large data sets, such as those encountered in the analysis of customer spending patterns. The parallel algorithm is based on the batch SOM formulation, in which the neural weights are updated at the end of each pass over the training data. The underlying serial algorithm is enhanced to take advantage of the sparseness often encountered in these data sets. Analysis of a realistic test problem shows that the batch SOM algorithm captures the key features observed with the conventional on-line algorithm, at comparable convergence rates. Performance measurements on an SP2 parallel computer are given for two retail data sets and a publicly available set of census data. These results demonstrate essentially linear speedup for the parallel batch SOM algorithm, both for a memory-contained sparse formulation and for a separate implementation in which the mining data are accessed directly from a parallel file system. We also present visualizations of the census data to illustrate the value of the clustering information obtained via the parallel SOM method.

Keywords: parallel processing, parallel IO, scalable data mining, clustering, Kohonen self-organizing maps, data visualization

1. Introduction

The self-organizing map (SOM) (Kohonen, 1985, 1995) is a neural network model capable of projecting high-dimensional input data onto a low-dimensional (typically two-dimensional) array. This nonlinear projection produces a two-dimensional "feature map" that can be useful in detecting and analyzing features in the input space. SOM techniques have been applied successfully in a number of disciplines, including speech recognition (Kohonen, 1988), image classification (Lu, 1994), and document clustering (Lagus et al., 1996; Honkela et al., 1998). An extensive bibliography of SOM applications is given in (Kohonen, 1985) and is also available at (Kohonen et al., 1995).

Neural networks are most often used to develop models that predict or classify an output in response to a set of inputs to the trained network. Supervised learning is used to train the network on input data with known outputs. In contrast,