Speeding up Algorithms of the SOM Family for Large and High-Dimensional Databases

Ernesto Cuadros-Vargas 1, Roseli Ap. Francelin Romero 1, Klaus Obermayer 2
1 ICMC-University of São Paulo, Cx Postal 668, São Carlos-SP, Brazil
2 Dept. of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, Germany
+55-16-273-9661, FAX +55-16-273-8118
{ecuadros, rafrance}@icmc.usp.br, oby@cs.tu-berlin.de

Keywords: Spatial Access Methods, Self-Organizing Maps

Abstract

In this paper, Spatial Access Methods (SAMs), such as the R-Tree and the k-d tree, are used to index data and thereby speed up the training process and performance of data analysis methods whose learning algorithms are based on competitive learning. The search for the winning neuron is usually performed sequentially, which requires a large number of operations. Instead of the common sequential determination of the winning neuron, which has computational complexity O(N) (where N is the number of candidate units), the approach proposed here finds the winning neuron in approximately log N steps. Results obtained by incorporating the k-d tree and the R-Tree into Self-Organizing Maps are presented and compared with a sequential implementation of SOM. The methods of the SOM family used are k-means, the Kohonen network, and the GNG network. Several databases have been used to demonstrate that a dramatic speed-up can be achieved, which is very significant when large-scale and high-dimensional databases are considered.

1 Introduction

Many artificial neural network models adopt competitive learning for updating their weights. This kind of learning algorithm searches for the winner among several units sequentially, which requires a large number of operations. Self-Organizing Maps (SOMs) are examples of such networks.
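To make the cost of this sequential winner search concrete, the following is a minimal sketch (not the paper's implementation; the unit weights, query point, and function name below are made up for illustration). Each input is compared against every unit, giving O(N) distance computations per data point.

```python
import numpy as np

def find_winner_sequential(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    best_idx, best_dist = -1, float("inf")
    for i, w in enumerate(weights):        # one comparison per unit -> O(N)
        d = np.sum((w - x) ** 2)           # squared Euclidean distance
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx

# Example: 5 units in 2-D; the query lies closest to unit 2.
weights = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5],
                    [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.45, 0.55])
winner = find_winner_sequential(weights, x)  # -> 2
```

In a training loop this search is repeated for every data point in every iteration, which is where the O(N) term dominates for large networks.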
They are methods for data analysis which combine the grouping of data with an embedding in a low-dimensional space for the purpose of visualization. SOMs include standard clustering algorithms (k-means) as the limiting case in which, for every data point, only one weight vector is updated per iteration. Let us now consider the case of metric data, i.e. the case in which every data point is characterized by a feature vector. Assignment of data points to clusters is performed by (i) calculating the distance between the data point and the weight vector of each unit, and (ii) assigning the data point to the unit whose weight vector is closest. In general, distance calculations have to be performed for every data point and for every iteration of the learning process, and they can be computationally expensive, in particular if the dimension of the data space is high. This search has computational complexity O(N), which can lead to long computation times when these methods are applied to large-scale databases.

On the other hand, Spatial Access Methods (SAMs) [11], which are used for information retrieval in large databases [1], perform a hierarchical partitioning of the input space and arrange the partitions in a tree structure. This tree can then be used as a search tree to retrieve objects, such as the weight vector of a unit in a SOM. If the tree is constructed efficiently, the number of visited nodes can be O(log_m(N)), where N is the number of vectors inserted in the neural network and m is the number of entries per node.

In [7], a family of dual k-d tree traversal algorithms was introduced for accelerating a wide class of statistical methods that are naively quadratic in the number of data points. However, only the k-d tree structure [2] was used, which is widely known to be an inefficient SAM because of the unbalanced trees it generates, and potential speed-ups during learning and adaptation were not addressed.
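A tree-based winner lookup of this kind can be sketched as follows. This is a hedged illustration using SciPy's k-d tree rather than the paper's own SAM integration (the paper's approach also covers R-Trees, which SciPy does not provide); the data and sizes are made up.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
weights = rng.random((1000, 8))      # 1000 units in an 8-dimensional space

# Hierarchically partition the units' weight vectors once, then answer
# nearest-neighbour (winner) queries by descending the tree: roughly
# log N node visits instead of N distance computations.
tree = cKDTree(weights)

x = weights[42] + 1e-6               # a query point very close to unit 42
dist, winner = tree.query(x, k=1)    # winner search via the index

# Sanity check against the O(N) sequential search:
seq_winner = int(np.argmin(np.sum((weights - x) ** 2, axis=1)))
assert winner == seq_winner
```

Note that when the weight vectors move during learning, the index must be updated, which is one of the adaptation issues the SAM-SOM approach addresses.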
One standard method of spatial indexing is the R-Tree [8], which is frequently taken as the main reference method for new SAMs. R-Trees provide good results if the dimension of the feature space is lower than approximately 20; for larger dimensions, other techniques such as OMNI [12] should be preferred. The reader is referred to [11, 3] for more information on these and other access methods.

Recently, a SAM-SOM family has been proposed by us [4], in which the R-Tree is used to dramatically reduce the number of comparisons of each of the N points in a dataset with each other point. An improved search procedure for SOM, based on R-Trees, was proposed, which reduces the computational complexity of the search for the winning neuron from O(N) to O(log N). In the present paper, this idea has been extended