ELECTRONICS LETTERS 13th February 1997 Vol. 33 No. 4

Bayesian learning for self-organising maps

H. Yin and N.M. Allinson

Indexing terms: Bayes methods, Learning systems

An extended self-organising learning scheme is proposed, namely the Bayesian self-organising map (BSOM), in which both the distance measure and the neighbourhood function are replaced by the neurons' 'on-line' estimated posterior probabilities. In a Bayesian inference sense, these posteriors gradually sharpen the estimates of the input distribution and model parameters, about which there is generally little prior knowledge. The BSOM has been used successfully to learn the underlying mixture distribution of input data, and hence to form an optimal pattern classifier.

Introduction: The self-organising map (SOM) [1] has been shown to be potentially optimal for vector quantisation (VQ) [2]. When used as a classifier, however, it will not yield optimal Bayesian classification unless the input data are uniformly distributed or the patterns are well separated. Like the k-means algorithm, the SOM often produces more classification errors than a Bayes classifier. To form a better classifier, the SOM needs some form of supervision to label and adjust pre-trained weights, as in LVQ1, LVQ2 and LVQ3 [1], or via an additional supervised linear layer. In the following, a simple and efficient method is proposed that incorporates Bayesian learning within the SOM structure, so that the network can estimate the underlying mixture distribution in an unsupervised manner.

Bayesian self-organising maps: The joint pattern distribution in many pattern classification problems can be modelled as a mixture distribution, consisting of class-conditional probability density functions (PDFs) weighted by mixing parameters [3]. In practice, the form of the class-conditional PDFs may be known, or can be assumed, but the model parameters of each PDF and the mixing weights must be estimated from data samples. In a supervised scheme, i.e. when each data sample carries its class label, estimating these model parameters and mixing weights is trivial. In unsupervised learning, however, the task is very difficult, especially when the class-conditional PDFs overlap.

Like the k-means algorithm, the SOM either implies equal priors and uniform conditionals or makes no use of prior or posterior information about each class and its PDF. This can make the SOM suboptimal in classification applications. An optimal classifier needs a good estimate of the pattern distributions. Bayesian inference estimates the class posteriors on-line from the input samples, so that the estimates of the class PDFs and model parameters can be improved gradually from their poor or coarse initial estimates of the priors [4]. In the following, such Bayesian learning is incorporated within the SOM through its neighbourhood function. The posterior distribution of each class, assumed equal and relatively flat at the beginning, is estimated on-line and gradually sharpened as more data become available. This posterior constrains the neighbourhood function, and thus limits each neuron's learning from the input to its own posterior share.
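For reference, a minimal statement of the mixture model and of the class posterior that such Bayesian inference provides is given below; the notation (K classes c_i, mixing weights P(c_i), class-conditional densities p(x | c_i)) is introduced here for illustration only.

% Mixture model: class-conditional PDFs weighted by mixing parameters
p(\mathbf{x}) = \sum_{i=1}^{K} P(c_i)\, p(\mathbf{x} \mid c_i)

% Bayes' rule: the class posterior estimated on-line from input samples
P(c_i \mid \mathbf{x}) = \frac{P(c_i)\, p(\mathbf{x} \mid c_i)}{\sum_{j=1}^{K} P(c_j)\, p(\mathbf{x} \mid c_j)}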
When the input data form a Gaussian mixture, the mean vector and covariance matrix are sufficient statistics for each class-conditional distribution. Each neuron therefore acts as a Gaussian kernel, with weights not only for the mean but also for the covariance matrix and the prior, so that the network can learn such a mixture. Such an extended SOM algorithm was first proposed by the authors [5], and is termed here the Bayesian SOM (BSOM).

For a K-component mixture of Gaussians, the network Y places K units in the input space X. Each unit i is a Gaussian kernel, with its mean vector \hat{\mathbf{m}}_i, covariance matrix \hat{\Sigma}_i and prior (or mixing parameter) \hat{P}(c_i) as learning weights. At each time step n, a sample, denoted x(n), is drawn randomly from X. A winner \nu is chosen as the unit with the largest kernel output, i.e. the largest estimated posterior probability:

\nu = \arg\max_{i} \hat{P}(c_i \mid \mathbf{x}(n))

Then, within a neighbourhood \eta_\nu of the winner, the weights are updated according to the following rules:

\hat{\mathbf{m}}_i(n+1) = \hat{\mathbf{m}}_i(n) + \alpha(n)\, \hat{P}(c_i \mid \mathbf{x}(n))\, [\mathbf{x}(n) - \hat{\mathbf{m}}_i(n)]

\hat{\Sigma}_i(n+1) = \hat{\Sigma}_i(n) + \alpha(n)\, \hat{P}(c_i \mid \mathbf{x}(n))\, \{ [\mathbf{x}(n) - \hat{\mathbf{m}}_i(n)][\mathbf{x}(n) - \hat{\mathbf{m}}_i(n)]^{\mathrm{T}} - \hat{\Sigma}_i(n) \}

\hat{P}(c_i)(n+1) = \hat{P}(c_i)(n) + \alpha(n)\, [\hat{P}(c_i \mid \mathbf{x}(n)) - \hat{P}(c_i)(n)]

where the adaptive gain \alpha(n) is the same as in the original SOM, \nu is the winner at time n, and \eta is the neighbourhood's effective range. All estimated prior probabilities, however, have to be updated at each iteration in order to keep the constraint \sum_{i=1}^{K} \hat{P}(c_i) = 1.

As can be seen, the neighbourhood function has been replaced by an on-line estimated posterior probability, and it operates throughout the entire learning process. The learning in the BSOM is proportional to the posterior of the corresponding neuron, which is defined in the input space rather than in the neuron space and is a function of the input data and the model parameters. In theory, every neuron should undergo such updating at each iteration if the class-conditional PDFs are not clearly bounded. However, the topological ordering (even locally) property of the SOM ensures that the posteriors of the components outside the neighbourhood are very small, so restricting the updates to the neighbourhood introduces negligible error.

The convergence of the BSOM algorithm can be established by extending an SOM convergence proof [2] and by comparing the BSOM with Luttrell's hierarchical VQ theorem [6]. Just as the E-M method [7] improves the k-means algorithm, the BSOM improves the original SOM's pattern learning and classification ability at little additional computational cost. However, the E-M method is a batch updating process, whereas the BSOM is an adaptive on-line learning process.

Experimental results: (i) Texture classification: Two texture images from the Brodatz album [8], Treebark (D12) and Pigskin (D92), were used in this test. Markov random field (MRF) parameters estimated over 30 × 30 pixel windows were used as texture features, and 50 such feature sets were taken at random from each texture. Only the first-order parameters were used in the test, giving a clear view of the overlapping clusters in two dimensions, as shown in Fig. 1.

To apply the BSOM algorithm, the initial means were set randomly within the data range, the initial priors were set equal, and the initial covariance matrices were also set equal, with fairly large diagonal entries (comparable to the data variance).

Fig. 1 Feature distributions (first-order MRF parameters); × Pigskin, + Treebark
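For concreteness, the on-line learning loop described above can be summarised by the following minimal sketch in Python; the function and variable names (gaussian_pdf, bsom_step, post) are ours, the updates are shown for all units rather than only the winner's neighbourhood, and the gain schedule is left to the caller, so this is an illustrative sketch rather than the letter's exact implementation.

import numpy as np

def gaussian_pdf(x, m, S):
    """Kernel output of one unit: the Gaussian density N(x; m, S)."""
    d = x.size
    diff = x - m
    norm = 1.0 / np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(S))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff)

def bsom_step(x, means, covs, priors, alpha):
    """One on-line BSOM update with sample x.

    means  : (K, d) estimated mean vectors
    covs   : (K, d, d) estimated covariance matrices
    priors : (K,) estimated mixing parameters, summing to 1
    alpha  : adaptive gain, decreasing with time as in the original SOM
    """
    K = priors.size
    # Prior-weighted kernel outputs, normalised to posteriors (Bayes' rule)
    joint = np.array([priors[i] * gaussian_pdf(x, means[i], covs[i]) for i in range(K)])
    post = joint / joint.sum()
    winner = int(np.argmax(post))  # winner = unit with the largest posterior

    # Posterior-weighted updates of the sufficient statistics; in the BSOM the
    # mean and covariance updates are confined to the winner's neighbourhood
    for i in range(K):
        diff = x - means[i]
        means[i] += alpha * post[i] * diff
        covs[i] += alpha * post[i] * (np.outer(diff, diff) - covs[i])
    # Priors are updated for every unit, so they continue to sum to 1
    priors += alpha * (post - priors)
    return winner

Iterating bsom_step over randomly drawn samples with a decreasing gain yields on-line estimates of the mixture parameters, which can then be used directly for Bayesian classification.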