Class structure visualization with semi-supervised growing self-organizing maps

Arthur Hsu*, Saman K. Halgamuge
Department of Mechanical Engineering, University of Melbourne, Victoria 3010, Australia

Available online 8 July 2008

Keywords: Semi-supervised learning; Self-organizing maps; Class structure visualization; Partially labelled data

Abstract

We present a semi-supervised learning method for the growing self-organising map (GSOM) that allows fast visualisation of the data class structure on the 2D feature map. Instead of discarding data with missing values, the network can be trained from data with up to 60% of class labels and 25% of attribute values missing, while still making class predictions with over 90% accuracy on the benchmark datasets used. The proposed algorithm is compared with three variants of semi-supervised K-means learning on four real-world benchmark datasets and shows comparable performance and better generalisation.

© 2008 Published by Elsevier B.V.

1. Introduction

When all information regarding the measurement values and the class labels is known, supervised learning is the primary technique used for building classifiers. The term "supervised" comes from the fact that, during training, the classifier's predictions are compared with the known results and the errors are fed back to the classifier to improve its accuracy, like a supervisor guiding the training. In data mining terminology, supervised learning is also referred to as directed data mining. The classification problem has the goal of maximising the generalised classification accuracy, so that high prediction accuracy is obtained for both the training data and new data. Beyond merely boosting classification accuracy, it can often be useful to exploit an understanding of the class structure in the labelled data.
This can be done by supervised learning of topology-preserving networks such as self-organizing maps (SOM), where the complexity of the class structure, in terms of similarity and degree of overlap between classes, can be visually identified on the two-dimensional (2D) grid. However, complete data, with all entries labelled and without missing measurements, are often difficult and expensive to gather. It can therefore occur that the collected data are incomplete, missing either measurement values or labels. In classical supervised learning, these incomplete data entries are discarded, but there are many algorithms that can learn from partially labelled data (a dataset that contains items with both complete and incomplete information) [3,5,7,12] by combining unsupervised and supervised learning to make full use of the collected data. In many cases, a semi-supervised algorithm that uses both labelled and unlabelled data improves the performance of the resulting classifier. Learning from partially labelled data has therefore become an important area of research, and a recent workshop, the ICML 2005 LPCTD (Learning with Partially Classified Training Data) Workshop in Germany, was held with this theme.

Previous studies of the growing self-organizing map (GSOM) [1,10,11,16] have all focused on unsupervised clustering tasks. In this paper, we propose to fuse a modified form of the supervised learning architecture proposed by Fritzke [9] with the GSOM [2], thus taking advantage of a co-evolving topology-preserving network that provides instant data visualisation on a 2D network grid together with a supervised learning network for class structure visualisation. The modifications made to Fritzke's supervised learning architecture involve changes to the error calculation formula to enable processing of data that have missing labels. After these modifications, the algorithm becomes semi-supervised.
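The core idea, that the label term of the error contributes only when a label is actually present, can be sketched in a few lines. The following is a minimal illustration, not the GSOM growth algorithm or Fritzke's exact error formula: a fixed grid of weight vectors is updated from every sample (unsupervised part), while per-node class votes are updated only from labelled samples (supervised part).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy map: a 4x4 grid of weight vectors in a 3-D input space,
# plus per-node class-vote counters for 2 classes.
grid_h, grid_w, dim, n_classes = 4, 4, 3, 2
weights = rng.random((grid_h * grid_w, dim))
votes = np.zeros((grid_h * grid_w, n_classes))

def train_step(x, label, lr=0.1):
    """One semi-supervised step: the input part of the sample is always
    used; the label part is used only when a label is present."""
    bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    weights[bmu] += lr * (x - weights[bmu])   # unsupervised weight update
    if label is not None:                     # supervised part: skipped
        votes[bmu, label] += 1                # for unlabelled samples
    return bmu

# Mixed stream: some samples labelled, some not (label None).
samples = [(rng.random(dim), 0), (rng.random(dim), None), (rng.random(dim), 1)]
for x, y in samples:
    train_step(x, y)
```

With all labels present every step runs the supervised branch, and with all labels missing the loop reduces to plain unsupervised SOM-style training, mirroring the behaviour described above.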
Most importantly, when all labels are present it behaves identically to a supervised algorithm, and when all labels are missing it functions as an unsupervised one, thereby maximising the use of all information present in the data. Good reasons for using the GSOM as the topology-preserving network are:

- dynamic allocation of nodes to accommodate both complex class structure and data similarity;
- a constantly visualisable 2D grid for better and easier understanding of complexity, overlaps and data structure in the labelled data space;

Neurocomputing 71 (2008) 3124–3130. doi:10.1016/j.neucom.2008.04.049

* Corresponding author. E-mail addresses: alhsu@unimelb.edu.au (A. Hsu), saman@unimelb.edu.au (S.K. Halgamuge).
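The abstract also notes training with up to 25% of attribute values missing. One common way to use such records without discarding them, shown here purely as an illustrative sketch rather than the paper's own formula, is a partial distance that compares only the observed attributes and rescales by the fraction observed, so samples with different numbers of known values remain comparable during best-matching-unit search.

```python
import numpy as np

def partial_distance(x, w):
    """Euclidean-style distance that ignores missing (NaN) attributes of x,
    rescaled by the fraction of attributes actually observed."""
    mask = ~np.isnan(x)
    if not mask.any():
        return np.inf          # nothing observed: no usable distance
    d = x[mask] - w[mask]
    return float(np.sqrt((x.size / mask.sum()) * np.dot(d, d)))

x = np.array([0.2, np.nan, 0.9])   # one attribute missing
w = np.array([0.2, 0.5, 0.9])
print(partial_distance(x, w))      # → 0.0 (identical on observed dims)
```

A node whose weight vector matches the observed attributes exactly still scores distance zero, so partially measured samples can participate in training instead of being thrown away.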