Choosing representative data items: Kohonen, Neural Gas or Mixture Model? A case study of erosion data

ANNA BARTKOWIAK 1, JOANNA ZDZIAREK 1, NIKI EVELPIDOU 2, ANDREAS VASSILOPOULOS 2

1 Institute of Computer Science, University of Wroclaw, Przesmyckiego 20, Wroclaw 51-151 Pl, e-mail: {aba,zdziarek}@ii.uni.wroc.pl
2 Remote Sensing Laboratory, Geology Dept., University of Athens, Panepistimiopolis Zografou, Athens 15-784, Greece, e-mail: evelpidou@geol.uoa.gr

Abstract: When analyzing the erosion risk of Kefallinia, Greece, we faced the problem of how to choose representatives for a large data set. In this paper we compare three methods: (1) Kohonen's self-organizing map (SOM), (2) neural gas (NG), and (3) a mixture model (MM) of Gaussian distributions. The representativeness of the derived prototype vectors is measured by the quantization error, as defined by Kohonen (1995). It appears that neural gas and mixture models quite consistently outperform the SOM method in providing better representatives. To gain better insight into the results, we map the obtained prototype vectors onto planes obtained by neuroscale mapping, which seems to be a convenient alternative to Sammon's mapping. The SOM codebook vectors are visualized in the same planes and linked by threads. This is shown for the Kefallinia erosion data from Greece.

Key words: Self-organizing maps, Neural gas, Mixture models, Neuroscale, Thread plotting, Kefallinia Island

1. INTRODUCTION

Nowadays we obtain very large data sets containing thousands of data vectors. A proper statistical analysis of such data may run into problems not only with computing time but also with the accuracy of the calculations, owing to rounding errors. However, in many cases it is not absolutely necessary to use all the data in the analysis; very often a representative sub-sample is sufficient. How should one choose a representative sample from a huge data set?
The most popular methods serving this purpose appear under the watchword "vector quantization". The methods work as follows: The entire data space is