Principle of Learning Metrics for Exploratory Data Analysis

SAMUEL KASKI AND JANNE SINKKONEN 1
Neural Networks Research Centre
Helsinki University of Technology
P.O. Box 9800, FIN-02015 HUT, Finland
{samuel.kaski,janne.sinkkonen}@hut.fi

Abstract. Visualization and clustering of multivariate data are usually based on mutual distances between samples, measured by heuristic means such as the Euclidean distance between vectors of extracted features. Our recently developed methods remove this arbitrariness by learning to measure important differences. The effect is equivalent to changing the metric of the data space. Variation in the data is assumed to be important only to the extent that it causes variation in auxiliary data, which is available paired with the primary data. The learning of the metric is supervised by the auxiliary data, whereas the data analysis in the new metric is unsupervised. We review two approaches: a clustering algorithm, and a method based on an explicitly generated metric. Applications so far include exploratory analysis of texts, gene function, and bankruptcy. Connections between the two approaches are derived, leading to promising new approaches to the clustering problem.

Keywords: discriminative clustering, exploratory data analysis, Fisher information matrix, information metric, learning metrics

1 Introduction

Unsupervised learning methods search for statistical dependencies in data. They are often used as "discovery tools", or exploratory tools, to reveal statistical structure hidden in large data sets. It is sometimes even hoped that "natural properties" of the data could be discovered in a purely data-driven manner, i.e., without any prior hypotheses. Such discoveries are possible, but the findings are always constrained by the choice of the model family and the data set, including the variable selection or feature extraction.
In a broader data analysis or modeling framework, the model family specifies which kinds of dependencies are sought. Typically, the selection of the data variables and their preprocessing (feature extraction) specify which aspects of the data are deemed interesting or important. It might even be said that unsupervised learning is always supervised by these choices.

The choice and transformation of the data variables are important because most unsupervised methods depend heavily on the metric, i.e., the distance measure, of the input space. Such methods include at least clustering, probability density estimation, and most visualization methods.

The aim of the present work is to automatically learn metrics which measure distances along important or relevant directions, and to use the metrics in unsupervised learning. The laborious implicit supervision by manually tailored feature extraction will to a large extent be replaced by an automatically learned metric,

1 The authors contributed equally to the work.
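As a rough illustration of the idea, and not the paper's actual algorithm, the following sketch shows how a conditional model of auxiliary data can induce a local metric via the Fisher information matrix mentioned in the keywords: the squared local distance is d²(x, x+dx) = dxᵀ J(x) dx, where J(x) is the Fisher information of p(c|x) with respect to x. The two-class logistic model and its weight vector w here are hypothetical, chosen only to make one direction "important" to the auxiliary variable.

```python
import numpy as np

# Hypothetical auxiliary-data model: p(c=1|x) is a two-class logistic
# model with an assumed weight vector w. Directions aligned with w
# change p(c|x) strongly and are therefore "important".
w = np.array([2.0, 0.5])  # illustrative weights, not from the paper

def p_c1(x):
    """Conditional probability p(c=1|x) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def fisher_matrix(x):
    """Fisher information J(x) = sum_c p(c|x) g_c g_c^T,
    where g_c = grad_x log p(c|x)."""
    p = p_c1(x)
    g1 = (1.0 - p) * w   # gradient of log p(c=1|x)
    g0 = -p * w          # gradient of log p(c=0|x)
    return p * np.outer(g1, g1) + (1.0 - p) * np.outer(g0, g0)

def local_distance(x, dx):
    """Squared local distance dx^T J(x) dx in the learning metric."""
    return float(dx @ fisher_matrix(x) @ dx)

x = np.array([0.0, 0.0])
# A step along the first coordinate (large weight) changes p(c|x) more,
# so it is longer in the learning metric than an equal Euclidean step
# along the second coordinate.
d_important = local_distance(x, np.array([0.1, 0.0]))
d_irrelevant = local_distance(x, np.array([0.0, 0.1]))
print(d_important, d_irrelevant)
```

Equal Euclidean displacements thus receive different lengths depending on how much they affect the auxiliary distribution, which is the sense in which the metric measures "important" variation.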