AVEDA: Statistical Tests for Finding Interesting Visualisations

Katharina Tschumitschew 1 and Frank Klawonn 1,2

1 Department of Computer Science, University of Applied Sciences Braunschweig/Wolfenbuettel, Salzdahlumer Str. 46/48, D-38302 Wolfenbuettel, Germany
2 Helmholtz Centre for Infection Research, Department for Cell Biology, Inhoffenstr. 7, D-38124 Braunschweig, Germany

Abstract. Visualisation is usually one of the first steps in handling any data analysis problem. Visualisations are an intuitive way to discover inconsistencies, outliers, dependencies, interesting patterns and peculiarities in the data. However, due to modern computer technology, a vast number of visualisation techniques are available nowadays. Even if only simple scatterplots, plotting pairs of variables against each other, are considered, for high-dimensional data the number of scatterplots is too large to inspect each one visually. In this paper, we propose a system architecture called AVEDA (Automatic Visual Exploratory Data Analysis) which computes a large number of visualisations, selects those that might contain special patterns and shows only these interesting visualisations to the user. The filtering process for the visualisations is based on statistical tests and statistical measures.

1 Introduction

According to Tukey [1] “there is no excuse for failing to plot and look” when one wants to solve a data analysis problem. In the earlier days of data analysis, when computers were scarcely available and monitors were restricted to alphanumeric displays, data visualisation was carried out manually, producing visualisations like bar charts, histograms, box plots, stem-and-leaf diagrams or simple scatterplots. This meant that only small data sets could be treated in this way and one could focus on one or at most two variables at the same time.
Nowadays, computing power and graphical displays allow fast computation of visualisations even for larger and high-dimensional data sets. This progress in computer technology enabled the application of more sophisticated visualisation techniques like multidimensional scaling (MDS) (see for instance [2]) or principal component analysis (PCA) (see for instance [3]), which need more computational power. But the progress in computer technology also led to the development of a vast number of new visualisation techniques for data analysis and data mining [4].

J.D. Velásquez et al. (Eds.): KES 2009, Part I, LNAI 5711, pp. 236–243, 2009. © Springer-Verlag Berlin Heidelberg 2009

However, it is impossible to check all possible visualisations individually, for the following reasons.

– The number of different visualisations is too large, especially for high-dimensional data. Even if only scatterplots are considered, plotting pairs of variables against