CheckViz: Sanity Check and Topological Clues for Linear and Nonlinear Mappings

Sylvain Lespinats, Michaël Aupetit

Abstract

Multidimensional scaling is a must-have tool for visual data miners, projecting multidimensional data onto a two-dimensional plane. However, what we see is not necessarily what we think we see. In many cases, end-users do not take care to scale the projection space with respect to the multidimensional space. Moreover, with nonlinear mappings, scaling is not even possible. Yet, without scaling, any geometrical structures that appear make no more sense than those of a random map. Without scaling, we should not draw inferences from the display back to the multidimensional space. No clusters, no trends, no outliers: there is nothing to infer without first quantifying the mapping quality. Several methods to assess mapping quality have been devised. Here, we propose CheckViz, a new method belonging to the framework of Verity Visualization [WPL95]. We define a two-dimensional perceptually uniform colour coding which allows visualising tears and false neighbourhoods, the two elementary and complementary types of geometrical mapping distortions, directly on the map at the location where they occur. As the examples demonstrate, this visualisation method is essential to help users make sense of the mappings and to prevent over-interpretation. It could be applied to check other mappings as well.

Categories and subject descriptors: multidimensional data; nonlinear mapping; multidimensional scaling; evaluation; quality visualisation

1. Introduction

1.1. Multidimensional scaling

Mapping methods are generally designed to display data from a high-dimensional original space in a low-dimensional projection space. Such methods reduce the data dimensionality, and can be used to visualize the spatial organization of the dataset, as a preprocessing step to escape the curse of dimensionality [Don00, AHK01], or to embed data from a metric space into a Euclidean vector space. Mapping methods can also be used to unfold the dataset's underlying manifold. In what follows, "items" denotes the original data and "data points" denotes their images in the projection space.

Since Torgerson's "embedding theorem" [Tor52], distance preservation has been the objective of most mapping methods. Indeed, the goal of Principal Component Analysis (PCA) [Jol02, Pea01] is equivalent to finding the linear projection that best preserves Euclidean distances in terms of the mean-square-error criterion. PCA belongs to the family of linear projection techniques, along with Projection Pursuit [FT74] and the Grand Tour [Asi85]. Torgerson's Classical MDS constructs data points from a distance matrix. Both the PCA of a set of normalized vector items and the Classical MDS of the Euclidean distance matrix of these same items provide the same set of data points up to an isometry [Gow66].

Within that framework, many methods have been designed to give special weight to small distances, leading to nonlinear mappings. A large number of such methods exist, known as nonlinear multidimensional scaling (NL-MDS).

In this work, we focus on some typical NL-MDS methods which are prone to exhibit specific types of mapping distortions, providing examples varied enough to support our claims. Sammon's Non Linear Mapping (NLM) [Sam69] and Demartines and Hérault's Curvilinear Component Analysis (CCA) [DH97] minimize the difference between distances in the input and projection spaces, weighted by a decreasing function of the input or projection distances respectively. Lespinats et al.'s Data-Driven High-Dimensional Scaling (DD-HDS) [LVG*07] combines both the NLM and CCA weighting functions.
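The role of the weighting function can be illustrated with a small sketch. It is not the cost function of NLM, CCA or DD-HDS: the exponential weight, the `scale` parameter and the function names `pairwise_distances` and `weighted_stress` are assumptions chosen only to show how weighting by input distances or by projection distances changes which distortions are penalised.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def weighted_stress(d_in, d_out, weight_on="input", scale=1.0):
    """Sum of squared distance differences over all pairs, weighted by a
    decreasing function of either the input or the projection distances.

    The weight exp(-d / scale) is an illustrative choice (assumption),
    not the weighting used by any of the cited methods.
    """
    upper = np.triu_indices_from(d_in, k=1)  # count each pair once
    d_in, d_out = d_in[upper], d_out[upper]
    reference = d_in if weight_on == "input" else d_out
    weights = np.exp(-reference / scale)
    return float(np.sum(weights * (d_in - d_out) ** 2))


# Items in a 5-dimensional space, mapped by keeping the first two coordinates.
rng = np.random.default_rng(0)
items = rng.normal(size=(20, 5))
d_items = pairwise_distances(items)
d_points = pairwise_distances(items[:, :2])

stress_input_weighted = weighted_stress(d_items, d_points, weight_on="input")
stress_output_weighted = weighted_stress(d_items, d_points, weight_on="output")

# A single pair that is close in the input space but far apart on the map.
d_close = np.array([[0.0, 0.1], [0.1, 0.0]])
d_far = np.array([[0.0, 5.0], [5.0, 0.0]])
torn_input_weighted = weighted_stress(d_close, d_far, weight_on="input")
torn_output_weighted = weighted_stress(d_close, d_far, weight_on="output")
```

In this sketch, a pair that is close in the input space but far apart on the map is penalised heavily when the weight depends on input distances and only lightly when it depends on projection distances; the opposite holds for a pair that is far in the input space but close on the map.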
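The equivalence between PCA and Classical MDS stated earlier in this section can also be checked numerically. The sketch below assumes that "normalized" means centred items, and uses hypothetical helper names (`pca_scores`, `classical_mds`); since the two embeddings are only expected to agree up to an isometry, it compares their inter-point distances rather than their coordinates.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pca_scores(items, n_components=2):
    """Project centred items onto their leading principal axes (via SVD)."""
    centred = items - items.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def classical_mds(dist, n_components=2):
    """Classical MDS: double-centre the squared distances, then take the
    leading eigenvectors scaled by the square roots of their eigenvalues."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


rng = np.random.default_rng(1)
items = rng.normal(size=(30, 5))

points_pca = pca_scores(items)
points_mds = classical_mds(pairwise_distances(items))

# Coordinates may differ by a reflection of each axis, so compare distances.
same_up_to_isometry = np.allclose(
    pairwise_distances(points_pca), pairwise_distances(points_mds), atol=1e-6
)
```

Comparing distance matrices avoids having to align the two embeddings explicitly, at the cost of not exhibiting the isometry itself.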