A Structure-Based Distance Metric for High-Dimensional Space Exploration with Multidimensional Scaling

Jenny Hyunjung Lee, Kevin T. McDonnell, Member, IEEE, Alla Zelenyuk, Dan Imre, and Klaus Mueller, Senior Member, IEEE

Abstract: Although the euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging intercluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multidimensional scaling (MDS) where one can often observe nonintuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our biscale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate euclidean distance.

Index Terms: Information visualization, multivariate visualization, clustering, high-dimensional data, visual analytics

1 INTRODUCTION

The recognition of relationships embedded in high-dimensional (multiattribute) data remains a challenging task, and visual analytics has been identified as a powerful means to aid humans in this mission. Visual analytics appeals to the intricate pattern recognition faculties of the human visual system which can recognize relationships with ease when presented in a suitable visual manifestation [4].
One such paradigm, especially useful for the visualization of high-dimensional data relationships on a 2D canvas amenable to human perception, is multidimensional scaling (MDS) [15], [24]. MDS seeks to visually group data objects so that similar objects are close to each other and dissimilar data objects are far away, as judged by some similarity metric. As such, MDS provides a good visual overview of the data.

However, when using these types of overview displays it is important to realize that relationships portrayed with MDS (or any other low-dimensional embedding technique) are still only approximations. There are numerous ways to embed high-dimensional data into 2D, and unless the high-dimensional space is trivial, some data relationships are always suppressed. While the protocol used to optimize the embedding certainly plays a significant role here, the similarity metric used to gauge the distance relationships plays another important part.

By far the most popular metric to guide 2D MDS (and other) layouts for the visualization of high-dimensional data is the euclidean distance. However, as the number of dimensions grows, the contribution of each coordinate to the euclidean distance rapidly decreases, and ultimately all high-dimensional data points have similar distances from one another [2]. As a consequence, a low-dimensional embedding computed from these distances is not robust to small distance perturbations. This and other peculiar phenomena associated with high-dimensional space are commonly referred to as the curse of dimensionality [2]. In fact, it is already at relatively low dimensionality, say 10, that the use of the euclidean distance as a means to gauge the spatial proximity of two distant points becomes questionable [3].

MDS is well suited to show proximity relationships in the data; however, any quantitative information on the data points is lost.
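
The distance concentration effect described above can be observed numerically. The following is a minimal sketch, not part of the original method: it assumes uniformly distributed random points in the unit hypercube and uses the relative contrast (d_max - d_min) / d_min as an illustrative measure of how distinguishable the nearest and farthest neighbors of a query point are. The function name relative_contrast and all parameter values are hypothetical choices made for this illustration.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for euclidean distances to one query point.

    Assumption: data and query are drawn uniformly from the unit hypercube.
    """
    data = rng.random((n_points, n_dims))         # n_points samples in [0, 1)^n_dims
    query = rng.random(n_dims)                    # a single random query point
    dists = np.linalg.norm(data - query, axis=1)  # euclidean distance to each sample
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)                    # fixed seed for repeatability
dims = [2, 10, 100, 1000]
contrast = {d: relative_contrast(1000, d, rng) for d in dims}

for d in dims:
    print(f"dims={d:5d}  relative contrast={contrast[d]:.3f}")
```

Under these assumptions the relative contrast shrinks as the dimensionality increases, which means the nearest and the farthest point become hard to tell apart by euclidean distance alone. Real data sets with cluster structure may behave differently from this uniform toy case.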
Since an MDS plot alone shows no attribute values, it is often used in conjunction with parallel coordinate (PC) plots [13], with which analysts can inspect the data at the attribute level. A PC plot is generated by erecting a set of parallel coordinate axes, one per attribute. Each data point then gives rise to a piecewise linear curve called a polyline, which is defined by connecting the corresponding attribute values on these parallel axes. We shall call the path of such a polyline its signature or structure. By looking at these plots, users visually aggregate the data by the patterns the polylines exhibit across the dimension axes. The usefulness of parallel coordinates for practical applications executed by mainstream users has been demonstrated by Siirtola et al. [21].

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 20, NO. 3, MARCH 2014, p. 351

J.H. Lee and K. Mueller are with the Visual Analytics and Imaging Laboratory, Department of Computer Science, Center for Visual Computing, Stony Brook University, Stony Brook, NY 11794-4400. E-mail: {hyunjlee, mueller}@cs.sunysb.edu.
K.T. McDonnell is with the Department of Mathematics and Computer Science, Dowling College, Idle Hour Blvd., Oakdale, NY 11769-1999. E-mail: mcdonnek@dowling.edu.
A. Zelenyuk is with the Pacific Northwest National Laboratory, 3335 Innovation Blvd., P.O. Box 999, MSIN K8-88, Richland, WA 99354. E-mail: alla.zelenyuk@pnnl.gov.
D. Imre is with Imre Consulting, 181 McIntosh Ct., Richland, WA 99352. E-mail: dimre2b@gmail.com.

Manuscript received 15 Oct. 2012; revised 20 Mar. 2012; accepted 17 June 2013; published online 11 July 2013.
Recommended for acceptance by H. Pottmann.
For information on obtaining reprints of this article, please send e-mail to: tvcg@computer.org, and reference IEEECS Log Number TVCG-2012-10-0231.
Digital Object Identifier no. 10.1109/TVCG.2013.101.
1077-2626/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.