Find Distance Function, Hide Model Inference

Jingjing Liu, Tufts University
Eli T. Brown, Tufts University
Remco Chang, Tufts University

ABSTRACT

Faced with a large, high-dimensional dataset, many turn to data analysis approaches that they understand less well than the domain of their data. An expert’s knowledge can be leveraged into many types of analysis via a domain-specific distance function, but creating such a function is not intuitive to do by hand. We have created a system that shows an initial visualization, adapts to user feedback, and produces a distance function as a result. Specifically, we present a multidimensional scaling (MDS) visualization and an iterative feedback mechanism for a user to affect the distance function that informs the visualization without having to adjust the parameters of the visualization directly. An encouraging experimental result suggests that using this tool, data attributes with useless data are given low importance in the distance function.

1 INTRODUCTION

There are many powerful data visualization and analysis techniques at the fingertips of anyone with data to understand. Many techniques rely on a distance function. That is, these algorithms require a function that assigns a numeric distance to any two points in the input data space. To build one requires more domain expertise than an analysis specialist has, and more analysis expertise than the domain expert has.

In this work, we present a platform that allows a user not only to explore data visually, but to provide feedback that informs the underlying visualization-generating model how to adapt its distance function. The user does not have to manipulate the parameters of the model directly, but rather isolate what data points are inconsistent with her understanding of the domain, and fix them. The model gets adjusted, resulting in a new visualization for her to iteratively improve.
This process allows the user to explore her data by testing hypotheses about its structure. She ultimately ends up with a useful product: a distance function that she can use for further work on her data.

There are a variety of mathematical models for visualizing high-dimensional data in two dimensions. For this poster, we chose one dimension-reduction model, multidimensional scaling (MDS) [1], which maps a high-dimensional dataset to a lower-dimensional space by preserving pairwise distances between data points across the high- and low-dimensional spaces. The function used to compute those distances is changed iteratively through the user’s interaction with the visualization.

e-mail: jingjing.liu@tufts.edu
e-mail: ebrown@cs.tufts.edu
e-mail: remco@cs.tufts.edu

There are already tools that allow a user to modify the parameters of a model generating a visualization, including work by one of this paper’s authors [2, 4]. The drawback of these tools is that they require the user to be an expert in the model used to generate the visualization. In this work, because we compute the effect on the distance function for the user, she does not need to know about MDS or model inference to influence the results based on her knowledge. A recent work of Endert et al. [3] created a visual interaction framework similar to ours, allowing a user to interact with a visualization to update MDS and sparing the user from having to understand the mathematical technicalities.
Our work can be distinguished in several aspects: 1) in addition to producing a visualization of the dataset that fits the user’s mental image, we focus on producing a distance function for the data domain that the user can use in further work to discover hidden patterns and gain deeper understanding; 2) our user-feedback adjustments are based on an objective function that not only considers the latest changes, but also tries to maintain the structure of the rest of the data; 3) the interactive process can be continued iteratively until the visualization satisfies the user, i.e., previous updates affect the final output.

2 APPROACH AND METHOD

One step of the interactive process using the visual analytic tool in this work is as follows: 1) The system provides a visualization based on initial values of the model parameters. 2) The user observes the visualization and provides input in a predefined format. 3) The system adjusts the parameters of the model to reflect the user’s understanding and regenerates an updated visualization based on the new parameter values. 4) The user observes this visualization and decides whether to keep the modification.

As shown in Figure 1, the process starts by receiving a high-dimensional dataset as input, then iteratively updates the distance function until the user is satisfied with the 2D projection.

Figure 1: Flow chart showing the interactive process (boxes: Input Data, 2D Projection, User Input, Optimization, Output Distance Function).

2.1 Producing the Visualization

Data visualizations for high-dimensional datasets are concerned with data $X = \{x_1, x_2, \ldots, x_N\}$, with each instance $x_i$ given by an $M$-dimensional vector that specifies a value in each of the $M$ features of the dataset. Alongside the data is a vector representing the relative importance of each of the features in the form of a weight vector $\Theta = [\theta_1, \theta_2, \ldots, \theta_M]$.
A simple, linear distance function $D(x, y \mid \Theta)$ takes $\Theta$ as parameters and computes a real number that quantifies the dissimilarity between two data points $x$ and $y$. Classic multidimensional scaling takes an input matrix giving dissimilarities between all pairs of data points and maps it to a low-dimensional (in this case two-dimensional) space, minimizing a stress function [1].

289 IEEE Symposium on Visual Analytics Science and Technology October 23 - 28, Providence, RI, USA 978-1-4673-0014-8/11/$26.00 ©2011 IEEE
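The weighted distance function and the MDS projection described in Section 2.1 can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation. The text does not give the exact form of the distance function, so the sketch assumes a weighted sum of absolute per-feature differences. It also assumes uniform initial weights and uses the eigendecomposition form of classical MDS rather than an iterative stress minimizer. The function names `weighted_distance_matrix` and `classical_mds` are hypothetical.

```python
import numpy as np


def weighted_distance_matrix(X, theta):
    """Pairwise distances under an ASSUMED linear form:
    D(x, y | theta) = sum_k theta_k * |x_k - y_k|.
    X has shape (N, M); theta has shape (M,)."""
    # Absolute per-feature differences for every pair, shape (N, N, M).
    diff = np.abs(X[:, None, :] - X[None, :, :])
    # Weight each feature and sum over the feature axis, shape (N, N).
    return diff @ theta


def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS via eigendecomposition.
    D is a symmetric (N, N) dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues that arise for non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale


# Synthetic data for illustration only: N = 20 points, M = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
theta = np.full(5, 1.0 / 5)                  # assumed uniform initial weights

D = weighted_distance_matrix(X, theta)       # (20, 20) dissimilarities
Y = classical_mds(D)                         # (20, 2) projection to plot
```

In an interactive loop like the one outlined in Section 2, only `theta` would change between iterations. The data `X` stays fixed, and the projection is recomputed from the updated distance matrix.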