Find Distance Function, Hide Model Inference
Jingjing Liu∗, Tufts University
Eli T. Brown†, Tufts University
Remco Chang‡, Tufts University
ABSTRACT
Faced with a large, high-dimensional dataset, many turn to data
analysis approaches that they understand less well than the domain
of their data. An expert’s knowledge can be leveraged into many
types of analysis via a domain-specific distance function, but creat-
ing such a function is not intuitive to do by hand. We have created
a system that shows an initial visualization, adapts to user feed-
back, and produces a distance function as a result. Specifically, we
present a multidimensional scaling (MDS) visualization and an iter-
ative feedback mechanism for a user to affect the distance function
that informs the visualization without having to adjust the parameters
of the visualization directly. An encouraging experimental
result suggests that, with this tool, attributes containing uninformative
data are given low importance in the distance function.
1 INTRODUCTION
There are many powerful data visualization and analysis tech-
niques at the fingertips of anyone with data to understand. Many
techniques rely on a distance function. That is, these algorithms re-
quire a function that assigns a numeric distance to any two points in
the input data space. To build one requires more domain expertise
than an analysis specialist has, and more analysis expertise than the
domain expert has.
In this work, we present a platform that allows a user not only to
explore data visually, but to provide feedback that informs the un-
derlying visualization-generating model how to adapt its distance
function. The user does not have to manipulate the parameters of
the model directly, but rather isolate what data points are inconsis-
tent with her understanding of the domain, and fix them. The model
gets adjusted, resulting in a new visualization for her to iteratively
improve. This process allows the user to explore her data by testing
hypotheses about its structure. She ultimately ends up with a useful
product: a distance function which she can use for further work on
her data.
There are a variety of mathematical models for visualizing high-
dimensional data in two dimensions. For this poster, we chose
one dimension-reduction model, multidimensional scaling (MDS)
[1], which maps a high-dimensional dataset to a lower-dimensional space
by preserving pairwise distances between data points across the
high- and low-dimensional spaces. The function used to compute
those distances gets changed iteratively through the user’s interac-
tion with the visualization.
There are already tools that allow a user to modify the parameters
of a model generating a visualization, including work by one
of this paper’s authors [4] [2]. The drawback of these tools is that
they require the user to be an expert in the model used to generate
the visualization. In this work, because we compute the effect
on the distance function for the user, she does not need to know
about MDS or model inference to influence the results based on
her knowledge. Recent work by Endert et al. [3] created a visual
interaction framework similar to ours, allowing a user to interact
with a visualization to update MDS and sparing the user from having
to understand the mathematical details. Our work differs in
several aspects: 1) in addition to producing a visualization
of the dataset that fits the user’s mental image, we focus on
producing a distance function for the data domain that the user can
use to discover hidden patterns and gain deeper understanding in
further work; 2) our user-feedback adjustments are based on an
objective function that not only considers the latest changes, but also
tries to maintain the structure of the rest of the data; 3) the interactive
process can be continued iteratively until the visualization satisfies
the user, i.e., previous updates affect the final output.
∗ e-mail: jingjing.liu@tufts.edu
† e-mail: ebrown@cs.tufts.edu
‡ e-mail: remco@cs.tufts.edu
2 APPROACH AND METHOD
One step of the interactive process using the visual analytics tool
in this work is as follows: 1) The system provides a visualization based
on initial values of the model parameters. 2) The user observes the
visualization and provides input in a predefined format. 3) The system
adjusts the parameters of the model to reflect the user’s understanding
and regenerates the visualization based on the new parameter
values. 4) The user observes this visualization and decides whether to
keep the modification.
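The control flow of these four steps can be sketched as a short loop. This is only an illustrative skeleton, not the system's implementation: the function name `interactive_loop` and the four callables (`project`, `get_feedback`, `update_weights`, `accept`) are hypothetical placeholders for the visualization, the user interface, and the optimization, none of which are specified at this point in the text.

```python
def interactive_loop(initial_weights, project, get_feedback,
                     update_weights, accept, max_rounds=20):
    """Run the feedback loop and return the final weights and layout.

    All callables are hypothetical stand-ins:
      project(weights)            -> 2D layout
      get_feedback(layout)        -> user input, or None when satisfied
      update_weights(w, feedback) -> candidate weights
      accept(candidate_layout)    -> True to keep the change
    """
    weights = list(initial_weights)
    layout = project(weights)                  # step 1: initial visualization
    for _ in range(max_rounds):
        feedback = get_feedback(layout)        # step 2: user input
        if feedback is None:                   # user is satisfied
            break
        candidate = list(update_weights(weights, feedback))  # step 3
        candidate_layout = project(candidate)
        if accept(candidate_layout):           # step 4: keep or discard
            weights, layout = candidate, candidate_layout
    return weights, layout
```

Because a rejected candidate leaves `weights` untouched, each accepted update builds on all earlier accepted ones.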
As shown in Figure 1, the process starts by receiving a high-dimensional
dataset as input, then iteratively updates the distance
function until the user is satisfied with the 2D projection.
Figure 1: Flow chart showing the interactive process (nodes: Input Data, Optimization, 2D Projection, User, User Input, Output Distance Function).
2.1 Producing the Visualization
Data visualizations for high-dimensional datasets are concerned
with data $X = \{x_1, x_2, \ldots, x_N\}$, with each instance $x_i$
given by an $M$-dimensional vector that specifies a value in each of
the $M$ features of the dataset. Alongside the data is a vector
representing the relative importance of each of the features in the
form of a weight vector $\Theta = [\theta_1, \theta_2, \ldots, \theta_M]$.
A simple, linear distance function $D(x, y \mid \Theta)$
takes $\Theta$ as parameters and computes a real number that quantifies
the dissimilarity between two data points $x$ and $y$. Classic
multidimensional scaling takes an input matrix giving dissimilarities
between all pairs of data points and maps it to a low-dimensional (in
this case two-dimensional) space, minimizing a stress function [1].
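To make the roles of the weight vector and the dissimilarity matrix concrete, here is a minimal NumPy sketch. It rests on two assumptions that the text does not confirm: the distance is taken to be a weighted Euclidean distance with one nonnegative weight per feature, and the projection uses the eigendecomposition form of classical (Torgerson) scaling rather than an iterative stress minimizer. The function names `weighted_distances` and `classical_mds` are hypothetical.

```python
import numpy as np

def weighted_distances(X, theta):
    """Pairwise distances under an ASSUMED weighted Euclidean form:
    D(x, y | theta) = sqrt(sum_m theta_m * (x_m - y_m)^2)."""
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    diff = X[:, None, :] - X[None, :, :]       # shape (N, N, M)
    return np.sqrt(np.sum(theta * diff ** 2, axis=-1))

def classical_mds(D, k=2):
    """Embed N points in k dimensions from an N x N distance matrix
    using double centering and the top-k eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale

# Three points whose third feature gets zero weight, so it is ignored.
X = np.array([[0.0, 0.0, 5.0], [3.0, 0.0, -2.0], [0.0, 4.0, 9.0]])
D = weighted_distances(X, theta=[1.0, 1.0, 0.0])
Y = classical_mds(D, k=2)                      # one 2D coordinate per point
```

Setting a feature's weight to zero removes its influence on every pairwise distance, which is the sense in which an uninformative attribute can be given low importance.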
IEEE Symposium on Visual Analytics Science and Technology
October 23 - 28, Providence, RI, USA
978-1-4673-0014-8/11/$26.00 ©2011 IEEE