Fig. 1: The inverse relationships of D and V in the data matrix.
Balanced Layouts Using the Composite Data-Variable Matrix
Shenghui Cheng, Bing Wang, Zhiyuan Zhang, Klaus Mueller
Visual Analytics and Imaging Lab, Computer Science Department, Stony Brook University, NY, USA and SUNY Korea, Songdo, Korea
ABSTRACT
Numerous methods have been described for visualizing the data-variable
matrix. All share a common problem: they visualize the data points and
the variable points separately, which makes it hard to grasp the
relations among data and variables together. We describe a method that
produces balanced layouts of data and variables. We achieve this by
combining two distance matrices that are typically used in isolation:
the distance matrix encoding the similarities of the variables and the
distance matrix encoding the similarities of the data points. The
remaining two submatrices are obtained by creating a fused distance
matrix, one that measures the distance of data points with respect to
the variables or vice versa. We then use MDS to simultaneously optimize
the placement of data points and variable points, producing a single
display in which users can appreciate all three types of relationships:
(1) the patterns of the collection of data items, (2) the patterns of
the collection of variables, and (3) the relationships of data items
with the variables and vice versa.
1 INTRODUCTION
The data matrix is a common representation of high-dimensional
datasets. Let N be the number of samples (or data points) drawn from a
given population and let D be the number of attributes (or variables)
measured per sample – we then obtain an N×D data matrix. In this
data matrix, the samples and attributes can change roles. For exam-
ple, for a data matrix storing the results of a DNA microarray exper-
iment for multiple specimens, one research objective might consider
the genes expressed in the microarray to be the samples and the spec-
imens to be the attributes, or vice versa. Switching from one objec-
tive to the other formally requires a transposition of the data matrix.
Numerous methods have been described for visualizing the data matrix.
A common strategy is to embed the high-dimensional space onto a 2D
canvas via a suitable optimization. In low-dimensional embeddings such
as multi-dimensional scaling (MDS) [1], linear discriminant analysis
(LDA), and others, the attributes are completely suppressed and only
clusters of samples can be visually appreciated.
While changing the roles of samples and attributes is easy – it re-
quires a simple transpose of the data matrix – the unequal treatment
of attributes and samples represents a significant problem. It makes it
difficult to observe patterns formed by attributes and samples simul-
taneously, and it also makes it difficult to see the samples in the
proper context of the attributes. The method we propose provides a
comprehensive display that conveys both. It uses MDS to simultaneously
optimize the placement of samples and attributes.
2 THE COMPOSITE DISTANCE MATRIX
Let $X$ be the data matrix with $m$ rows and $n$ columns, where the
rows denote the data points, the columns denote the variables, and
$x_{ij}$ is the data value in the $i$th row and $j$th column. Without
loss of generality, we assume $X$ is normalized to [0, 1]. Now let D
be the data space with m data points:
$$D = \{d_1, d_2, \ldots, d_m\}^T, \quad d_i = (x_{i1}, x_{i2}, \ldots, x_{in})$$
Let V be the variable space with n variables:
$$V = \{v_1, v_2, \ldots, v_n\}, \quad v_j = (x_{1j}, x_{2j}, \ldots, x_{mj})^T$$
where T is the transpose function. Thus, we can look at $X$ in two
ways. We can map it into variable space V, in which D represents the
points, or we can map it into data space D, where V represents the
points. An illustration of the inverse relationship is provided in Fig. 1.
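The two readings of the data matrix can be illustrated with a small sketch. It assumes a hypothetical 4×3 matrix `X` and per-column min-max scaling as the normalization to [0, 1]; the text does not say which normalization is used, so `min_max_normalize` is an illustrative helper only.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (variable) of X to [0, 1]; constant columns map to 0.

    Per-column min-max scaling is an assumption, not the paper's stated choice.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (X - lo) / span

# Hypothetical data matrix: 4 data points (rows), 3 variables (columns).
X = min_max_normalize([[1.0, 20.0, 5.0],
                       [2.0, 10.0, 5.0],
                       [3.0, 40.0, 7.0],
                       [5.0, 30.0, 9.0]])

data_points = X    # row i is data point i, a point in variable space V
variables = X.T    # row j is variable j, a point in data space D

print(data_points.shape)  # (4, 3)
print(variables.shape)    # (3, 4)
```

Switching between the two views is just the transpose, which is why neither view is privileged by the matrix itself.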
2.1 Extending the Data Matrix
As mentioned, current visualization methods tend to look at the two
spaces – data space and variable space – in an imbalanced fashion.
The usual recourse is to visualize either the data matrix or its transpose
with the algorithm at hand, which favors the fidelity of one
space at the cost of the other. But it can often be beneficial to see
both spaces at the same time and do so in a balanced way where all
relationships – data to data, data to variables, and variables to varia-
bles – are conveyed at equal fidelity.
Relationships in a data matrix can be made explicit by transforming
it into a distance matrix. The notion of distance (also often called
dissimilarity) can take many forms: Euclidean distance, cosine
similarity, correlation, etc. But in all cases, the matrix stores the
pairwise distances between vectors of one kind only, either the
variables (V) or the data points (D), but not both. So only one type
of relation gets expressed, V or D.
Our solution is to create a distance matrix in which both types of
relations are equally expressed. We call this matrix the composite
distance matrix and the space the composite space (see Fig. 2). In
this composite space, both data and variables can be located at the
same time. The composite space is symmetrical since data and varia-
bles are complementary.
2.2 Creating the Composite Distance Matrix
We can derive the composite distance matrix as the block matrix:
$$\begin{bmatrix} DD & DV \\ VD & VV \end{bmatrix}$$
Here, DD stores the pairwise dissimilarities of the data points, VD
and DV store the pairwise dissimilarities of the variables with the
data points, and VV stores the pairwise dissimilarities of the variables.
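A minimal sketch of assembling such a block matrix and embedding it is given below, using the labels DD, DV, VD, and VV for the four blocks. Several choices are assumptions made only for illustration: the data-to-variable block is taken as `1 - X` (a large value means the point is close to that variable), the variable block as one minus correlation, and classical (Torgerson) MDS is swapped in because the text says only "MDS". The names `composite_distance_matrix` and `classical_mds` are hypothetical.

```python
import numpy as np

def composite_distance_matrix(DD, DV, VV):
    """Assemble the (m+n) x (m+n) block matrix [[DD, DV], [VD, VV]], VD = DV^T."""
    return np.block([[DD, DV], [DV.T, VV]])

def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix."""
    k = dist.shape[0]
    J = np.eye(k) - np.ones((k, k)) / k       # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J            # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]    # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against negatives
    return eigvecs[:, top] * scale

# Hypothetical data: m = 8 data points, n = 3 variables, values in [0, 1).
rng = np.random.default_rng(0)
X = rng.random((8, 3))

# Block definitions below are assumptions, not taken from the text.
diff = X[:, None, :] - X[None, :, :]
DD = np.sqrt((diff ** 2).sum(axis=2))   # Euclidean distance between data points
VV = 1.0 - np.corrcoef(X.T)             # 1 - correlation between variables
DV = 1.0 - X                            # assumed data-to-variable dissimilarity

C = composite_distance_matrix(DD, DV, VV)
coords = classical_mds(C)               # one 2D position per row of C
data_xy, var_xy = coords[:8], coords[8:]
print(C.shape, data_xy.shape, var_xy.shape)
```

The blocks here are on different scales (Euclidean distances can exceed 1, while the other two blocks lie in [0, 2] and [0, 1]); a practical implementation would likely rescale them to comparable ranges before the embedding, which this sketch does not do.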
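As mentioned, there are various measures suitable to express distance
or dissimilarity. However, some of these measures have opposite
meanings: a distance grows as two items become less alike, whereas a
similarity such as cosine similarity or correlation grows as they
become more alike, so similarities are inverted before use. Let F be
the dissimilarity function, where
F = Euclidean Distance || 1 - Cosine Similarity || 1 - Correlation || …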
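The three options named for F can be written as small functions. This is a sketch under the usual textbook definitions; the function names and the dictionary `F` are illustrative and do not come from the text.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def cosine_dissimilarity(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def correlation_dissimilarity(a, b):
    """1 - Pearson correlation; assumes neither vector is constant."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.corrcoef(a, b)[0, 1])

# Illustrative lookup table of the options listed for F.
F = {
    "euclidean": euclidean_distance,
    "cosine": cosine_dissimilarity,
    "correlation": correlation_dissimilarity,
}

a, b, c = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]
for name, f in F.items():
    print(name, round(f(a, b), 6), round(f(a, c), 6))
```

Note that `a` and `b` point in the same direction, so the cosine and correlation variants treat them as identical while the Euclidean distance does not; the choice of F therefore changes which items count as close.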
2.2.1 The Data to Data Distance Matrix (DD)
The data points are vectors of equal length. The dissimilarity can be
obtained using any of the functions in F. Then the DD matrix is an
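m×m matrix with elements:
$$DD_{ij} = F(d_i, d_j), \quad i, j = 1, \ldots, m,$$
where $d_i$ and $d_j$ denote the $i$th and $j$th data points, i.e. rows
of the data matrix.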
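As a sketch of how the DD matrix could be filled in, the following applies a dissimilarity function to every pair of rows. The helper name `data_to_data_distances` and the tiny example matrix are hypothetical, and Euclidean distance is used only as one possible choice of F.

```python
import numpy as np

def data_to_data_distances(X, F):
    """Return the m x m matrix of pairwise dissimilarities between rows of X."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    DD = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            # F is symmetric, so each pair is computed once and mirrored.
            DD[i, j] = DD[j, i] = F(X[i], X[j])
    return DD

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Hypothetical example: 3 data points with 2 variables each.
X = np.array([[0.0, 0.0],
              [0.3, 0.4],
              [1.0, 1.0]])
DD = data_to_data_distances(X, euclidean)
print(DD)
```

The result is symmetric with a zero diagonal, as expected of a distance matrix.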
To demonstrate our method in a controlled experiment, we generated a
test dataset consisting of six 6-D Gaussian distributions.
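A test dataset of this kind could be generated along the following lines. Only the number of distributions (six) and the dimensionality (6-D) come from the text; the cluster centers, the spread, the number of points per cluster, and the helper name `make_gaussian_clusters` are all assumptions chosen for illustration.

```python
import numpy as np

def make_gaussian_clusters(n_clusters=6, dims=6, points_per_cluster=50,
                           spread=0.05, seed=0):
    """Sample points from isotropic Gaussians with random centers in [0, 1)^dims.

    Centers, spread, and sample counts are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    centers = rng.random((n_clusters, dims))
    samples = [rng.normal(loc=c, scale=spread, size=(points_per_cluster, dims))
               for c in centers]
    X = np.vstack(samples)
    labels = np.repeat(np.arange(n_clusters), points_per_cluster)
    return X, labels

X, labels = make_gaussian_clusters()
print(X.shape)  # (300, 6)
```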
IEEE Symposium on Visual Analytics Science and Technology 2014
November 9-14, Paris, France
978-1-4799-6227-3/14/$31.00 ©2014 IEEE