Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 163–168
journal homepage: www.elsevier.com/locate/websem
Invited paper
Dynamic visualization of statistical learning in the context of high-dimensional
textual data
Michael Greenacre a,∗, Trevor Hastie b
a Universitat Pompeu Fabra, Barcelona 08005, Spain
b Stanford University, Stanford, CA 94305-4065, USA
article info
Article history:
Received 29 June 2009
Received in revised form 13 November 2009
Accepted 25 March 2010
Available online 11 June 2010
Keywords:
Animation
Classification
High-dimensional data
Learning
Visualization
Text mining
abstract
Our ability to record increasingly larger and more complex sets of data is accompanied by a decline in our
capacity to interpret and understand these data in the fullest sense. Multivariate analysis partially assists
us in our quest by reducing the dimensionality in optimal ways, but our view is stuck in two dimensions
because of the planar nature of the graphical medium, be it the printed page or the computer screen. We
are developing protocols and tools to add motion to scientific graphics so that high-dimensional data
can be visualized dynamically. Using the freely available R language and modern methods of statistical
learning and data mining, we construct animation sequences that take the viewer on a dynamic journey
through the data. The idea is illustrated using a large data set of all the abstracts of the journal Vaccine
in the years 2003–2006, according to their word frequencies and citation counts.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Just as we use the elements of language and style with due
care in the verbal communication of our research, so we should
be paying equal attention to the graphical medium through which
we convey our visual ideas. Graphics has its language and its aesthetic
elements, such as shading, symbols, color and area. Up to
now the scientific literature has used static graphical images to depict
numerical information. These inanimate figures are by construc-
tion two-dimensional and need visual scanning to comprehend the
information that they convey. A moving graphic, on the other hand,
while occupying the same space on a page, adds the dimension of
time and guides the eye, as we shall demonstrate.
The ability already exists to supplement online scientific articles
with video material—for a technical description, see [1]. An exam-
ple in the health sciences is in the Journal of Ultrasound in Medicine
where several articles include videos, for example the rotation of
three-dimensional embryo and fetal scans in [2]. Our present inter-
est is not in this type of video presentation of three-dimensional
physical objects but rather in the dynamic display of multivari-
ate numerical data—such displays are non-existent in the present
scientific literature, although several proposals have been made
for animating statistical analyses to assist in model diagnosis and
interpretation of results (see, for example [3,4]).
∗ Corresponding author. Tel.: +34 93 5422551.
E-mail addresses: michael@upf.es, michael.greenacre@gmail.com
(M. Greenacre), hastie@stanford.edu (T. Hastie).
Drawing a parallel with motion art is particularly relevant to our
approach. In his book Sight, Sound, Motion [5] on the use of video and
animation, Zettl talks about the encoding process, where an idea is
molded “so that it fits the medium’s technical as well as aesthetic
production and reception requirements.” For Zettl, applied media
aesthetics “places great importance on the influence of the medium
on the message—the medium itself acts as an integral structural
agent.”
To illustrate the idea, the dynamic graphic in Fig. 1 shows the
modeled probability that an email is spam
as a function of the proportion of a set of “spam words” that it
contains (this set was identified previously as often occurring in
spam emails). This relationship, which was established using logis-
tic regression, also depends on the length of the email, but in a
complex way. The model relating the three variables p (probability
of spam), S (proportion of spam words) and L (length of email) is as
follows:
$$\log\frac{p}{1-p} = -3.37 - 20.0S - 0.0142L + 0.0000279L^{2} + 0.517SL - 0.00114SL^{2}.$$
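To make the interplay of S and L in the fitted model above concrete, the following minimal Python sketch evaluates the model numerically. It is an illustration only, not the authors' code (their work uses R): the function names are hypothetical, the coefficients are copied from the equation above, and the email lengths used in the demonstration are arbitrary assumptions, since the units of L are not specified in this section. Because the log-odds are linear in S for any fixed L, the slope with respect to S is -20.0 + 0.517L - 0.00114L^2, which the sketch also computes.

```python
import math

# Coefficients copied from the fitted logistic model quoted in the text.
B0, B_S, B_L, B_LL, B_SL, B_SLL = -3.37, -20.0, -0.0142, 0.0000279, 0.517, -0.00114


def log_odds(s: float, length: float) -> float:
    """Return log(p / (1 - p)) for spam-word proportion s and email length."""
    return (B0 + B_S * s + B_L * length + B_LL * length ** 2
            + B_SL * s * length + B_SLL * s * length ** 2)


def spam_probability(s: float, length: float) -> float:
    """Invert the logit; the two branches avoid overflow in math.exp."""
    z = log_odds(s, length)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def slope_in_s(length: float) -> float:
    """Change in log-odds per unit increase in S at a fixed length."""
    return B_S + B_SL * length + B_SLL * length ** 2


if __name__ == "__main__":
    # Hypothetical lengths chosen only for illustration.
    for length in (0.0, 50.0, 100.0, 200.0):
        probs = [spam_probability(s, length) for s in (0.0, 0.1, 0.2)]
        print(length, round(slope_in_s(length), 3), [round(p, 4) for p in probs])
```

Printing the slope next to the probabilities for each length shows how the effect of S changes as L varies, which is the dependence that the dynamic graphic is intended to convey.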
The effects of the variables S and L on the probability of spam
are very difficult to interpret, because of the presence of length and
1570-8268/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.websem.2010.03.007