Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 163–168
journal homepage: www.elsevier.com/locate/websem

Invited paper

Dynamic visualization of statistical learning in the context of high-dimensional textual data

Michael Greenacre a,∗, Trevor Hastie b
a Universitat Pompeu Fabra, Barcelona 08005, Spain
b Stanford University, Stanford, CA 94305-4065, USA

Article history: Received 29 June 2009; received in revised form 13 November 2009; accepted 25 March 2010; available online 11 June 2010

Keywords: Animation; Classification; High-dimensional data; Learning; Visualization; Text mining

Abstract

Our ability to record increasingly larger and more complex sets of data is accompanied by a decline in our capacity to interpret and understand these data in the fullest sense. Multivariate analysis partially assists us in our quest by reducing the dimensionality in optimal ways, but our view is stuck in two dimensions because of the planar nature of the graphical medium, be it the printed page or the computer screen. We are developing protocols and tools to add motion to scientific graphics so that high-dimensional data can be visualized dynamically. Using the freely available R language and modern methods of statistical learning and data mining, we construct animation sequences that take the viewer on a dynamic journey through the data. The idea is illustrated using a large data set of all the abstracts of the journal Vaccine in the years 2003–2006, according to their word frequencies and citation counts.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Just as we use the elements of language and style with due care in the verbal communication of our research, so we should be paying equal attention to the graphical medium through which we convey our visual ideas.
Graphics has its language and its aesthetic elements, such as shading, symbols, color and area. Up to now, the scientific literature has used static graphical images to depict numerical information. These inanimate figures are by construction two-dimensional and need visual scanning to comprehend the information that they convey. A moving graphic, on the other hand, while occupying the same space on a page, adds the dimension of time and guides the eye, as we shall demonstrate.

The ability already exists to supplement online scientific articles with video material; for a technical description, see [1]. An example in the health sciences is the Journal of Ultrasound in Medicine, where several articles include videos, for example the rotation of three-dimensional embryo and fetal scans in [2]. Our present interest is not in this type of video presentation of three-dimensional physical objects but rather in the dynamic display of multivariate numerical data. Such displays are non-existent in the present scientific literature, although several proposals have been made for animating statistical analyses to assist in model diagnosis and interpretation of results (see, for example, [3,4]).

∗ Corresponding author. Tel.: +34 93 5422551. E-mail addresses: michael@upf.es, michael.greenacre@gmail.com (M. Greenacre), hastie@stanford.edu (T. Hastie).

Drawing a parallel with motion art is particularly relevant to our approach. In his book Sight, Sound, Motion [5] on the use of video and animation, Zettl talks about the encoding process, where an idea is molded “so that it fits the medium’s technical as well as aesthetic production and reception requirements.” For Zettl, applied media aesthetics “places great importance on the influence of the medium on the message – the medium itself acts as an integral structural agent.”

To illustrate the idea, the dynamic graphic in Fig.
1 shows the modeled probability that an email is spam as a function of the proportion of a set of “spam words” that it contains (this set was identified previously as often occurring in spam emails). This relationship, which was established using logistic regression, also depends on the length of the email, but in a complex way. The model relating the three variables p (probability of spam), S (proportion of spam words) and L (length of email) is as follows:

\[
\log\left(\frac{p}{1-p}\right) = -3.37 - 20.0\,S - 0.0142\,L + 0.0000279\,L^{2} + 0.517\,SL - 0.00114\,SL^{2}.
\]

The effects of the variables S and L on the probability of spam are very difficult to interpret, because of the presence of length and
doi:10.1016/j.websem.2010.03.007