Graph-Theoretic Scagnostics *

Leland Wilkinson †
SPSS Inc.
Northwestern University

Anushka Anand ‡
University of Illinois at Chicago

Robert Grossman §
University of Illinois at Chicago

ABSTRACT

We introduce Tukey and Tukey scagnostics and develop graph-theoretic methods for implementing their procedure on large datasets.

CR Categories: H.5.2 [User Interfaces]: Graphical User Interfaces—Visualization; I.3.6 [Computing Methodologies]: Computer Graphics—Methodology and Techniques

Keywords: visualization, statistical graphics

1 INTRODUCTION

Around 20 years ago, John and Paul Tukey developed an exploratory visualization method called scagnostics. While they briefly mentioned their invention in [42], the specifics of the method were never published. Paul Tukey did offer more detail at an IMA visualization workshop a few years later, but he did not include the talk in the workshop volume he and Andreas Buja edited [7].

Scagnostics was an ingenious idea. Jerome Friedman and Werner Stuetzle, in a paper assessing John Tukey’s lifetime contributions to visualization [13], say the following:

    Draftman’s views (scatterplot matrices) lose their effectiveness when the number of variables is large. Using a projection index similar to that in projection pursuit, the computer could find the most interesting scatterplots to be presented to the user. John had proposals for a wide variety of scagnostic indices to judge the usefulness of scatterplot displays. The widespread use of cognostics and scagnostics has not yet materialized in routine data analysis. These approaches are perhaps among the potentially most useful of John’s yet to be explored suggestions.

Scagnostics have yet to be explored by others, despite this encouragement. This may be due to the lack of published details. In any case, this paper summarizes the Tukeys’ idea and offers a new approach that we believe follows the spirit of their method.
Our approach is based on recent advances in graph-theoretic summaries of high-dimensional scattered point data. We believe our method improves the computational efficiency and extends the scope of the original idea.

We will begin with a brief summary of the Tukeys’ approach, based on the first author’s recollection of the IMA workshop and subsequent conversations with Paul Tukey. Then we will present our graph-theoretic measures for computing scagnostic indices. Finally, we will illustrate the performance of our methods on real data.

* John Hartigan, David Hoaglin, and Graham Wills provided valuable suggestions.
† e-mail: leland@spss.com
‡ e-mail: aanand2@uic.edu
§ e-mail: grossman@cs.uic.edu

2 TUKEY AND TUKEY SCAGNOSTICS

A scatterplot matrix, variously called a SPLOM, casement plot, or draftsman’s plot, is a (usually) symmetric matrix of pairwise scatterplots. An easy way to conceptualize a symmetric SPLOM is to think of a covariance matrix of p variables and imagine that each off-diagonal cell consists of a scatterplot of n cases rather than a scalar number representing a single covariance. This display was first published by John Hartigan [19] and was popularized by Tukey and his associates at Bell Laboratories [9].

Figure 1 shows a SPLOM of measurements on abalones using data from [27]. Off the diagonal are the pairwise scatterplots of nine variables. The variables are sex (indeterminate, male, female), shell length, shell diameter, shell height, whole weight, shucked weight, viscera weight, shell weight, and number of rings in shell.

Figure 1: Scatterplot matrix of Abalone measurements

As Friedman and Stuetzle noted, scatterplot matrices become unwieldy when there are many variables. First of all, the visual resolution of the display is limited when there are many cells. This defect can be ameliorated by pan and zoom controls. More critical, however, is the multiplicity problem in visual exploration.
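To make the scale of the multiplicity problem concrete, the following minimal Python sketch enumerates the distinct off-diagonal panels of a symmetric SPLOM. The variable identifiers are illustrative spellings of the nine abalone measurements named above, and the helper `splom_panels` is a hypothetical name introduced only for this example; it is not part of any scagnostics implementation.

```python
from itertools import combinations

# Illustrative identifiers for the nine abalone variables listed in the text.
variables = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]


def splom_panels(names):
    """Return the distinct unordered variable pairs of a symmetric SPLOM.

    Each pair corresponds to one off-diagonal scatterplot; the mirrored
    panel across the diagonal shows the same pair with the axes swapped.
    """
    return list(combinations(names, 2))


panels = splom_panels(variables)
p = len(variables)

# The number of distinct panels equals p(p - 1)/2.
print(p, len(panels), p * (p - 1) // 2)

# The count grows quadratically with the number of variables.
for n_vars in (9, 50, 100):
    print(n_vars, n_vars * (n_vars - 1) // 2)
```

For the nine abalone variables this gives 36 distinct panels; at 100 variables the same formula gives 4950.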
Looking for patterns in p(p - 1)/2 scatterplots is impractical when there are many variables. This problem prompted the Tukeys’ approach.

The Tukeys reduced an O(p^2) visual task to an O(k^2) visual task, where k is a small number of measures of the distribution of a 2D scatter of points. These measures included the area of the peeled convex hull [40] of the 2D point scatters, the perimeter length of this hull, the area of closed 2D kernel density isolevel contours [35], [33], the perimeter length of these contours, the convexity of these contours, a modality measure of the 2D kernel densities, a nonlinearity measure based on principal curves [21] fitted to the 2D scatterplots, and several others. By using these measures,