Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora

Christopher Collins*
University of Toronto

Fernanda B. Viégas and Martin Wattenberg
IBM Research

ABSTRACT

Do court cases differ from place to place? What kind of picture do we get by looking at a country’s collection of law cases? We introduce Parallel Tag Clouds: a new way to visualize differences amongst facets of very large metadata-rich text corpora. We have pointed Parallel Tag Clouds at a collection of over 600,000 US Circuit Court decisions spanning a period of 50 years and have discovered regional as well as linguistic differences between courts. The visualization technique combines graphical elements from parallel coordinates and traditional tag clouds to provide rich overviews of a document collection while acting as an entry point for exploration of individual texts. We augment basic parallel tag clouds with a details-in-context display and an option to visualize changes over a second facet of the data, such as time. We also address text mining challenges such as selecting the best words to visualize, and how to do so in reasonable time periods to maintain interactivity.

Keywords: Text visualization, corpus visualization, information retrieval, text mining, tag clouds.

1 INTRODUCTION

Academics spend entire careers deeply analyzing important texts, such as classical literature, poetry, and political documents. The study of the language of the law takes a similar ‘deep reading’ approach [29]. Deep knowledge of a domain helps experts understand how one author’s word choice and grammatical constructs differ from another, or how the themes in texts vary. While we may never replace such careful expert analysis of texts, and we likely will never want to, there are statistical tools that can provide overviews and insights into large text corpora in relatively little time. This sort of ‘distant reading’ on a large scale, advocated by Moretti [21], is the focus of this work.
Statistical tools alone are not sufficient for ‘distant reading’ analysis: methods to aid in the analysis and exploration of the results of automated text processing are needed, and visualization is one approach that may help.

Of particular interest are corpora that are faceted: scholars often try to understand how the contents differ across the facets. Facets can be understood as orthogonal, non-exclusive categories that describe multiple aspects of information sources. For example, how does the language of Shakespeare’s comedies compare to that of his tragedies? With rich data for faceted subdivision, we could also explore the same data by length of the text, year of first performance, etc. Documents often contain rich metadata that can be used to define facets: for example, publication date, author name, or topic classification. Text features useful for faceted navigation can also be automatically inferred during text pre-processing, such as geographic locations extracted from the text [5], or the emotional leaning of the content [9].

* e-mail: ccollins@cs.utoronto.ca; e-mail: {viegasf,mwatten}@us.ibm.com

In the legal domain, a question often asked is whether different court districts tend to hear different sorts of cases. This question is of particular interest to legal scholars investigating ‘forum shopping’ (the tendency to bring a case in a district considered to have a higher likelihood to rule favorably), and this was the initial motivation for this investigation. Our research question, then, is whether we can discover distinguishing differences in the cases heard by different courts. We address this question through examination of the written decisions of judges. The decisions of US Courts are officially in the public domain, but only recently have high-quality machine-readable bulk downloads been made freely available [19].
Providing tools to augment our understanding of the history and regional variance of legal decision making is an important societal goal as well as an interesting research challenge. Beyond our specific case study in legal data, we are interested in broader issues such as easing the barriers to overview and analysis of large text corpora by non-experts, and providing quick access to interesting documents within text collections.

Our solution combines text mining to discover the distinguishing terms for a facet, and a new visualization technique we call Parallel Tag Clouds (PTCs) to display and interact with the results (see Fig. 1). PTCs blend the visual techniques of parallel coordinate plots [15] and tag clouds. Rich interaction and a coordinated document browsing visualization allow PTCs to become an entry point into deeper analysis.

In the remainder of this paper we will describe PTCs in comparison to existing methods of corpus visualization, the interaction and coordinated views provided to support analytics, our text mining and data parsing approach, and some example scenarios of discovery within the legal corpus.

2 BACKGROUND

2.1 Exploring Text Corpora

For the purposes of our work, we define facets in a corpus as data dimensions along which a data set can be subdivided. Facets have a name, such as ‘year of publication’ and data values such as ‘1999’ which can be used to divide data items. Attention to faceted information has generally been focused on designing search interfaces to support navigation and filtering within large databases (e.g., [11]). In faceted browsing and navigation, such as the familiar interfaces of Amazon.com and Ebay.com, information seekers can divide data along a facet, select a value to isolate a data subset, then further divide along another facet. For our purposes, we divide a document collection along a selected facet, and visualize how the aggregate contents of the documents in each subset differ.
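
As a minimal illustration of the facet definition above (not the authors' implementation), the following Python sketch divides a toy collection along one facet, aggregates word counts per subset, and then subdivides one subset along a second facet. The document fields, court names, texts, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy documents; field names and contents are illustrative only.
documents = [
    {"court": "First Circuit", "year": 1999, "text": "the patent claim was denied"},
    {"court": "First Circuit", "year": 2001, "text": "the appeal was denied"},
    {"court": "Ninth Circuit", "year": 1999, "text": "the immigration appeal was granted"},
]


def split_by_facet(docs, facet):
    """Group documents by the value each one holds for the named facet."""
    subsets = defaultdict(list)
    for doc in docs:
        subsets[doc[facet]].append(doc)
    return dict(subsets)


def aggregate_word_counts(docs):
    """Count lower-cased, whitespace-separated tokens across a subset."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["text"].lower().split())
    return counts


# Divide the collection along the 'court' facet and aggregate each subset.
by_court = split_by_facet(documents, "court")
counts_by_court = {
    value: aggregate_word_counts(subset) for value, subset in by_court.items()
}

# Select one facet value, then divide that subset along a second facet.
first_circuit_by_year = split_by_facet(by_court["First Circuit"], "year")
```

Raw per-subset counts such as these are only a starting point; deciding which words actually distinguish one subset from the others is a separate text mining question.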
IEEE Symposium on Visual Analytics Science and Technology, October 12-13, Atlantic City, New Jersey, USA. 978-1-4244-5283-5/09/$25.00 ©2009 IEEE

While there are many interfaces for visualizing individual documents and overviews of entire text corpora (e.g., [3, 10, 33, 35]), there are relatively few attempts to provide overviews to differentiate among facets within a corpus. Notable exceptions include comparison tag clouds [13] for comparing two documents, and the radial, space-filling visualization of [26] for comparing essays in a collection. Neither of these comparative visualizations focuses on both visualization and appropriate text mining as a holistic analytic system; rather, they use simple word counts to illustrate differences among documents. The work most related to PTCs is Themail [30], a system for extracting significant words from email conversations using statistical measures and visualizing them using parallel columns of words along a timeline. The visualization approach of PTCs shares the focus on discovering differentiating words within subsets of a corpus, and visualizes text along parallel columns of words. However, PTCs can reveal significant absence, or underuse, of a word, as well as significant presence, or overuse. We augment the Themail approach with connections between related data subsets. PTCs are also visu-