Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora
Christopher Collins*
University of Toronto
Fernanda B. Viégas and Martin Wattenberg†
IBM Research
ABSTRACT
Do court cases differ from place to place? What kind of picture do
we get by looking at a country’s collection of law cases? We intro-
duce Parallel Tag Clouds: a new way to visualize differences amongst
facets of very large metadata-rich text corpora. We have pointed Par-
allel Tag Clouds at a collection of over 600,000 US Circuit Court
decisions spanning a period of 50 years and have discovered regional
as well as linguistic differences between courts. The visualization
technique combines graphical elements from parallel coordinates and
traditional tag clouds to provide rich overviews of a document collec-
tion while acting as an entry point for exploration of individual texts.
We augment basic parallel tag clouds with a details-in-context dis-
play and an option to visualize changes over a second facet of the
data, such as time. We also address text mining challenges such as
selecting the best words to visualize, and how to do so in reasonable
time periods to maintain interactivity.
Keywords: Text visualization, corpus visualization, information re-
trieval, text mining, tag clouds.
1 INTRODUCTION
Academics spend entire careers deeply analyzing important texts,
such as classical literature, poetry, and political documents. The
study of the language of the law takes a similar ‘deep reading’ ap-
proach [29]. Deep knowledge of a domain helps experts understand
how one author’s word choice and grammatical constructs differ from
another’s, or how the themes in texts vary. While we may never replace
such careful expert analysis of texts, and we likely will never want to,
there are statistical tools that can provide overviews and insights into
large text corpora in relatively little time. This sort of ‘distant read-
ing’ on a large scale, advocated by Moretti [21], is the focus of this
work. Statistical tools alone are not sufficient for ‘distant reading’
analysis: methods to aid in the analysis and exploration of the results
of automated text processing are needed, and visualization is one ap-
proach that may help.
Of particular interest are corpora that are faceted — scholars often
try to understand how the contents differ across the facets. Facets can
be understood as orthogonal, non-exclusive categories that describe
multiple aspects of information sources. For example, how does the
language of Shakespeare’s comedies compare to his tragedies? With
rich data for faceted subdivision, we could also explore the same data
by length of the text, year of first performance, etc. Documents often
contain rich metadata that can be used to define facets: for example,
publication date, author name, or topic classification. Text features
useful for faceted navigation can also be automatically inferred dur-
ing text pre-processing, such as geographic locations extracted from
the text [5], or the emotional leaning of the content [9].
In the legal domain, a question often asked is whether different
court districts tend to hear different sorts of cases. This question
is of particular interest to legal scholars investigating ‘forum
shopping’ (the tendency to bring a case in a district considered to
have a higher likelihood of ruling favorably), and this was the initial
motivation for this investigation. Our research question, then, is
whether we can discover distinguishing differences in the cases heard
by different courts. We address this question through examination of
the written decisions of judges. The decisions of US Courts are
officially in the public domain, but only recently have high-quality
machine-readable bulk downloads been made freely available [19].
Providing tools to augment our understanding of the history and
regional variance of legal decision making is an important societal
goal as well as an interesting research challenge. Beyond our specific
case study in legal data, we are interested in broader issues such as
lowering the barriers to overview and analysis of large text corpora by
non-experts, and providing quick access to interesting documents within
text collections.
* e-mail: ccollins@cs.utoronto.ca
† e-mail: {viegasf,mwatten}@us.ibm.com
Our solution combines text mining to discover the distinguishing
terms for a facet, and a new visualization technique we call Paral-
lel Tag Clouds (PTCs) to display and interact with the results (see
Fig. 1). PTCs blend the visual techniques of parallel coordinate
plots [15] and tag clouds. Rich interaction and a coordinated doc-
ument browsing visualization allow PTCs to become an entry point
into deeper analysis. In the remainder of this paper we will describe
PTCs in comparison to existing methods of corpus visualization, the
interaction and coordinated views provided to support analytics, our
text mining and data parsing approach, and some example scenarios
of discovery within the legal corpus.
2 BACKGROUND
2.1 Exploring Text Corpora
For the purposes of our work, we define facets in a corpus as data
dimensions along which a data set can be subdivided. Facets have a
name, such as ‘year of publication’ and data values such as ‘1999’
which can be used to divide data items. Attention to faceted infor-
mation has generally been focused on designing search interfaces to
support navigation and filtering within large databases (e.g., [11]).
In faceted browsing and navigation, such as the familiar interfaces
of Amazon.com and eBay.com, information seekers can divide data
along a facet, select a value to isolate a data subset, then further divide
along another facet. For our purposes, we divide a document collec-
tion along a selected facet, and visualize how the aggregate contents
of the documents in each subset differ.
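
As a minimal sketch of this notion of a facet, the following Python example divides a toy document collection along one facet, aggregates word counts within each subset, and then divides one subset again along a second facet. The documents, the facet names `court` and `year`, and the helper functions `divide_by_facet` and `aggregate_word_counts` are all hypothetical and chosen only for illustration. Raw word counts stand in here for whatever text mining measure is actually used to compare subsets.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document carries facet values as metadata.
documents = [
    {"court": "First Circuit", "year": 1999, "text": "patent appeal denied"},
    {"court": "First Circuit", "year": 2001, "text": "maritime claim appeal"},
    {"court": "Ninth Circuit", "year": 1999, "text": "immigration appeal granted"},
    {"court": "Ninth Circuit", "year": 2003, "text": "immigration petition denied"},
]


def divide_by_facet(docs, facet):
    """Group documents by their value for the named facet."""
    subsets = defaultdict(list)
    for doc in docs:
        subsets[doc[facet]].append(doc)
    return dict(subsets)


def aggregate_word_counts(docs):
    """Count words over the combined text of a document subset."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["text"].lower().split())
    return counts


# Divide along one facet, then summarize the aggregate contents of each subset.
by_court = divide_by_facet(documents, "court")
counts_by_court = {
    court: aggregate_word_counts(docs) for court, docs in by_court.items()
}

# Faceted browsing: isolate one facet value, then divide along another facet.
ninth_by_year = divide_by_facet(by_court["Ninth Circuit"], "year")
```

Keeping facet values as plain metadata fields means any field can serve as the dividing dimension without restructuring the collection.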
While there are many interfaces for visualizing individual documents
and overviews of entire text corpora (e.g., [3, 10, 33, 35]),
there are relatively few attempts to provide overviews to differentiate
among facets within a corpus. Notable exceptions include compar-
ison tag clouds [13] for comparing two documents, and the radial,
space-filling visualization of [26] for comparing essays in a collection.
Neither of these comparative visualizations combines visualization with
appropriate text mining in a holistic analytic system; both instead use
simple word counts to illustrate differences among documents. The work
most related to PTCs is Themail [30], a system for
extracting significant words from email conversations using statisti-
cal measures and visualizing them using parallel columns of words
along a timeline. PTCs share Themail’s focus on discovering
differentiating words within subsets of a corpus, and likewise visualize
text along parallel columns of words. However, PTCs can also reveal the
significant absence, or underuse, of a word, as well as its significant
presence, or overuse. We also augment the Themail approach
with connections between related data subsets. PTCs are also visu-
IEEE Symposium on Visual Analytics Science and Technology
October 12 - 13, Atlantic City, New Jersey, USA
978-1-4244-5283-5/09/$25.00 ©2009 IEEE