0018-9162/99/$10.00 © 1999 IEEE August 1999 51
Interactive Data Analysis: The Control Project
Data analysis is fundamentally an iterative
process in which you issue a query, receive a
response, formulate the next query based on
the response, and repeat. You usually don’t
issue a single, perfectly chosen query and get
the information you want from a database; indeed,
the purpose of data analysis is to extract unknown
information, and in most situations there is no one
perfect query.¹
People naturally start by asking broad,
big-picture questions and then continually refine their
questions based on feedback and domain knowledge.²
Consider repeating this process several times over,
sifting through many more results, and you have an
idea of why using advanced data analysis tools is so
complex. Composing Structured Query Language
(SQL) queries for decision-support database management
systems (DBMSs) isn’t easy, and even users of
graphical query tools find it difficult to generate
insightful queries.
Although data-mining systems typically don’t provide
complicated query languages, to use these systems
you need to choose a suitable mining algorithm and
carefully tune various algorithm-specific parameters
such as support and confidence for association rule
mining, thresholds for clustering, training sets for
classification, and so on. These usability problems increase
the number of iterations in the analysis process; you
have to try algorithms with different parameters until
you find one that produces useful results. In addition,
many of these tools require complicated, time-consuming
setup phases before they can be used at all.
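To make the tuning problem concrete, here is a minimal sketch of how support and confidence are computed for a single association rule. The transaction data, the `rule_stats` helper, and the threshold values are all hypothetical and chosen only for illustration; real mining tools compute these statistics over many candidate rules at once.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    # Transactions containing every item on both sides of the rule.
    n_both = sum(1 for t in transactions if antecedent | consequent <= t)
    # Transactions containing the antecedent alone.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical toy data: four market-basket transactions.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

support, confidence = rule_stats(transactions, {"bread"}, {"milk"})

# The same rule is kept or discarded depending on the thresholds chosen,
# which is why users end up rerunning the algorithm with new parameters.
kept_loose = support >= 0.4 and confidence >= 0.6
kept_strict = support >= 0.4 and confidence >= 0.7
```

Here the rule "bread implies milk" holds in two of four transactions and in two of the three that contain bread, so a small change in the confidence threshold flips whether it is reported.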
Most research in the areas of decision support, data
visualization, statistics, data mining, and knowledge
discovery has concentrated on improving a single
iteration of the analysis process. Some work has focused
on improving the quality of a particular analysis result
or on reducing the time it takes for each analysis step
or algorithm to provide a complete response.
These fields have progressed greatly, but this
research focus ignores a basic invariant in computing:
Full-scale data analysis will always be slow. As Greg
Papadopoulos, chief technology officer at Sun, points
out, the appetite for data collection, storage, and
analysis is outstripping Moore’s law, meaning that the
time required to analyze massive data sets is steadily
growing. To date, the result is a worst-case mode of
human-computer interaction: Data analysis is a complex
process involving multiple, time-consuming steps,
and a poor or erroneous choice of inputs is not noticeable
until results return at the end of a given step. The
long delay and absolute lack of control during individual
analysis steps disrupt the user’s concentration
and hamper the data analysis process. This situation is
reminiscent of Herodotus’ lament: “Of all men’s miseries,
the bitterest is this: to know so much and have
control over nothing.”
In the Control (Continuous Output and Navigation
Technology with Refinement Online) project at
Berkeley, we are working with collaborators at IBM,
Informix, and elsewhere to explore ways to improve
human-computer interaction during data analysis. The
Control project’s goal is to develop interactive, intuitive
techniques for analyzing massive data sets. We
focus on systems that iteratively refine answers to
queries and give users online control of processing,
thereby tightening the data analysis process loop. You
can use our techniques in diverse software contexts
including decision support database systems, data
visualization, data mining, and user interface toolkits.
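As a rough illustration of what iteratively refining an answer under online user control can look like, the sketch below streams a running average with an error bar and lets the caller stop as soon as the estimate is precise enough. It is not the Control project's implementation: the `online_mean` function, the synthetic data, the stopping rule, and the normal-approximation interval are all assumptions made for this example, and the interval is only meaningful if rows arrive in random order.

```python
import math
import random


def online_mean(values, report_every=1000, z=1.96):
    """Yield (rows_seen, running_mean, half_width) as rows stream in.

    Assumes rows arrive in random order, so the rows seen so far act as a
    random sample; half_width is a normal-approximation ~95% interval.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford update: sum of squared deviations
        if n % report_every == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(100_000)]  # hypothetical column

estimates = []
for rows_seen, estimate, half_width in online_mean(data):
    estimates.append((rows_seen, estimate, half_width))
    if half_width < 1.0:  # the "user" decides the answer is good enough
        break
```

In contrast to a batch query, the loop produces a usable estimate after a small fraction of the rows and hands the decision to stop back to the person watching the results.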
BATCH VERSUS ONLINE PROCESSING
Traditional analysis tools have a black-box interface:
The user issues queries, the system processes silently for
a significant period, and then the system returns an exact
answer. Because of the long processing times, this
interaction is reminiscent of the batch processing of the 1960s
Human insight is crucial for extracting meaning out of massive data sets,
yet current user interactions with databases don’t allow iterative, intuitive
analysis. The Control project looks at ways to give users quicker, more
direct interactivity with the data.
Joseph M. Hellerstein
Ron Avnur
Andy Chou
Christian Hidber
Chris Olston
Vijayshankar Raman
Tali Roth
University of California, Berkeley
Peter J. Haas
IBM Almaden Research Center
Cover Feature