0018-9162/99/$10.00 © 1999 IEEE, August 1999, p. 51

Interactive Data Analysis: The Control Project

Data analysis is fundamentally an iterative process in which you issue a query, receive a response, formulate the next query based on the response, and repeat. You usually don’t issue a single, perfectly chosen query and get the information you want from a database; indeed, the purpose of data analysis is to extract unknown information, and in most situations there is no one perfect query.¹ People naturally start by asking broad, big-picture questions and then continually refine their questions based on feedback and domain knowledge.²

Consider repeating this process several times over, sifting through many more results, and you have an idea of why using advanced data analysis tools is so complex. Composing Structured Query Language (SQL) queries for decision-support database management systems (DBMSs) isn’t easy, and even users of graphical query tools find it difficult to generate insightful queries.

Although data-mining systems typically don’t provide complicated query languages, to use these systems you need to choose a suitable mining algorithm and carefully tune various algorithm-specific parameters such as support and confidence for association rule mining, thresholds for clustering, training sets for classification, and so on. These usability problems increase the number of iterations in the analysis process; you have to try algorithms with different parameters until you find one that produces useful results. In addition, many of these tools require complicated, time-consuming setup phases before they can be used at all.

Most research in the areas of decision support, data visualization, statistics, data mining, and knowledge discovery has concentrated on improving a single iteration of the analysis process.
Some work has focused on improving the quality of a particular analysis result or on reducing the time it takes for each analysis step or algorithm to provide a complete response. These fields have progressed greatly, but this research focus ignores a basic invariant in computing: Full-scale data analysis will always be slow. As Greg Papadopoulos, chief technology officer at Sun, points out, the appetite for data collection, storage, and analysis is outstripping Moore’s law, meaning that the time required to analyze massive data sets is steadily growing.

To date, the result is a worst-case mode of human-computer interaction: Data analysis is a complex process involving multiple, time-consuming steps, and a poor or erroneous choice of inputs is not noticeable until results return at the end of a given step. The long delay and absolute lack of control during individual analysis steps disrupt the user’s concentration and hamper the data analysis process. This situation is reminiscent of Herodotus’ lament: “Of all men’s miseries, the bitterest is this: to know so much and have control over nothing.”

In the Control (Continuous Output and Navigation Technology with Refinement Online) project at Berkeley, we are working with collaborators at IBM, Informix, and elsewhere to explore ways to improve human-computer interaction during data analysis. The Control project’s goal is to develop interactive, intuitive techniques for analyzing massive data sets. We focus on systems that iteratively refine answers to queries and give users online control of processing, thereby tightening the data analysis process loop. You can use our techniques in diverse software contexts including decision support database systems, data visualization, data mining, and user interface toolkits.
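As a rough illustration of what "iteratively refining an answer while the user keeps control" could look like, the Python sketch below computes a running average over rows visited in random order and reports an estimate with an approximate 95% interval at regular checkpoints. It is not the Control project's implementation. The names `running_average` and `demo`, the synthetic data, the normal-approximation interval, the reporting interval, and the stopping threshold are all assumptions made for this example.

```python
import math
import random
from typing import Iterable, Iterator, List, Tuple

Report = Tuple[int, float, float]  # (rows_seen, estimate, half_width)


def running_average(values: Iterable[float],
                    report_every: int = 1000) -> Iterator[Report]:
    """Yield a refined estimate of the mean every `report_every` rows.

    Assumes `values` arrive in random order, so each prefix behaves
    like a random sample of the whole data set.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            # Approximate 95% interval from the normal approximation.
            half_width = 1.96 * math.sqrt(variance / n)
            yield n, mean, half_width


def demo(seed: int = 0) -> List[Report]:
    """Consume estimates and stop as soon as they are 'good enough'."""
    rng = random.Random(seed)
    data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]  # synthetic rows
    rng.shuffle(data)
    history: List[Report] = []
    for report in running_average(data, report_every=5_000):
        history.append(report)
        if report[2] < 0.15:  # hypothetical user-chosen precision
            break             # the user halts processing early
    return history


if __name__ == "__main__":
    for rows, estimate, half_width in demo():
        print(f"{rows:>7} rows: {estimate:.3f} +/- {half_width:.3f}")
```

In this sketch the caller, standing in for the user, sees each intermediate estimate and decides when to stop, instead of waiting for a single exact answer after a full pass over the data.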
Cover Feature

Human insight is crucial for extracting meaning out of massive data sets, yet current user interactions with databases don’t allow iterative, intuitive analysis. The Control project looks at ways to give users quicker, more direct interactivity with the data.

Joseph M. Hellerstein, Ron Avnur, Andy Chou, Christian Hidber, Chris Olston, Vijayshankar Raman, Tali Roth
University of California, Berkeley

Peter J. Haas
IBM Almaden Research Center

BATCH VERSUS ONLINE PROCESSING

Traditional analysis tools have a black-box interface: The user issues queries, the system processes silently for a significant period, and then the system returns an exact answer. Because of the long processing times, this interaction is reminiscent of the batch processing of the 1960s