A Visual Framework Invites Human into the Clustering Process

Keke Chen, Ling Liu
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
{kekechen, lingliu}@cc.gatech.edu

Abstract

Clustering is a technique commonly used in scientific research. The task of clustering inevitably involves human participation: clustering is not finished when the algorithm terminates, but only when the user has evaluated, understood and accepted the patterns. This defines a human-involved “clustering-analysis/evaluation” iteration. Instead of neglecting this human involvement, we provide a visual framework (VISTA) that retains the full power of algorithmic approaches (since their results can be visualized) and, in addition, allows the user to steer, monitor and refine the clustering process with domain knowledge. The visual rendering result also provides a precise pattern for fast post-processing.

Keywords: Scientific Data Clustering, Information Visualization, VISTA, Human Factor in Computing

1. Introduction

Clustering is a basic technique commonly used in data analysis tasks where little prior information (e.g. statistical models) is available about the data. In the past few decades, researchers have proposed hundreds of clustering algorithms. Most of this research has focused on the efficient and effective clustering of datasets with regular cluster distributions, in which clusters have spherical shapes and can be approximately represented by centroids and radii. Such algorithms perform poorly (they may produce high error rates) on skewed datasets, which have non-spherical regular or totally irregular cluster distributions. Some researchers have recognized this problem and have tried to represent cluster shapes as precisely as possible in the clustering process, for example with the representative-point based algorithm CURE [7] and the density-based algorithm DBSCAN [22]. CURE uses several representative points to describe the boundary of a cluster approximately, instead of using only one centroid.
This approach works for non-spherical regular shapes, such as elongated shapes. However, it still does not work very well for clusters of irregular shapes. In general, the number of points needed to represent a cluster increases with the complexity of its shape. Since the user may not know how irregular the cluster shape is, it is hard for her/him to know how many representative points are enough to describe the cluster boundary precisely. More generally, it is very difficult to tune the parameters of these algorithms to obtain a satisfactory result, for example the number of representative points in CURE, or MinPts and ε in DBSCAN.

It is well known that, given a dataset, more than one criterion may be used to partition it, depending on the domain constraints. There is an interesting “whale, elephant, and tuna fish classification” example in [21], which illustrates that the same dataset may need to be partitioned differently for different purposes. It is also recognized that automated algorithms lack the flexibility to let people perceive the cluster shapes and easily modify the clustering result.

Most often, the purpose of clustering is to give the user an initial understanding of the data, which means that clustering is not finished until the user has evaluated, understood and accepted the patterns or results. This defines a “clustering-analysis/evaluation” iteration. Rather than neglecting the user in this process, we think the user should be able to participate in the clustering process by providing domain knowledge and making better decisions based on her/his perception. Therefore, we provide a visual framework (VISTA) that retains the full power of algorithmic approaches (since their results can simply be visualized) and, in addition, allows the user to steer, monitor and refine the clustering process with any domain knowledge. The visual rendering result also provides a precise pattern for fast post-processing.
There are three main contributions in this paper.
• First, we provide a visual framework that retains the full power of algorithmic approaches and, in addition, allows the user to steer, monitor and refine the clustering process with domain knowledge.
• Second, we introduce a visual cluster rendering system, VISTA, which can visualize the result of any clustering algorithm and helps the user to understand and adjust the cluster distribution interactively.
• Third, we present a map-based cluster encoding technique (ClusterMap), which provides a relatively precise pattern for fast labelling or classification in the post-clustering phase.
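To make the idea of a map-based cluster encoding more concrete, the following is a minimal sketch, not the ClusterMap implementation itself. It assumes that each point has already been projected to 2D coordinates (as a visualization would provide) and that the map is a uniform grid whose cells carry cluster labels. The names `build_cluster_map` and `label_point`, the `cell_size` parameter, the majority-vote rule and the outlier label `-1` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def cell_of(point, cell_size):
    """Return the integer grid cell that contains a 2D point."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def build_cluster_map(points, labels, cell_size):
    """Assumed encoding: each occupied cell stores the majority
    cluster label of the labelled points that fall into it."""
    votes = defaultdict(Counter)
    for point, label in zip(points, labels):
        votes[cell_of(point, cell_size)][label] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


def label_point(cluster_map, point, cell_size, outlier=-1):
    """Label a new point by a single cell lookup; points that land in
    an unoccupied cell get the (hypothetical) outlier label."""
    return cluster_map.get(cell_of(point, cell_size), outlier)


# Toy data: two small groups of already-clustered 2D points.
points = [(0.1, 0.2), (0.4, 0.3), (0.3, 0.8), (2.1, 2.2), (2.6, 2.4)]
labels = [0, 0, 0, 1, 1]
cluster_map = build_cluster_map(points, labels, cell_size=1.0)

new_labels = [label_point(cluster_map, p, 1.0) for p in [(0.9, 0.5), (2.5, 2.9), (5.0, 5.0)]]
```

In this sketch, labelling a new point costs one dictionary lookup regardless of how irregular the cluster boundary is, which is the kind of property that makes a map-style encoding attractive for fast post-clustering labelling. A finer grid follows the cluster shape more closely but requires more labelled cells.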