A Graph-based Similarity Metric and Validity Indices for Clustering Non-numeric and Unstructured Data

Chuan Zhao and Krishnamoorthy Sivakumar

Abstract: We consider the problem of clustering an unstructured dataset consisting of non-numeric attributes. Examples of such datasets include emails, news articles, publication/citation indexes, movies, etc. In particular, we consider the clustering problem where the interrelationships are represented in a graphical format. We propose a node similarity metric to assess the similarity between nodes. We consider the K-medoid clustering method and introduce some postprocessing steps to improve the quality of clustering. A set of validity indices is proposed to assess the quality of the clustering results. To reduce computational complexity, a sampling strategy is introduced. The effect of sampling on the values of the validity indices and on the clustering result is discussed. Finally, the influence of different similarity metrics, postprocessing steps, and cluster number on the quality of clustering is discussed, both analytically and experimentally.

Index Terms: clustering, unstructured data, similarity metric, validity index

I. INTRODUCTION

Data mining and knowledge discovery in databases often require dividing data into groups where similar objects are assigned to the same group. This task is often carried out by clustering, which groups objects with high similarity into a single cluster and separates objects with low similarity into different clusters. Data that requires clustering is often unstructured, consisting of non-numeric attributes (e.g., movie data, email data, news articles). A movie has a variety of attributes, including director, actor, role, etc. An email has attributes such as sender, receiver, subject, date, etc. All these attributes are non-numeric (with the exception of date). Clustering is usually performed based on the distance between objects in the data set.
When the attributes are non-numeric, there is no obvious notion of distance between two objects. Instead, we measure the degree of similarity between objects and consider an efficient way to calculate this similarity value.

In this paper, we consider a graphical representation of objects and the relationships between them in a non-numeric dataset. In particular, we use a bipartite graph to represent the relationships between objects. The Relationship Generating Graph Analysis Engine (REGGAE) [1], developed by Applied Technical Systems (ATS), is used as our database and programming engine. We propose a similarity metric between objects based on the REGGAE graph structure. Computation of this metric is facilitated by REGGAE. We implement our clustering algorithm using different variations of the proposed similarity metric. Our graph representation scheme, similarity metric, and clustering method are generally applicable to a variety of datasets. For concreteness, we use a movie dataset (IMDB) [2] to describe our work.

This work was supported by Washington Technology Center and Applied Technical Systems (awards RTD08WSFMC2 and RTD09WSFMC2).
Chuan Zhao is with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164-2752, USA (phone: 509-332-1372; email: zhaoc@eecs.wsu.edu).
Krishnamoorthy Sivakumar is with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164-2752, USA (phone: 509-335-4969; fax: 509-335-3818; email: siva@eecs.wsu.edu).

We also propose metrics to evaluate the clustering results. We present two validity indices, Cohesion-Separation (CS) and Representativeness of Medoid (RoM), which are based on the cohesion and separation of clusters. Furthermore, we present another validity index, called Concentration of Similarity (CoS), which measures how well the nodes similar to a given node are concentrated in the same cluster.
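To make the bipartite representation mentioned earlier in this section more concrete, the following is a minimal sketch in Python. It assumes, purely for illustration, that objects (movies) sit on one side of the graph and attribute values on the other; the records, field names, and the helper functions `build_bipartite` and `jaccard` are hypothetical and are not part of REGGAE. The Jaccard overlap of neighbor sets is used only as a simple stand-in and is not the similarity metric proposed in this paper.

```python
from collections import defaultdict

# Hypothetical toy records; all titles, people, and field names are made up.
movies = {
    "movie_a": {"director": "dir_1", "actor": ["act_1", "act_2"]},
    "movie_b": {"director": "dir_1", "actor": ["act_2", "act_3"]},
    "movie_c": {"director": "dir_2", "actor": ["act_4"]},
}


def build_bipartite(records):
    """Build both adjacency maps of an object/attribute bipartite graph.

    Attribute nodes are (attribute_name, value) pairs, so a person who
    appears as a director and as an actor yields two distinct nodes.
    """
    obj_to_attr = defaultdict(set)
    attr_to_obj = defaultdict(set)
    for obj, attrs in records.items():
        for name, values in attrs.items():
            if isinstance(values, str):
                values = [values]  # treat a single value as a one-item list
            for value in values:
                node = (name, value)
                obj_to_attr[obj].add(node)
                attr_to_obj[node].add(obj)
    return obj_to_attr, attr_to_obj


def jaccard(obj_to_attr, a, b):
    """Stand-in similarity: shared attribute nodes over all attribute nodes."""
    na, nb = obj_to_attr[a], obj_to_attr[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


obj_to_attr, attr_to_obj = build_bipartite(movies)
print(jaccard(obj_to_attr, "movie_a", "movie_b"))
print(jaccard(obj_to_attr, "movie_a", "movie_c"))
```

Storing both adjacency directions keeps neighbor lookups cheap from either side of the graph, which is the kind of query a graph-based similarity computation tends to need.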
To reduce the computation involved in calculating the validity indices, we sample the nodes in each cluster and calculate the validity indices based on the sampled nodes. We compare different sampling methods based on their sampling errors and time complexities. Inspired by our experimental results, we present two theorems relating the validity index to the cluster number. Based on our experimental results, the influence of different similarity metrics, clustering postprocessing steps, and cluster number on the quality of clustering is discussed.

This paper is organized as follows. Section II presents the graph representation scheme. Section III describes the proposed similarity metric. Section IV presents our clustering algorithm, including the validity indices. Section V presents several theoretical results about the effect of the postprocessing step of the algorithm on the validity indices. Section VI summarizes and discusses our experimental results. Conclusions and directions for future work are presented in Section VII.

We conclude this section with a brief review of work related to node similarity and clustering validity indices. Blondel et al. [3] computed a similarity matrix over node pairs from the graph adjacency matrix, with applications to synonym extraction and web searching. Huang et al. [4] proposed a node structural metric that measures the similarity between nodes using the number of shared edges. They used this metric to cluster the graph and thereby simplify its representation. Lu et al. [5] explored definitions of similarity based on connectivity only, and proposed several algorithms for this purpose. Their metrics took advantage of the local neighborhoods of the nodes in the networked information space. Jeh [6] measured structural-context similarity using a bipartite similarity rank. The basic idea was that two objects are similar if they are related to similar objects.
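The recursive idea behind [6], that two objects are similar if they are related to similar objects, can be illustrated with a short fixed-point iteration on a bipartite graph. The sketch below is a simplified illustration under stated assumptions, not the algorithm of [6] as published and not the metric proposed in this paper: the function name `bipartite_simrank`, the decay constant 0.8, and the fixed iteration count are all choices made here for the example.

```python
from itertools import product


def bipartite_simrank(edges, decay=0.8, iterations=5):
    """Simplified SimRank-style similarity on a bipartite graph.

    edges: iterable of (object, attribute) pairs.
    Returns two dicts keyed by node pairs: object-object similarity and
    attribute-attribute similarity.
    """
    obj_nbrs, attr_nbrs = {}, {}
    for obj, attr in edges:
        obj_nbrs.setdefault(obj, set()).add(attr)
        attr_nbrs.setdefault(attr, set()).add(obj)

    def identity(nodes):
        # Start from "each node is similar only to itself".
        return {(x, y): 1.0 if x == y else 0.0
                for x, y in product(nodes, repeat=2)}

    def update(nbrs, other_sim):
        # Similarity of x and y is the decayed average similarity of their
        # neighbors, which live on the other side of the graph.
        new_sim = {}
        for x, y in product(nbrs, repeat=2):
            if x == y:
                new_sim[(x, y)] = 1.0
                continue
            total = sum(other_sim[(p, q)]
                        for p, q in product(nbrs[x], nbrs[y]))
            new_sim[(x, y)] = decay * total / (len(nbrs[x]) * len(nbrs[y]))
        return new_sim

    obj_sim, attr_sim = identity(obj_nbrs), identity(attr_nbrs)
    for _ in range(iterations):
        # Both sides are updated from the previous iteration's values.
        obj_sim, attr_sim = (update(obj_nbrs, attr_sim),
                             update(attr_nbrs, obj_sim))
    return obj_sim, attr_sim


# Hypothetical toy graph: m1 and m2 share attribute a2; m3 is disconnected.
edges = [("m1", "a1"), ("m1", "a2"), ("m2", "a2"), ("m2", "a3"), ("m3", "a4")]
obj_sim, attr_sim = bipartite_simrank(edges)
print(obj_sim[("m1", "m2")], obj_sim[("m1", "m3")])
```

In this toy graph, `m1` and `m2` receive a positive similarity because they share a neighbor, while `m3` stays at zero similarity to both because no path connects it to them. The all-pairs update is quadratic in the number of nodes on each side, which is one reason cheaper neighborhood-based metrics and sampling are attractive for large datasets.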