International Journal of Computer Applications (0975 – 8887) Volume 77 – No.8, September 2013

SACK: Anonymization of Social Networks by Clustering of K-edge-connected Subgraphs

Fatemeh Heidari Soureshjani
Department of Computer Engineering, Payame Noor University, PO Box 19395-3697, Tehran, Iran

Arash Ghorbannia Delavar
Department of Computer Engineering, Payame Noor University, PO Box 19395-3697, Tehran, Iran

Fatemeh Rashidi
Department of Computer Engineering, Payame Noor University, PO Box 19395-3697, Tehran, Iran

ABSTRACT
In this paper, a method for anonymization of social networks by clustering of k-edge-connected subgraphs (SACK) is presented. Previous anonymization algorithms do not consider how nodes are distributed in the social network graph according to their attributes. SACK builds on the observation that the probability of an edge existing between two nodes is related to their attributes, which leads to a graph with connected subgraphs. By using these connected subgraphs in the anonymization process, the method obtains better experimental results in both data quality and execution time. Sequential clustering is used for anonymization, with the k-edge-connected subgraphs as the starting step. Sequential clustering is a greedy algorithm, so its results depend on the starting point.

Keywords
K-Anonymity, Social Networks, Privacy, Clustering, Information loss.

1. INTRODUCTION
With the ever-increasing spread of social networks, a huge amount of data is collected about individuals and their relationships. These data are valuable resources for researchers in different areas, including social psychology, sociology, statistics and market research [14]. However, they can also threaten the privacy of the individuals they describe, so the data must be anonymized before release. Social networks are relatively new data structures.
However, these privacy concerns have already been studied for traditional datasets, where the data may be simple relational tables containing sensitive information about individuals. Social networks have a more complex data structure, which contains structural data in addition to descriptive data. If a social network is considered as a graph, individuals are represented as its nodes. Each node has descriptive data such as age, gender, race, country and majority, and the edges represent the relationships between nodes. Most social network privacy methods are inspired by traditional ones such as k-anonymity, a widely used privacy model for data anonymization. K-anonymity was first presented in [1], [2] and is used in areas that analyze huge amounts of data to reveal hidden knowledge, such as data mining [15], [16], [17], [18]. This model uses the concept of quasi-identifiers, defined as a subset of attributes that can be linked with other data sets to re-identify sensitive private information about individuals. The idea is to generalize or suppress the values of quasi-identifiers so that each record in the data set (in this study, a node with its descriptive and structural information) cannot be distinguished from at least k-1 other records. In other words, every combination of quasi-identifier values that appears in the data set must occur at least k times.

To use the k-anonymity model for social networks, new definitions are needed to make it compatible with the graph data structure. Recently, several methods have been presented for k-anonymity of social networks, and they can be grouped by their main ideas. The first category contains methods that add, delete or switch edges of the graph to prevent adversaries from identifying individuals using their knowledge of the graph structure [4], [5], [6], [7], [8], [9].
These methods change the graph structure, so the released data differs from the original data. In the second category, the data keeps its original structure, but nodes are clustered and each cluster is then replaced with a super node that holds all the information of the nodes it contains, both structural and descriptive [9], [10], [11].

This study falls into the second category and focuses on anonymization of social networks by clustering. These privacy preservation techniques are relatively new and try to find a clustering of nodes that minimizes an information loss measure. They achieve better results through improvements to the clustering algorithm. In the clustering process, the present study considers how nodes are distributed in the social network graph according to their attributes, and focuses on the observation that the probability of an edge existing between two nodes is related to their attributes, such as age and country. This leads to a graph with connected components. For this purpose, the concept of k-edge-connected subgraphs is used, which defines how these connected components are used in the clustering process. This results in a better clustering of nodes, with the aim of decreasing information loss.

2. RELATED WORK
K-anonymity of social networks by clustering was first considered by Zheleva and Getoor [11]. They presented the problem of sensitive relationships in social networks and addressed it using the concept of link re-identification. They used a two-step anonymization method: first, the descriptive information is anonymized without regard to structural information; then the relationships are anonymized, for which they presented five approaches. One of them is cluster-edge anonymization, which anonymizes the network by clustering. The first anonymization algorithm to consider both descriptive and structural information at the same time was SaNGreeA, presented by Campan and Truta [9].
SaNGreeA starts clustering by selecting one node as the first cluster and keeps adding nodes to this cluster until its size reaches k; it then builds another cluster. At each step, the node added to the current cluster is the one that minimizes information loss.
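
The greedy procedure described above can be illustrated with a minimal Python sketch. It is not the implementation from [9]: the function names `greedy_k_clustering` and `age_range_loss` are hypothetical, the loss used here is a stand-in that measures only the width of the age interval a cluster would be generalized to (the structural component of information loss is omitted), the seed is assumed to be the first unassigned node, and the handling of fewer than k leftover nodes (attaching each to the cluster where it adds the least loss) is an assumption that the text above does not specify.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set

Node = Hashable


def age_range_loss(cluster: Set[Node], ages: Dict[Node, int]) -> int:
    """Stand-in information loss (assumption): width of the cluster's age interval."""
    values = [ages[n] for n in cluster]
    return max(values) - min(values)


def greedy_k_clustering(
    nodes: Sequence[Node],
    k: int,
    loss: Callable[[Set[Node]], float],
) -> List[Set[Node]]:
    """Greedily build clusters of size k, always adding the node with the lowest loss."""
    if k < 1 or k > len(nodes):
        raise ValueError("k must be between 1 and the number of nodes")

    remaining = list(nodes)
    clusters: List[Set[Node]] = []

    while len(remaining) >= k:
        # Seed a new cluster with the first unassigned node (assumption).
        cluster = {remaining.pop(0)}
        while len(cluster) < k:
            # Pick the unassigned node whose addition gives the smallest loss.
            best = min(remaining, key=lambda n: loss(cluster | {n}))
            remaining.remove(best)
            cluster.add(best)
        clusters.append(cluster)

    # Fewer than k nodes are left: attach each one to the existing cluster
    # where it causes the smallest loss (assumption, not stated in the text).
    for n in remaining:
        target = min(clusters, key=lambda c: loss(c | {n}))
        target.add(n)

    return clusters


# Toy data: node labels and ages are invented for illustration only.
ages = {"a": 20, "b": 22, "c": 40, "d": 41, "e": 23}
clusters = greedy_k_clustering(list(ages), 2, lambda c: age_range_loss(c, ages))
print(clusters)
```

Because every choice is made locally, the clustering depends on which node seeds each cluster and on the order in which nodes are considered, which is the dependence on the starting point that the abstract attributes to greedy clustering.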