De-anonymizing Social Networks and Inferring Private Attributes Using Knowledge Graphs

Jianwei Qian*, Xiang-Yang Li‡†*, Chunhong Zhang§, Linlin Chen*

†School of Software, Tsinghua University
*Department of Computer Science, Illinois Institute of Technology
‡School of Computer Science and Technology, University of Science and Technology of China
§School of Information and Communication Engineering, Beijing University of Posts and Telecommunications

Abstract—Social network data is widely shared, transferred, and published for research purposes and business interests, but this practice has raised serious concerns about users' privacy. Even though users' identity information is always removed, attackers can still de-anonymize users with the help of auxiliary information. To protect against de-anonymization attacks, various privacy protection techniques for social networks have been proposed. However, most existing approaches assume a specific, restricted form of network structure as background knowledge and ignore the attacker's semantic-level prior beliefs; these assumptions are not always realistic in practice and do not apply to arbitrary privacy scenarios. Moreover, privacy inference attacks in the presence of semantic background knowledge have barely been investigated. To address these shortcomings, in this work we introduce knowledge graphs to explicitly express the attacker's arbitrary prior beliefs about any individual user. The processes of de-anonymization and privacy inference are then formulated based on knowledge graphs. Our experiments on real social network data show that knowledge graphs can strengthen de-anonymization and inference attacks, and thus increase the risk of privacy disclosure. This suggests the validity of knowledge graphs as a general and effective model of attackers' background knowledge for social network privacy preservation.

Index Terms—Social network data publishing, attack and privacy preservation, knowledge graph.
I. INTRODUCTION

Many online social networking sites like Facebook and Flickr generate massive amounts of data every day, including users' profiles, relations, and personal life details. Social network data can be released to third parties for various purposes, including targeted advertising, developing new applications, academic research, and public competitions [10], [27], [37], [43]. However, publishing social network data can also result in privacy leakage and has thus raised great concern among the public. Naively removing user IDs before publishing the data is far from enough to protect users' privacy [5], [17], [19], [28].

The privacy issue in network data publishing is attracting increasing attention from researchers and social network providers [5], [15], [18], [24]. Various privacy attack and protection techniques have been proposed, including k-anonymity [39] based techniques (e.g., k-degree anonymity [24]) and graph-mapping-based de-anonymization (e.g., [15]). Unfortunately, previous works have three main limitations. First, most of the prior attacks focus only on de-anonymization (also referred to as re-identification), that is, mapping a node in the network to a real person. How the attacker acquires and infers users' privacy after de-anonymization was barely discussed before. Second, previous works make specific assumptions about the attacker's prior knowledge (also referred to as background information; we use the two terms interchangeably hereafter). Some assume the attacker is weak and has only a specific type of information, such as node degrees [24]. Others assume that the attacker possesses a purely topological network (without user profiles) that overlaps with the published network data, such as [15]. The attacker is often assumed to be completely certain of her prior knowledge; in reality, however, some, if not much, of the attacker's knowledge is probabilistic.
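Probabilistic background knowledge of this kind can be represented as knowledge-graph triples weighted by the attacker's confidence. The following is a minimal, hypothetical sketch (the entities, predicates, and scoring rule are illustrative assumptions, not this paper's actual formulation):

```python
# Hypothetical sketch: an attacker's probabilistic prior knowledge as
# weighted knowledge-graph triples. Each triple (subject, predicate,
# object) carries the attacker's confidence in [0, 1].

prior_knowledge = {
    ("Alice", "worksAs", "doctor"): 0.9,   # attacker is 90% sure
    ("Alice", "livesIn", "Chicago"): 1.0,  # a fact she is certain of
    # A learned correlation between attributes, also stored as a triple:
    ("doctor", "implies", "high_salary"): 0.8,
}

def infer(subject, predicate_chain, kg):
    """Follow a chain of predicates through the graph, multiplying
    confidences, to score an inferred fact about `subject`.
    Returns 0.0 if no matching chain of triples exists."""
    node, confidence = subject, 1.0
    for predicate in predicate_chain:
        step = next(
            (t for t in kg if t[0] == node and t[1] == predicate),
            None,
        )
        if step is None:
            return 0.0
        confidence *= kg[step]
        node = step[2]  # move to the object of the matched triple
    return confidence

# The attacker chains occupation -> salary to infer a sensitive attribute:
belief = infer("Alice", ["worksAs", "implies"], prior_knowledge)
print(round(belief, 2))  # 0.9 * 0.8 = 0.72
```

Even this toy representation captures the two points above: prior beliefs need not be certain (confidence < 1), and attribute correlations can be stored and chained just like directly observed facts.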
Third, previous works typically do not consider the scenario in which the attacker exploits correlations among attributes to infer users' sensitive attributes. In fact, the attacker can not only read users' attributes directly from the published data, but also infer some attributes from others. Take salary as an example: it is correlated with multiple attributes, including gender, education, and occupation; for instance, working as a doctor strongly implies a high salary.

To overcome these limitations, our goal is to construct a comprehensive and realistic model of the attacker's knowledge and use this model to characterize the privacy inference process. The significance of our work lies in providing a better understanding of the attacker's prior knowledge and privacy inference, demonstrating the effectiveness of our method, alerting social network providers to privacy leakage risks, and offering implications to researchers for designing general privacy protection (e.g., anonymization) techniques.

We face three challenges. First, it is hard to build an expressive model that covers all of the attacker's prior knowledge, given that she may possess various kinds of knowledge, ranging from node profiles and degrees to link relations and neighborhood subgraphs (called structural properties), some of which may even be probabilistic. Second, it is difficult to model the privacy inference steps, since the attacker may have various capabilities and techniques. She could be computationally powerful or weak. She may know some correlations among attributes, learned from information sources or by data mining. She may design her own algorithms to make inferences with her prior knowledge and computational power. Third, it is challenging to quantify privacy disclosure. There has been no explicit and unified