Graph-based Features for Supervised Link Prediction

William Cukierski, Benjamin Hamner, Bo Yang

Abstract— The growing ubiquity of social networks has spurred research in link prediction, which aims to predict new connections based on existing ones in the network. The 2011 IJCNN Social Network Challenge asked participants to separate real edges from fake in a set of 8960 edges sampled from an anonymized, directed graph depicting a subset of relationships on Flickr. Our method incorporates 94 distinct graph features, used as input for classification with Random Forests. We present a three-pronged approach to the link prediction task, along with several novel variations on established similarity metrics. We discuss the challenges of processing a graph with more than a million nodes. We found that the best classification results were achieved through the combination of a large number of features that model different aspects of the graph structure. Our method achieved an area under the receiver operating characteristic (ROC) curve of 0.9695, the 2nd best overall score in the competition and the best score that did not de-anonymize the dataset.

I. INTRODUCTION

Directed graphs encapsulate relationships in social networks, with nodes representing members of the network and edges signifying the relations between them. Link prediction, the task of forecasting new connections based on existing ones, is a topic of growing importance as digital networks grow in size and ubiquity [1], [2], [3], [4]. The study of network dynamics has numerous applications. Marketers would like to recommend products or services based on existing preferences or contacts. Social networking websites would like to customize suggestions for new friends and groups. Financial corporations would like to monitor transaction networks for fraudulent activity.
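As a minimal illustration of how existing edges can inform predictions about new ones, the sketch below builds a toy directed graph and scores candidate edges by counting shared neighbors. The common-neighbor count is one established similarity metric and is used here only as an assumed example. The edge data and the function names build_neighbors and common_neighbor_score are hypothetical; they are not the competition data or the features used by the authors.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each node to the set of nodes it touches, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return neighbors


def common_neighbor_score(neighbors, src, dst):
    """Count the nodes adjacent to both endpoints of a candidate edge."""
    # .get avoids creating empty entries for nodes that are not in the graph.
    return len(neighbors.get(src, set()) & neighbors.get(dst, set()))


# Toy directed graph (hypothetical data): each pair is (outbound, inbound).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("d", "b"), ("e", "f")]
neighbors = build_neighbors(edges)

# Score two candidate edges; a higher score suggests a more plausible link.
candidates = [("a", "d"), ("a", "e")]
scores = {edge: common_neighbor_score(neighbors, *edge) for edge in candidates}
```

In this toy graph the candidate (a, d) scores 2 because both endpoints touch b and c, while (a, e) scores 0 because the two nodes share no neighbors.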
In recognition of the growing demand for accurate link prediction, the 2011 IJCNN Social Network Challenge asked participants to separate real edges from fake edges in a set of 8960 edges sampled from an anonymized, directed graph depicting a subset of relationships on Flickr.

The graph representation of social networks allows existing and powerful methods of graph theory to be applied to the task of link prediction. We refer the interested reader to Liben-Nowell and Kleinberg's paper [2] for an excellent review of this topic.

Though social networks share a common representation as a graph, edges can have very different meanings, both within the same network and across different networks. What does an edge on the Flickr graph represent? At face value, each node represents a user and each edge indicates a friendship, which need not be mutual. However, there are numerous reasons for friendship on a photo sharing site. It may be that two users are friends in real life, or they may share interest in a common subject matter, or they may share interest in a common style of photography. Recognizing the disparate meanings of graph edges leads to new interpretations of traditional link prediction methods.

William Cukierski is with the Biomedical Engineering Department at Rutgers University and the University of Medicine and Dentistry of New Jersey. Benjamin Hamner is a Whitaker Fellow at the École Polytechnique Fédérale de Lausanne. Bo Yang is a software developer in the Vancouver, Canada area.

Our approach to the IJCNN Social Network Challenge follows a classical paradigm in supervised learning, starting with feature extraction, then preprocessing, and lastly repeated classification using the posterior probabilities from Random Forests [5]. Instead of presenting a single novel methodology for link prediction, the foremost contribution of this paper is in the breadth and variety of techniques incorporated into the feature extraction step. Sec.
II describes this process in detail, covering subgraph extraction, descriptions of the features, and finally, meta approaches that build valuable new predictors from these features. Aspects of the work which are novel, such as the application of Bayesian Sets and the development of the three-problem approach, are discussed in more detail at the end of the section.

II. FEATURE EXTRACTION

We now introduce notation used throughout the paper. A graph G = (V, E) is a set of vertices V and directed edges E, with associated adjacency matrix A. When considering whether a specific edge is real or fake, we distinguish its outbound node from its inbound node. Γ(v) denotes all neighbors of node v, while Γ_in(v) and Γ_out(v) denote just the inbound and outbound sets of neighbors, respectively. We abbreviate the i-th column of A as A(:, i), and the i-th row as A(i, :). We also use the undirected form of the adjacency matrix. The area under the receiver operating characteristic (ROC) curve is abbreviated AUC.

Link prediction methods are founded on the hypothesis that similar nodes are more likely to form connections. To this end, a link prediction method can be any general function of a node pair that produces a similarity ranking between all pairs of nodes in the graph.

It is important to distinguish between the prediction task of the IJCNN challenge (namely, to separate real from fake edges in a given test set) and that of real-world link prediction, where one instead asks one of two questions:

1) Given a user, with whom is that user most likely to connect?
2) Given a graph, which user is most likely to make a connection, and with whom?

Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011 978-1-4244-9637-2/11/$26.00 ©2011 IEEE 1237

These tasks are considerably more difficult. If the network is large, one must check candidate nodes to see which has the highest probability of forming a new connection,