Graph-based Features for Supervised Link Prediction
William Cukierski, Benjamin Hamner, Bo Yang
Abstract— The growing ubiquity of social networks has
spurred research in link prediction, which aims to predict new
connections based on existing ones in the network. The 2011
IJCNN Social Network challenge asked participants to separate
real edges from fake in a set of 8960 edges sampled from an
anonymized, directed graph depicting a subset of relationships
on Flickr. Our method incorporates 94 distinct graph features,
used as input for classification with Random Forests. We present
a three-pronged approach to the link prediction task, along with
several novel variations on established similarity metrics. We
discuss the challenges of processing a graph with more than a
million nodes. We found that the best classification results were
achieved through the combination of a large number of features
that model different aspects of the graph structure. Our method
achieved an area under the receiver-operator characteristic
(ROC) curve of 0.9695, the 2nd best overall score in the
competition and the best score which did not de-anonymize
the dataset.
I. INTRODUCTION
Directed graphs encapsulate relationships in social net-
works, with nodes representing members of the network and
edges signifying the relations between them. Link prediction,
the task of forecasting new connections based on existing
ones, is a topic of growing importance as digital networks
grow in size and ubiquity [1], [2], [3], [4]. The study
of network dynamics has numerous applications. Marketers
would like to recommend products or services based on
existing preferences or contacts. Social networking web-
sites would like to customize suggestions for new friends
and groups. Financial corporations would like to monitor
transaction networks for fraudulent activity. In recognition
of the growing demand for accurate link prediction, the
2011 IJCNN Social Network challenge asked participants
to separate real edges from fake edges in a set of 8960
edges sampled from an anonymized, directed graph depicting
a subset of relationships on Flickr.
The graph representation of social networks allows exist-
ing and powerful methods of graph theory to be applied to
the task of link prediction. We refer the interested reader
to Liben-Nowell and Kleinberg’s paper [2] for an excellent
review of this topic. Though social networks share common
representations as a graph, edges can have very different
meanings, both within the same network and across different
networks. What does an edge on the Flickr graph represent?
At face value, each node comprises a user and each edge
indicates a friendship, which need not be mutual. However,
there are numerous reasons for friendship on a photo sharing
site. It may be that two users are friends in real life, or
they may share interest in a common subject matter, or
they may share interest in a common style of photography.
Recognizing the disparate meanings of graph edges leads to
new interpretations of traditional link prediction methods.

William Cukierski is with the Biomedical Engineering Department at
Rutgers University and the University of Medicine and Dentistry of New
Jersey.
Benjamin Hamner is a Whitaker Fellow at the École Polytechnique
Fédérale de Lausanne.
Bo Yang is a software developer in the Vancouver, Canada area.

Our approach to the IJCNN Social Network Challenge
follows a classical paradigm in supervised learning, start-
ing with feature extraction, then preprocessing, and lastly
repeated classification using the posterior probabilities from
Random Forests [5]. Instead of presenting a single novel
methodology for link prediction, the foremost contribution
of this paper is in the breadth and variety of techniques
incorporated into the feature extraction step. Sec. II describes
this process in detail, beginning with subgraph extraction,
continuing with descriptions of the features, and ending with meta
approaches that build valuable new predictors from these features. Aspects
of the work which are novel, such as the application of
Bayesian Sets and the development of the three-problem
approach, are discussed in more detail at the end of the
section.
II. FEATURE EXTRACTION
We now introduce notation used throughout the paper. A
graph $G$ is a set of vertices and directed edges $(V, E)$, with
associated adjacency matrix $A$. When considering whether
a specific edge $e$ is real or fake, we label the outbound
(source) node $v_s$ and the inbound (destination) node $v_d$.
$\Gamma(v)$ denotes all neighbors of node $v$, while
$\Gamma_{in}(v)$ and $\Gamma_{out}(v)$ denote just
the inbound and outbound set of neighbors, respectively. We
abbreviate the $i$th column of $A$ as $A(:, i)$, and the $i$th row as
$A(i, :)$. The undirected form of the adjacency matrix, $A_u$, is
$A \vee A^T$. The area under the receiver-operator characteristic
(ROC) curve is abbreviated AUC.
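As a minimal sketch of this notation, the snippet below builds an adjacency matrix for a toy directed graph and derives the inbound, outbound, and combined neighbor sets, along with the undirected matrix formed by OR-ing the matrix with its transpose. It assumes a convention the text does not state, namely that the row index is the source node and the column index is the destination node. The function names and the toy edge list are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # (source, destination); this ordering is an assumption


def adjacency_matrix(n: int, edges: List[Edge]) -> List[List[int]]:
    """Return A with A[s][d] = 1 when the directed edge s -> d exists."""
    A = [[0] * n for _ in range(n)]
    for s, d in edges:
        A[s][d] = 1
    return A


def out_neighbors(A: List[List[int]], v: int) -> Set[int]:
    """Outbound neighbors of v: nonzero entries of row v."""
    return {j for j, a in enumerate(A[v]) if a}


def in_neighbors(A: List[List[int]], v: int) -> Set[int]:
    """Inbound neighbors of v: nonzero entries of column v."""
    return {i for i, row in enumerate(A) if row[v]}


def neighbors(A: List[List[int]], v: int) -> Set[int]:
    """All neighbors of v, regardless of edge direction."""
    return in_neighbors(A, v) | out_neighbors(A, v)


def undirected(A: List[List[int]]) -> List[List[int]]:
    """Element-wise OR of A with its transpose."""
    n = len(A)
    return [[A[i][j] | A[j][i] for j in range(n)] for i in range(n)]


# Toy graph with four nodes (illustrative data, not from the challenge).
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
A = adjacency_matrix(4, edges)
A_u = undirected(A)

print(out_neighbors(A, 0), in_neighbors(A, 0), neighbors(A, 0))
```

A dense list-of-lists matrix is used here only for readability. For a graph with more than a million nodes, a sparse representation would be needed in practice.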
Link prediction methods are built on the hypothesis that
similar nodes are more likely to form connections. To this
end, a link prediction method can be any general function
$f(v_s, v_d)$ which produces a similarity ranking between all
pairs of nodes $(v_s, v_d)$ in the graph. It is important to distinguish
between the prediction task of the IJCNN challenge (namely,
to separate real from fake edges in a given test set) and that
of real-world link prediction, where one instead asks one of
two questions:
1) Given a user $v$, with whom is that user most likely to
connect?
2) Given a graph $G$, which user $v$ is most likely to make
a connection, and with whom?
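The distinction between the challenge task and the first real-world question can be illustrated with a small sketch. It uses the common-neighbors count, a standard similarity measure covered in the review [2], as a stand-in for the similarity function; it is not claimed to be the authors' classifier or feature set. The toy graph, the test edges, and their labels are invented for illustration, and all function names are hypothetical. The AUC is computed directly from its pairwise definition (the fraction of real/fake pairs in which the real edge scores higher, with ties counted as one half).

```python
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # node -> set of outbound neighbors
Similarity = Callable[[Graph, int, int], float]


def all_neighbors(g: Graph, v: int) -> Set[int]:
    """Inbound and outbound neighbors of v combined."""
    inbound = {u for u, outs in g.items() if v in outs}
    return g.get(v, set()) | inbound


def common_neighbors(g: Graph, s: int, d: int) -> float:
    """Stand-in similarity: number of neighbors shared by s and d."""
    return float(len(all_neighbors(g, s) & all_neighbors(g, d)))


def auc(scores: List[float], labels: List[int]) -> float:
    """Fraction of (real, fake) pairs where the real edge scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def rank_candidates(g: Graph, f: Similarity, s: int) -> List[int]:
    """Question 1: rank every node that s does not already link to."""
    candidates = [d for d in g if d != s and d not in g[s]]
    return sorted(candidates, key=lambda d: f(g, s, d), reverse=True)


# Toy directed graph (illustrative only).
g: Graph = {0: {1, 2}, 1: {2}, 2: {3}, 3: {1}, 4: {0}}

# Challenge-style task: only a fixed list of candidate edges is scored.
test_edges: List[Tuple[int, int]] = [(0, 3), (4, 1), (4, 3), (4, 2)]
labels = [1, 1, 0, 0]  # 1 = real, 0 = fake (made-up labels)
scores = [common_neighbors(g, s, d) for s, d in test_edges]
print("AUC on test edges:", auc(scores, labels))

# Real-world question 1: every unlinked node is a candidate for user 0.
print("Ranking for user 0:", rank_candidates(g, common_neighbors, 0))
```

In the challenge setting the number of scored pairs is fixed by the test set, whereas ranking candidates for a single user requires one score per unlinked node, and the second question repeats that work for every user.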
These tasks are considerably more difficult. If the network
is large, one must check candidate nodes to see which
has the highest probability of forming a new connection,
Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011
978-1-4244-9637-2/11/$26.00 ©2011 IEEE 1237