Multi-Relational Link Prediction in Heterogeneous Information Networks

Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla
Interdisciplinary Center for Network Science and Applications
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN, 46556 US
{ddavis4,rlichten,nchawla}@nd.edu

Abstract: Many important real-world systems, modeled naturally as complex networks, have heterogeneous interactions and complicated dependency structures. Link prediction in such networks must model the influences between heterogeneous relationships and distinguish the formation mechanisms of each link type, a task that is beyond the simple topological features commonly used to score potential links. In this paper, we introduce a novel probabilistically weighted extension of the Adamic/Adar measure for heterogeneous information networks, which we use to demonstrate the potential benefits of diverse evidence, particularly in cases where homogeneous relationships are very sparse. However, we also expose some fundamental flaws of traditional a priori link prediction. In accordance with previous research on homogeneous networks, we further demonstrate that a supervised approach to link prediction can enhance performance and is easily extended to the heterogeneous case. Finally, we present results on three diverse, real-world heterogeneous information networks and discuss the trends and tradeoffs of supervised and unsupervised link prediction in a multi-relational setting.

I. INTRODUCTION

In network science, the link prediction task can be broadly generalized as follows: given a distinct source node s and target node t, predict whether the node pair has a relationship or, in the case of dynamic interactions, will form one in the near future [1]. In many real-world scenarios, link prediction can be applied to anticipate future behavior or to identify probable relationships that are difficult or expensive to observe directly.
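To make the task definition above concrete, the following minimal sketch scores every unconnected node pair (s, t) in a small homogeneous graph and ranks the pairs by score. It uses the standard, unweighted Adamic/Adar measure (each common neighbor z contributes 1/log(degree(z))), not the probabilistically weighted extension proposed in this paper. The toy edge list and the helper names `build_adjacency`, `adamic_adar`, and `rank_candidates` are assumptions made purely for illustration.

```python
import math
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def adamic_adar(adj, s, t):
    """Standard Adamic/Adar score for the pair (s, t).

    Sums 1 / log(degree(z)) over the common neighbors z of s and t.
    A common neighbor of two distinct nodes has degree >= 2, so the
    logarithm is always positive.
    """
    common = adj[s] & adj[t]
    return sum(1.0 / math.log(len(adj[z])) for z in common)


def rank_candidates(adj):
    """Score all currently unconnected pairs, highest score first."""
    nodes = sorted(adj)
    scored = [
        ((s, t), adamic_adar(adj, s, t))
        for s, t in combinations(nodes, 2)
        if t not in adj[s]
    ]
    return sorted(scored, key=lambda item: -item[1])


# Hypothetical toy network (illustrative data only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
ranking = rank_candidates(build_adjacency(edges))

for pair, score in ranking:
    print(pair, round(score, 3))
```

In this sketch the highest-ranked pair is the predicted link. An unsupervised predictor of this kind produces only a ranking; choosing a score threshold for declaring a link is a separate decision.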
In social networks, link prediction can be used to predict relationships that will form, uncover relationships that probably exist but have not been observed, or even assist individuals in forming new connections [2]. In biomedicine, where exhaustive, reliable experimentation is usually not viable, link prediction techniques such as disease-gene candidate detection are valuable for navigating incomplete data, as well as for guiding lab resources toward the most probable interactions.

Many interesting real-world systems form complex networks with multiple distinct types of inter-related objects and relationships. Structures of this type are broadly defined as heterogeneous information networks [3], an umbrella term that encapsulates multi-mode, multi-dimensional, multi-relational, and bipartite networks [4]. Link prediction in these networks has typically been performed either by treating all relationships equally or by separately studying homogeneous projections of the networks and ignoring dependency patterns across types. Both of these approaches lose information. Different edge types may have different topologies or link formation mechanisms but nonetheless influence each other, such that various combinations have different relevance to the link prediction task.

For example, heterogeneous relationships such as friendship, family, and colleague are often modeled as indistinct in social networks. In reality, though, it may be far more probable for a person to form a new interaction with the colleague of a colleague than with the mother of a colleague. In the biological domain, networks are often modeled from a singular view of the cell, such as physical protein interaction only. Since many current molecular data sets are unreliable and many interaction types are correlated, it seems natural that integrating diverse information would be mutually beneficial.
There is just one problem: the lack of effective, general methods for link prediction in heterogeneous information networks.

In this paper, we describe and evaluate two new approaches to the link prediction task in heterogeneous information networks. In Section II, we describe three real-world heterogeneous data sources and our evaluation framework. In Section III, we provide a brief survey of standard link prediction methods. We then propose a probabilistic weighting scheme for extending the Adamic/Adar measure to heterogeneous data in Section III-C. Our experiments demonstrate that this extension can effectively use diverse evidence to improve performance over Adamic/Adar in many cases. On the other hand, our results also highlight the fact that extending existing methods cannot overcome the domain-specificity of unsupervised link predictors. In Section IV, we show that, just as in homogeneous problems [5], supervised link prediction is still the best available choice in terms of performance. This fundamental change to a general, variance-controlled, data-driven model is even more important in heterogeneous networks, where multiple distributions and decision boundaries are more of a guarantee than a possibility. Finally, we discuss interesting observations and conclusions in Section V.

II. DATA

We use three real-world heterogeneous information networks to demonstrate the methods in this paper. The networks