An Evolutionary Approach to Constructive Induction for Link Discovery

Tim Weninger, William H. Hsu, Jing Xia, Waleed Aljandal
234 Nichols Hall
Kansas State University
Manhattan, KS 66506
{weninger, bhsu, xiajing, waleed}@ksu.edu

ABSTRACT

This paper presents a genetic programming-based symbolic regression approach to the construction of relational features in link analysis applications. Specifically, we consider the problems of predicting, classifying and annotating friends relations in friends networks, based upon features constructed from network structure and user profile data. We first document a data model for the blog service LiveJournal, and define a set of machine learning problems such as predicting existing links and estimating inter-pair distance. Next, we explain how the problem of classifying a user pair in a social network, as directly connected or not, poses the problem of selecting and constructing relevant features. We use genetic programming to construct features, represented by multiple symbol trees with base features as their leaves. In this manner, the genetic program selects and constructs features that may not have been originally considered, but possess better predictive properties than the base features. Finally, we present classification results and compare these results with those of the control and similar approaches.

General Terms
Design, Experimentation

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Learning; I.1.1 [Computing Methodologies]: Expressions and Their Representation—symbolic regression

Keywords
machine learning, classification, genetic programming

1. INTRODUCTION

Traditional data mining tasks such as association rule mining or market basket analysis attempt to find patterns in a dataset. According to Getoor, “This is consistent with the classical statistical inference problem of trying to identify a model given a random sample from a common underlying distribution” [7].
However, it is important to also mine datasets that are relational, semi-structured or otherwise consist of links between various entities. These links can be explicit, such as an anchor tag in a web page, or implied, such as a join operation in a relational database. As shown by the PageRank algorithm used by the popular search engine Google, link existence can be exploited to improve the predictive accuracy of learned models [4]. Intuitively, attributes of linked objects are often more closely related than those of unlinked objects, and links are more likely to exist between objects that share common attributes [7].

Copyright is held by the author/owner(s). GECCO’09, July 8–12, 2009, Montréal Québec, Canada. ACM 978-1-60558-505-5/09/07.

Feature construction in the multi-relational setting is also possible. Traditionally, the attributes of an object provide the basic description of the object. However, by leveraging the information contained in the relationships of objects, more information about the object can be gleaned, giving the learning algorithm an appropriate context for a better induction model.

This work focuses mainly on the construction of features in order to improve link discovery, i.e., predicting the existence of links between objects. In order to discover links not previously known to exist, a genetic programming approach is used to construct features that appropriately leverage the knowledge contained in the presence or absence of links.

A brief overview of related material is given in the following sections; readers unfamiliar with constructive induction, social networks, or link mining are referred to [27, 20, 8, 26, 7].

1.1 Constructive Induction

We consider a genetic programming approach to theory formation that consists of relational feature construction.
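As a rough illustration of relational feature construction, the following Python sketch evaluates a constructed feature, written as an expression tree with base features at its leaves, on candidate user pairs. Everything in it is an assumption made for illustration rather than the system described in this paper: the toy network, the two base features, the operator set, and in particular the `fitness` function, which substitutes a fixed threshold rule for the inductive learning algorithms used to score features.

```python
import operator

# Hypothetical toy friends network and interest lists (illustrative data only).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat"},
    "eve": set(),
}
INTERESTS = {
    "ann": {"chess", "jazz"},
    "bob": {"chess", "jazz", "golf"},
    "cat": {"jazz"},
    "dan": {"golf"},
    "eve": {"chess"},
}

# Base features: functions of a candidate pair (u, v).
def mutual_friends(u, v):
    """Number of friends shared by u and v, excluding u and v themselves."""
    return len((FRIENDS[u] - {v}) & (FRIENDS[v] - {u}))

def mutual_interests(u, v):
    """Number of interests listed by both u and v."""
    return len(INTERESTS[u] & INTERESTS[v])

BASE_FEATURES = {
    "mutual_friends": mutual_friends,
    "mutual_interests": mutual_interests,
}
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, u, v):
    """Evaluate an expression tree on the pair (u, v).

    A tree is a base-feature name (leaf), a numeric constant (leaf),
    or a tuple (op, left, right) for an internal node.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPERATORS[op](evaluate(left, u, v), evaluate(right, u, v))
    if isinstance(tree, str):
        return BASE_FEATURES[tree](u, v)
    return tree

def fitness(tree, labeled_pairs, threshold=1):
    """Stand-in fitness: accuracy of predicting a link when feature > threshold.

    Assumption: a real system would train and validate a classifier here.
    """
    correct = sum(
        (evaluate(tree, u, v) > threshold) == linked
        for u, v, linked in labeled_pairs
    )
    return correct / len(labeled_pairs)

# Hypothetical labeled candidate pairs: (user, user, link exists?).
PAIRS = [
    ("ann", "bob", True),
    ("ann", "cat", True),
    ("cat", "dan", True),
    ("ann", "eve", False),
    ("dan", "eve", False),
    ("bob", "dan", False),
]

# One candidate constructed feature: mutual_friends + mutual_interests.
CANDIDATE = ("+", "mutual_friends", "mutual_interests")

if __name__ == "__main__":
    for u, v, linked in PAIRS:
        print(u, v, evaluate(CANDIDATE, u, v), linked)
    print("fitness:", fitness(CANDIDATE, PAIRS))
```

In a full genetic program, a population of such trees would be varied by crossover and mutation and ranked by fitness; the sketch shows only the representation and the scoring of a single tree.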
Synthetic features represented by GP expression trees in our framework are produced using classification fitness in a link existence prediction domain. This approach constitutes a kind of descriptive generalization [18]: the formulated theory characterizes a collection of entities. In the case of link mining, the primitive features of the theory are functions of candidate pairs, e.g., the number of mutual friends and interests shared by two users in a social network. The role of GP is to synthesize functions of these relational primitive features that are validated using inductive learning algorithms, producing a fitness value for each synthetic feature.

Most types of machine learning problems (including link mining) can be viewed as search problems [19] involving a large hypothesis space, that is, the space consisting of all possible theories under consideration. In this paradigm the goal of the search is to find the best theory with respect to the available training examples.

Constructive induction is the process of improving the attribute vector of a learning algorithm in order to make the