COALA - Correlation-Aware Active Learning of Link Specifications

Axel-Cyrille Ngonga Ngomo, Klaus Lyko, and Victor Christen
Department of Computer Science, AKSW Research Group
University of Leipzig, Germany
{ngonga|klaus.lyko|christen}@informatik.uni-leipzig.de

Abstract. Link Discovery plays a central role in the creation of knowledge bases that abide by the five Linked Data principles. In recent years, several active learning approaches have been developed and used to facilitate the supervised learning of link specifications. Yet so far, these approaches have not taken the correlation between unlabeled examples into account when requesting labels from their user. In this paper, we address exactly this drawback by presenting the concept of the correlation-aware active learning of link specifications. We then present two generic approaches that implement this concept. The first approach is based on graph clustering and can make use of intra-class correlation. The second relies on the activation-spreading paradigm and can make use of both intra- and inter-class correlations. We evaluate the accuracy of these approaches and compare them against a state-of-the-art link specification learning approach in ten different settings. Our results show that our approaches outperform the state of the art by leading to specifications with higher F-scores.

Keywords: Active Learning, Link Discovery, Genetic Programming

1 Introduction

The importance of links for a large number of tasks, such as question answering [20], keyword search [19], and federated queries, has often been pointed out in the literature (see, e.g., [1]). Two main problems arise when trying to discover links between data sets or even deduplicate data sets. First, naive solutions to Link Discovery (LD) display a quadratic time complexity [13]. Consequently, they cannot be used to discover links across large datasets such as DBpedia¹ or Yago².

¹ http://dbpedia.org
² http://www.mpi-inf.mpg.de/yago-naga/yago/

Time-efficient algorithms such as PPJoin+ [21] and HR³ [11] have been developed to address the problem of the a-priori quadratic runtime of LD approaches. While these approaches achieve practicable runtimes even on large datasets, they do not guarantee the quality of the links that are