Semi-supervised Classification in Graphs using Bounded Random Walks

Keywords: semi-supervised learning, large graphs, betweenness measure, passage times

Jérôme Callut Jerome.Callut@uclouvain.be
Kevin Françoisse Kevin.Francoisse@uclouvain.be
Marco Saerens Marco.Saerens@uclouvain.be
UCL Machine Learning Group (MLG), Louvain School of Management, IAG, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Pierre Dupont Pierre.Dupont@uclouvain.be
UCL Machine Learning Group (MLG), Department of Computing Science and Engineering, INGI, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Abstract

This paper describes a novel technique, called D-walks, to tackle semi-supervised classification problems in large graphs. We introduce a betweenness measure based on passage times during random walks of bounded length in the input graph. The class of each unlabeled node is predicted by maximizing its betweenness with labeled nodes. This approach can deal with directed or undirected graphs, with a time complexity that is linear in the number of edges and the maximum walk length considered. Preliminary experiments on the CORA database show that D-walks outperforms NetKit (Macskassy & Provost, 2007) as well as the algorithm of Zhou et al. (2005), both in classification rate and computing time.

1. Introduction

This paper is concerned with semi-supervised classification of nodes in a graph. Given an input graph in which some nodes are labeled, the problem is to predict the missing node labels. This problem has numerous applications, such as the classification of individuals in social networks, the categorization of linked documents, or protein function prediction, to name a few.

Several approaches have been proposed to tackle semi-supervised classification problems in graphs.
Kernel methods (Zhou et al., 2005; Tsuda & Noble, 2004) embed the nodes of the input graph into a Euclidean feature space where a classifier, such as an SVM, can be estimated. Despite their good predictive performance, these techniques cannot easily scale up to large problems due to their high time complexity. NetKit is an alternative relational learning approach (Macskassy & Provost, 2007). It has a lower computational complexity but is conceptually less simple and may require fine-tuning of several of its components.

The approach proposed in this paper, called D-walks, relies on random walks performed on the input graph seen as a Markov chain. More precisely, a betweenness measure, based on passage times during random walks of bounded length, is derived for each class (or label category). Unlabeled nodes are assigned to the category for which the betweenness is the highest. The D-walks approach has the following properties: (i) it has a time complexity that is linear in the number of edges and the maximum walk length considered, which makes it possible to deal with very large graphs, (ii) it can handle directed or undirected graphs, (iii) it can deal with multi-class problems, and (iv) it has a single hyper-parameter that can be tuned efficiently.

2. Discriminative random walks

We are given an input graph G containing a set of nodes N and a set of edges E. The (possibly weighted) adjacency matrix is denoted A. The graph G is assumed to be partially labeled. The nodes in the labeled set L ⊂ N are assigned to a category from a discrete set Y. The unlabeled set is defined as U = N \ L. Random walks in a graph can be modeled by a

Proceedings of the 17th Annual Machine Learning Conference of Belgium and the Netherlands (Benelearn), pp. 67–68, Liege, Belgium, May 19-20, 2008.