A Matrix Alignment Approach for Collective Classification

Jerry Scripps, Pang-Ning Tan, Feilong Chen and Abdol-Hossein Esfahanian
Computer Science and Engineering
Michigan State University
E. Lansing, MI 48824
{scripps,ptan,chenfeil,esfahani}@msu.edu

Abstract—Within networks there is often a pattern to the way nodes link to one another. It has been shown that the accuracy of node classification can be improved by using the link data. One of the challenges in integrating the attribute and link data, though, is balancing the influence that each has on the classification decision. In this paper we present a matrix alignment approach to the problem of collective classification which weights the attributes and the links according to their predictive influence. The experiments show that our approach provides prediction accuracy comparable to other methods while also being very fast and descriptive.

I. INTRODUCTION

Although predicting the class of an instance is a well-established area of research, classification of nodes in a network remains an active subject of study [1], [4], [10], [8], [2], [7]. Networks have links between nodes, which potentially offer additional information for making more accurate predictions. However, making use of the links has proven to be more than just extending an existing model. The reason is the different nature of links and attributes. Classifiers use attribute information, which ordinarily results in a simple $n \times d$ data matrix: there are $n$ nodes, each having $d$ attributes. Links establish relationships between nodes, commonly represented by an $n \times n$ adjacency matrix. The problem of utilizing the link data becomes one of integrating the node-based data matrix with the node-pair-based adjacency matrix.

Figure 1. Sample network

The problem is illustrated in Figure 1. In this simple network, the shape of a node reveals its class. Imagine that we wish to decide whether the node at the bottom belongs to the square or the oval class.
In this network, we can deduce that the first two attributes, as well as the links, are helpful in determining the class. The new node has a strong attribute similarity to the oval nodes but is strongly linked to the square ones. It would be helpful in this case to know the relative importance of links versus attributes when predicting the class.

In this paper we present a new approach to the problem of collective classification. Our framework learns a set of weights that can be applied to the attributes and the links simultaneously. Besides being predictive, the weights also provide a relative measure of the importance of not only the attributes but also the links. This means that we can assign nodes to classes and then explain why they were assigned to those classes. In our formulation we assume that there are no missing attribute or link values.

The methodology for our approach is described in Section II. The results of our experiments are presented in Section III. Section IV discusses other work in this area, and the conclusions are presented in Section V.

II. METHODOLOGY

Using both the link and attribute data to predict the class of a node is inherently troublesome because of the nature of the two data sources. Attribute data is node based, while link data is node-pair based. Integrating the two types provides an interesting challenge. To meet this challenge, the goal of the matrix alignment approach is to form a relationship between the links and the attributes by using $n \times n$ node-pair matrices.

Let a network be represented as a graph $G = (V, E)$, where $V = \{1, 2, \ldots, |V|\}$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. The adjacency matrix $A = (a_{ij})_{n \times n}$, with $n = |V|$, represents the link structure, where $a_{ij} = 1$ if there is a link between nodes $i$ and $j$ and $a_{ij} = 0$ otherwise. Each node $v \in V$ references an object (e.g., a person, gene, or web page) and is associated with a set of $d$ attributes.
The attribute information for each node can be encoded in an $n \times d$ data matrix $X = (x_{ik})_{n \times d}$, where each row of the matrix corresponds to a vector of attribute values for a node in the network. For brevity, we assume the node attributes for all the networks considered in this study have been properly normalized or standardized. As a result, we may represent the attribute similarity between nodes using the $n \times n$ matrix $XX^T$, whose $(i, j)$ entry is the dot product of the attribute vectors of nodes $i$ and $j$.
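
As an illustration of these definitions, the following minimal NumPy sketch builds an adjacency matrix $A$ and an attribute similarity matrix $XX^T$ for a toy network. The edge list, the attribute values, and the helper names (`adjacency_matrix`, `standardize`, `attribute_similarity`) are hypothetical and do not come from the paper. The sketch also assumes undirected links, column-wise z-score standardization, and 0-based node indices, whereas the text numbers nodes from 1.

```python
import numpy as np


def adjacency_matrix(n, edges):
    """Build an n x n 0/1 adjacency matrix from an edge list.

    Assumption: links are undirected, so the matrix is symmetric.
    """
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1
    return A


def standardize(X):
    """Column-wise z-score (one possible reading of 'normalized or standardized')."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (X - mean) / std


def attribute_similarity(X):
    """Return X X^T; entry (i, j) is the dot product of rows i and j of X."""
    return X @ X.T


# Hypothetical toy network: 4 nodes on a path, 2 attributes per node.
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)

X_raw = [[1.0, 0.0],
         [0.9, 0.1],
         [0.1, 0.8],
         [0.0, 1.0]]
X = standardize(X_raw)
S = attribute_similarity(X)

print(A)
print(np.round(S, 2))
```

Both `A` and `S` are $n \times n$ node-pair matrices, which is what allows the link structure and the attribute information to be compared entry by entry.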