Label-dependent node classification in the network ☆

Przemyslaw Kazienko, Tomasz Kajdanowicz *

Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, Wroclaw, Poland

Article info

Available online 17 September 2011

Keywords: Classification; Node classification; Label-dependent classification; Label-dependent features; Collective classification; Classification in networks; LDBootstrapping; LDGibbs; Bootstrapping; Gibbs sampling

Abstract

Relations between objects in various systems, such as hyperlinks connecting web pages, citations of scientific papers, conversations via email or social interactions in Web 2.0 portals, are commonly modeled by networks. One of many interesting problems currently studied for such domains is node classification. Due to the nature of networked data and the unavailability, in the majority of environments, of a broad, representative collection of nodes for training, only very limited data may remain useful for classification. Therefore, there is a need for accurate and efficient algorithms that are able to classify well based only on scanty knowledge of network nodes. A new sampling algorithm, LDGibbs, used in the context of collective classification with label-dependent features, is proposed in the paper in order to provide more accurate generalization for sparse datasets. Additionally, a new LDBootstrapping algorithm based on label-dependent features has been developed. Both new algorithms include additional steps that extract new input features from graph structures, restricted to the nodes of a given label. This means that a separate set of structural features is provided for each label. The comparison with other approaches, in particular with standard Gibbs sampling and bootstrapping, provided satisfactory results and revealed LDGibbs's superiority.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Relations between objects in many different systems, such as hyperlinks connecting web pages, paper citations, conversations via email and social interactions in social portals, are commonly modeled by networks. These models then serve as a basis for various types of processing and analysis, in particular for node classification (labeling of the nodes in the network), analysis of network dynamics (especially discovery of network evolution), as well as knowledge and information diffusion.

One of the most important directions in research on networks is node classification, which has a deep theoretical background. However, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem of node classification has recently been re-invented and re-implemented.

Classification of nodes in the network may be performed either by means of known profiles of these nodes (the regular concept of classification) or based on information derived from known classes (labels) of the other nodes in the network. This second approach may be easily extended by information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked by many other web pages about sport. In the case of a social network, somebody is likely to be recognized as highly educated if he or she has close friends (strong, direct relationships in the social network) who are also highly educated. Hence, a form of collective classification should be provided, with simultaneous decision making on every node's label [21,28,35], rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which usually carry undervalued knowledge.
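The web page example above can be sketched as a simple neighbor-vote rule. This is only an illustrative baseline under stated assumptions, not the LDGibbs or LDBootstrapping algorithm proposed in the paper: the function names, the `in_links` dictionary (node to list of nodes linking to it) and the tie-breaking rule are all hypothetical choices made for this sketch.

```python
from collections import Counter


def neighbor_label_shares(in_links, known_labels, node):
    """Return, per label, the fraction of the node's labeled neighbors carrying it.

    in_links: dict mapping a node to the nodes linking to it (assumed layout).
    known_labels: dict mapping already labeled nodes to their label.
    Neighbors without a known label are ignored.
    """
    counts = Counter(
        known_labels[n] for n in in_links.get(node, ()) if n in known_labels
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def classify_by_neighbors(in_links, known_labels, node):
    """Assign the most frequent label among labeled neighbors, or None if there are none.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    shares = neighbor_label_shares(in_links, known_labels, node)
    if not shares:
        return None
    return max(sorted(shares), key=shares.get)


# Page x is linked by four pages; two are known to be about sport,
# one about news, and one has no known label.
in_links = {"x": ["a", "b", "c", "d"]}
known_labels = {"a": "sport", "b": "sport", "c": "news"}
print(classify_by_neighbors(in_links, known_labels, "x"))  # prints: sport
```

Note that this rule labels each node separately from its already labeled neighbors; collective classification, as discussed in the text, would additionally decide on the labels of connected unknown nodes simultaneously.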
To exploit the specific profile of the network in order to improve the accuracy of classification, three distinct types of correlations can be considered while classifying an unknown node x [33]:

1. the correlations between the label (class) of a given node x and the known attributes of x, i.e. x's profile (the regular approach to classification, which requires, however, some knowledge about x's profile),
2. the correlations between the label of node x and all known attributes of nodes in x's neighborhood, including the known neighbors' labels,
3. the correlations between the label of x and the nodes with unknown labels in the neighborhood of x; these may also include the attributes (profiles) of these nodes.

☆ This work was supported by The Polish Ministry of Science and Higher Education, the research project, 2011–12, the research project, 2010–13, and a Fellowship co-financed by the European Union within the European Social Fund.
* Corresponding author.
E-mail addresses: kazienko@pwr.wroc.pl (P. Kazienko), tomasz.kajdanowicz@pwr.wroc.pl (T. Kajdanowicz).
Neurocomputing 75 (2012) 199–209. ISSN 0925-2312. doi:10.1016/j.neucom.2011.04.047. Journal homepage: www.elsevier.com/locate/neucom

In general, the term collective classification refers to the combined classification of interlinked nodes using all three of the above-mentioned types of information [33]. The third kind of correlations