Label-dependent node classification in the network☆

Przemyslaw Kazienko, Tomasz Kajdanowicz*

Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, Wroclaw, Poland

Article info
Available online 17 September 2011

Keywords: Classification; Node classification; Label-dependent classification; Label-dependent features; Collective classification; Classification in networks; LDBootstrapping; LDGibbs; Bootstrapping; Gibbs sampling

Abstract

Relations between objects in various systems, such as hyperlinks connecting web pages, citations of scientific papers, conversations via email, or social interactions in Web 2.0 portals, are commonly modeled by networks. One of many interesting problems currently studied for such domains is node classification. Due to the nature of networked data, and because a broad, representative collection of nodes is unavailable for training in the majority of environments, only very limited data may remain useful for classification. Therefore, there is a need for accurate and efficient algorithms that classify well based only on scanty knowledge of network nodes. This paper proposes LDGibbs, a new sampling algorithm used in the context of collective classification with label-dependent features, in order to provide more accurate generalization for sparse datasets. Additionally, a new LDBootstrapping algorithm based on label-dependent features has been developed. Both new algorithms include additional steps that extract new input features from graph structures, limited only to the nodes of a given label. This means that a separate set of structural features is provided for each label. The comparison with other approaches, in particular with standard Gibbs sampling and bootstrapping, provided satisfactory results and revealed LDGibbs's superiority.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Relations between objects in many systems, such as hyperlinks connecting web pages, paper citations, conversations via email, and social interactions in social portals, are commonly modeled by networks. These models then serve as a basis for different types of processing and analyses, in particular node classification (labeling of the nodes in the network), analysis of network dynamics (especially discovery of network evolution), and knowledge and information diffusion. One of the most important directions in research on networks is node classification, which has a deep theoretical background. However, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem of node classification has recently been re-invented and re-implemented.

Classification of nodes in the network may be performed either by means of known profiles of these nodes (the regular concept of classification) or based on information derived from known classes (labels) of the other nodes in the network. This second approach may be easily extended with information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked by many other web pages about sport. In the case of a social network, somebody is likely to be recognized as highly educated if they have close friends (strong, direct relationships in the social network) who are also highly educated. Hence, a form of collective classification should be provided, with simultaneous decision making on every node's label [21,28,35], rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which usually deliver undervalued knowledge.
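The web page example above can be illustrated with a minimal sketch. It is not the LDGibbs or LDBootstrapping algorithm proposed in this paper; it only assumes a plain majority vote over the known labels of the pages linking to a node. The function name neighbor_vote and the toy data are hypothetical.

```python
from collections import Counter


def neighbor_vote(node, in_links, labels):
    """Return the most common known label among the nodes linking to `node`.

    in_links: dict mapping a node to the list of nodes that link to it
    labels:   dict mapping a node to its known label (unlabeled nodes are absent)
    Returns None when no linking node has a known label.
    """
    votes = Counter(labels[n] for n in in_links.get(node, []) if n in labels)
    if not votes:
        return None
    # most_common(1) yields [(label, count)] for the top label
    return votes.most_common(1)[0][0]


# Hypothetical toy data: page "x" is linked by three sport pages and one news page;
# page "y" is linked only by the unlabeled page "x".
in_links = {"x": ["a", "b", "c", "d"], "y": ["x"]}
labels = {"a": "sport", "b": "sport", "c": "sport", "d": "news"}

predicted_x = neighbor_vote("x", in_links, labels)
predicted_y = neighbor_vote("y", in_links, labels)
```

In this sketch, page y receives no label because its only neighbor, x, is itself unlabeled. Classifying each node separately cannot use such links, which is one motivation for deciding the labels of connected nodes collectively.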
To exploit the specific profile of the network and improve classification accuracy, three distinct types of correlations can be considered while classifying an unknown node x [33]:

1. the correlations between the label (class) of a given node x and the known attributes of x, i.e. x's profile (the regular approach to classification, which requires, however, some knowledge about x's profile),
2. the correlations between the label of node x and all known attributes of nodes in x's neighborhood, including the known labels of the neighbors,
3. the correlations between the label of x and the nodes with unknown labels in the neighborhood of x; these may also include the attributes (profiles) of those nodes.

Neurocomputing 75 (2012) 199–209
Journal homepage: www.elsevier.com/locate/neucom
0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2011.04.047
☆ This work was supported by The Polish Ministry of Science and Higher Education, the research project, 2011–12, the research project, 2010–13, and a Fellowship co-financed by the European Union within the European Social Fund.
* Corresponding author. E-mail addresses: kazienko@pwr.wroc.pl (P. Kazienko), tomasz.kajdanowicz@pwr.wroc.pl (T. Kajdanowicz).

In general, the term collective classification refers to the combined classification of interlinked nodes using all three above-mentioned types of information [33]. The third kind of correlations