Learning Social Networks from Web Documents Using Support Vector Classifiers

Masoud Makrehchi, Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab
Department of Electrical and Computer Engineering
University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
{mmakrehc,mkamel}@uwaterloo.ca

Abstract

Automatic generation of a social network requires extracting pair-wise relations between individuals. In this research, learning a social network from incomplete relationship data is proposed. It is assumed that only a small subset of the relations between the individuals is known. Under this assumption, social network extraction is translated into a text classification problem. The relation between two individuals is modeled by merging their document vectors, and the given relations are used as labels of the training data. By this transformation, a text classifier such as an SVM can be used to learn the unknown relations. We show that there is a link between the intrinsic sparsity of social networks and the class distribution imbalance of the training data. In order to re-balance the unbalanced training data, a minority class down-sampling strategy is employed. The proposed framework is applied to a real FOAF (Friend Of A Friend) database and evaluated by the macro-averaged F-measure.

1 Introduction

A social network is defined as a map of the relationships (ties) between individuals (actors). During the last three years, social networks such as orkut¹ and friendster² have grown substantially in terms of web traffic. The majority of these networks serve personal and socialization purposes. However, they are also of interest for marketing and advertising because of their exponentially increasing traffic. Finding the friend network of a person is a more personal and private issue.
However, in a small community such as a virtual classroom in an e-learning system, by creating the social network online, we can offer students a list of individuals with similar interests who can share their knowledge, questions, comments and interests regarding educational matters. Scenarios might include preparing a course paper, developing course notes or getting feedback about a lecture. In all cases, the system can provide the user with a list of potential friends who can help her do the task.

¹ https://www.orkut.com/
² http://www.friendster.com/

This paper proposes an approach to automatically generating a social network from a collection of web documents. To generate a social network (semi-)automatically, each individual must be represented by a set of features or attributes. Using web resources, every person can be represented by her corresponding documents, which are modeled with the vector space model. In the vector space document representation, each person is described by a set of single-word terms from a local dictionary called the vocabulary. Associating the people in the community with the terms in the vocabulary yields a new structure called the “actor-term matrix”.

The next step is learning social relations from the actor-term database. As in other machine learning applications, if training data is available, the social network can be extracted using a supervised classification approach. In this case, the training data is a set of known relationships among some actors. If no pair-wise relation is known, the learning is unsupervised and is based on pair-wise similarities and clustering.

In this paper, we assume that the social network is partially explored. Using the revealed relations in the social network as training data, a support vector classifier is employed to extract the missing relations and complete the social network.

The paper consists of seven sections.
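
As an illustration of the representation described in this introduction, the following minimal Python sketch builds an actor-term matrix from toy documents and turns actor pairs into classification instances. The actor names, documents, known relations and function names are all hypothetical, and merging two vectors by element-wise addition is an assumption made for the example, since the merging operator is not specified at this point. The classifier itself is omitted; the labeled pairs could be handed to any text classifier, such as an SVM.

```python
from collections import Counter
from itertools import combinations

# Toy data (hypothetical): each actor is represented by the text of her documents.
documents = {
    "alice": "svm text classification kernel",
    "bob": "text classification feature selection",
    "carol": "guitar music concert",
}

# Known relations (hypothetical): unordered actor pair -> 1 (tie) or 0 (no tie).
known_relations = {
    frozenset(("alice", "bob")): 1,
    frozenset(("alice", "carol")): 0,
}


def build_actor_term_matrix(docs):
    """Return (vocabulary, {actor: term-count vector}) using a bag-of-words model."""
    vocabulary = sorted({term for text in docs.values() for term in text.split()})
    matrix = {}
    for actor, text in docs.items():
        counts = Counter(text.split())
        matrix[actor] = [counts[term] for term in vocabulary]
    return vocabulary, matrix


def merge_vectors(u, v):
    """Merge two actor vectors into one pair vector (assumption: element-wise sum)."""
    return [a + b for a, b in zip(u, v)]


def build_pair_dataset(matrix, relations):
    """Split all actor pairs into labeled training pairs and unlabeled pairs."""
    train_x, train_y, unknown = [], [], []
    for a, b in combinations(sorted(matrix), 2):
        pair_vector = merge_vectors(matrix[a], matrix[b])
        label = relations.get(frozenset((a, b)))
        if label is None:
            # No known relation: this pair is what the classifier would predict.
            unknown.append(((a, b), pair_vector))
        else:
            train_x.append(pair_vector)
            train_y.append(label)
    return train_x, train_y, unknown


vocabulary, actor_term = build_actor_term_matrix(documents)
train_x, train_y, unknown = build_pair_dataset(actor_term, known_relations)
```

In this toy setting, two of the three possible pairs carry labels and form the training set, while the remaining pair is the unknown relation to be predicted. In a sparse network most labeled pairs would carry the label 0, which is the class imbalance mentioned in the abstract.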
After the introduction, related work is briefly reviewed in Section 2. In Section 3, the problem statement is described. The proposed approach is detailed in Section 4, and in Section 5 we briefly discuss the data set used in this research. The experimental results and discussion are presented in section

Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06) 0-7695-2747-7/06 $20.00 © 2006