Science and Information Conference 2014
August 27-29, 2014 | London, UK
www.conference.thesai.org

Utilizing Deep Learning for Content-based Community Detection

Hassan Abbas Abdelbary
Computer Science Department, Faculty of Computers and Information, Zagazig University
Elzeraa Square, Zagazig, Egypt
haabdelbary@zu.edu.eg

Abeer Mohamed ElKorany
Computer Science Department, Faculty of Computers and Information, Cairo University
5 Dr. Ahmed Zewail Street, Giza, Egypt
a.korani@fci-cu.edu.eg

Reem Bahgat
Computer Science Department, Faculty of Computers and Information, Cairo University
5 Dr. Ahmed Zewail Street, Giza, Egypt
r.bahgat@fci-cu.edu.eg

Abstract— Online social networks have spread widely in recent years. They enable users to identify other users with common interests and to exchange opinions and expertise. Discovering user communities in social networks has become a major challenge; solving it helps members interact with relevant people who have similar interests. Community detection approaches fall into two categories: the first considers users' networks, while the second utilizes user-generated content. In this paper, a multi-layer community detection model based on identifying topics of interest from user-published content is presented. The model applies a Gaussian Restricted Boltzmann Machine to model users' posts within a social network, identifies their topics of interest, and finally constructs communities. The effectiveness of the proposed multi-layer model is measured using KL divergence, which measures the similarity between users of the same community. Experiments on a real Twitter dataset show that the proposed deep model outperforms traditional community detection models that directly map users into corresponding communities using several baseline techniques.

Keywords—Community detection; Topic modeling; Deep learning; Restricted Boltzmann Machines; Replicated softmax; K-means

I. INTRODUCTION

In the current social web, community activities have rapidly increased. Users join online communities in order to share their ideas, beliefs, and expertise with groups of people who have common interests. Discovering hidden communities from this rich pool of information is considered a significant challenge. A community is a collection of users who share the same interests and who are more likely to interact with each other than with other users in the network, without knowing each other a priori. Discovering these communities is important in many applications, such as marketing, elections, stock indices, and computer science. Community discovery helps to connect relevant people who have similar interests and encourages them to contribute and share more content. Furthermore, it gives insights into the dynamics within each community and provides a good indicator of the status and health of the whole network.

However, discovering common interests shared by users is a fundamental problem in social networks. Two main approaches are used to discover shared interests in social networks. One is user-centric, which focuses on detecting social interests based on the social interaction among users; the other is item-centric, which detects common interests based on common items such as hobbies, behavior, or topics of discussion. The first approach treats the network structure as a graph in which the nodes represent the users and the edges represent the connections among those users, so discovering communities based on network analysis is considered a graph clustering problem. The second approach analyzes the content published by users, which represents their interests, in order to discover communities. Content broadcast by users could be posts, blogs, emails, tags, or tweets; these represent topics that are used to identify each user's interests and hence to detect communities that share the same interests.
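The content-based (item-centric) approach described above can be illustrated with a minimal sketch. It assumes a toy set of four hypothetical users and posts, represents each user by a normalized term-frequency vector, and groups the users with a plain K-means baseline. This is not the RBM-based model proposed in this paper; the names `tf_vector`, `kmeans`, and the sample data are illustrative assumptions only.

```python
from collections import Counter


def tf_vector(text, vocab):
    """Normalized term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) or 1  # avoid division by zero
    return [counts[w] / total for w in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain K-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical users and posts (illustrative data, not from the paper).
posts = {
    "alice": "football match goal team football",
    "bob": "election vote party election",
    "carol": "team goal match league",
    "dave": "vote party campaign election",
}

vocab = sorted({w for text in posts.values() for w in text.lower().split()})
users = list(posts)
labels = kmeans([tf_vector(posts[u], vocab) for u in users], k=2)

# Group users by cluster label to obtain content-based communities.
communities = {}
for user, label in zip(users, labels):
    communities.setdefault(label, []).append(user)

print(communities)
```

On this toy data the two users who write about sports end up in one cluster and the two who write about politics in the other. Real posts would additionally need proper tokenization, stop-word removal, and a principled choice of the number of clusters.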
Clustering users based on their published content can be accomplished by applying unsupervised machine learning techniques such as K-means or expectation maximization, which can be viewed as a probabilistic version of K-means.

The main objective of this paper is to model users' interests based on published content and to group the corresponding users into communities according to mutual interests. The proposed model starts by collecting the content published by each user and modeling the users' interests, which are represented by discrete topic distributions, using a Restricted Boltzmann Machine (RBM). Communities are then identified according to the discovered topics of interest. The RBM is a flexible model for complex data. However, using RBMs for high-dimensional multinomial observations poses significant computational difficulties. In order to overcome these difficulties, a Boltzmann Machine with two hidden layers is used during learning. Thus, our proposed model adds another layer of hidden units on top of the first hidden layer with bipartite, undirected connections. The new connections come with a new set of weights that improves accuracy, as the experiments show.

In order to illustrate the effectiveness of our model, a set of experiments on a real microblog dataset is conducted. Microblogging has become a widespread social networking service whose user population has grown rapidly in the past few years. Experiments using a Twitter dataset show that the proposed multi-layer RBM outperforms other state-of-the-art clustering approaches such as K-means, as well as a single-layer RBM. Closeness between members of discovered