Social Media Data Integration for Community Detection

Jiliang Tang, Xufei Wang and Huan Liu

Computer Science & Engineering, Arizona State University, Tempe, AZ 85281
{Jiliang.Tang, Xufei.Wang, Huan.Liu}@asu.edu

Abstract. Community detection is an unsupervised learning task that discovers groups such that group members share more similarities or interact more frequently among themselves than with people outside the groups. In social media, link information can reveal heterogeneous relationships of various strengths, but it is often noisy. Different sources of data in social media can provide complementary information: for example, bookmarking and tagging data indicate user interests, and the frequency of commenting suggests the strength of ties. We therefore propose to integrate social media data of multiple types to improve the performance of community detection, and we present a joint optimization framework for doing so. Empirical evaluation on both synthetic data and real-world social media data shows significant performance improvement of the proposed approach. This work elaborates the need for and the challenges of multi-source integration of heterogeneous data types, and provides a principled approach to multi-source community detection.

Keywords: Community Detection, Multi-source Integration, Social Media Data

1 Introduction

Social media is quickly becoming an integral part of our lives. Facebook, one of the most popular social media websites, has more than 500 million users, and more than 30 billion pieces of content are shared on it each month¹. YouTube attracts 2 billion video views per day². Social media users can engage in various online social activities, e.g., forming connections, updating their status, and sharing stories and movies that interest them. The pervasive use of social media offers opportunities for research on group behavior.
¹ http://www.facebook.com/press/info.php?statistics
² http://mashable.com/2010/05/17/youtube-2-billion-views/

One fundamental problem is to identify groups among individuals when the group information is not explicitly available [1]. A group (or a community) can be considered as a set of users who interact more frequently or share more similarities among themselves than with those outside the group. This topic has many applications such as relational learning, behavior modeling and