Extracting Social Networks from Instant Messaging Populations

John Resig, Santosh Dawara, Christopher M. Homan, and Ankur Teredesai
Center for Discovery Informatics, Laboratory for Applied Computing, Rochester Institute of Technology
Rochester, NY 14623, USA
jer5513,sgd9494,cmh,amt@cs.rit.edu

ABSTRACT
In the analysis of large-scale social networks, a central problem is discovering how the members of the network under analysis are related. Instant messaging (IM) is a popular and relatively new form of social interaction. In this paper we study IM communities as social networks. An obvious barrier to such a study is that there is no de facto measure of how closely any pair of members of such a community is associated, and hence no standard way to describe the link information. We introduce several such measures in this paper. The proposed measures are obtained solely from the status logs of IM users. The status log of an IM user is a list of pairs of the form (time, state), where state is an element of a small set, such as {online, offline, busy, away}, and time is the time at which the user switched into that state. Resig et al. show [12] that, in spite of their simplicity, status logs contain a great deal of structure. Since two IM users can instant message each other only if both are online at the same time, it seems reasonable to guess that two IM users who are frequently online at the same time may in fact be frequently instant messaging each other. This hypothesis forms the basis of each of our association measures. For a chosen population of IM users, we compare the social networks obtained using our relationship measures to the social network formed in LiveJournal (www.livejournal.com) by the same population. LiveJournal is a blogging community that allows users to explicitly name other LiveJournal users as associates.
The network obtained from these association lists thus acts as a control of sorts for validating our IM-based association measures.

LinkKDD’04, August 22, 2004, Seattle, Washington, USA.

1. INTRODUCTION
Instant messaging (IM) is a popular form of computer-based communication. By definition, IM is a communications service that enables its users to create a kind of private chat room with another individual, allowing communication in real time over the Internet, similar to a telephone conversation but (typically) using text rather than voice. The instant messaging system alerts its users whenever somebody on their private list is online. Users can then initiate a chat session with that particular individual [1]. IM technology lets users communicate across networks, in remote areas, and in a highly pervasive and ubiquitous manner. Industrial and governmental organizations are very interested in understanding the nature of the broad knowledge-sharing networks that exist within their organizations. IM communication is fast becoming a standard platform for such networks. Apart from the fundamental interest in knowing “who IM’s whom and how often?”, IM is also useful as a test bed for social network analysis.

From a data mining perspective, IM produces data at many levels of detail, ranging from state-change logs to text messages, and the data at each of these levels are rich in information. The problem of collecting, analyzing, and exploring these data has, until recently, received little attention. Even the right questions to ask of the data are not yet established, to say nothing of the algorithms required to answer those questions efficiently once they are posed. The IMSCAN framework is one attempt to formulate such questions and to develop solutions for them. In this paper we focus on the particular problem of how to extract and analyze social relationships between the users of an IM service using the IMSCAN framework.
A collection of such relationships between members of a population is called a social network. Social networks are widely studied, although they are often notoriously hard to analyze in any great depth. There are innumerable ways in which overlapping social networks can be derived from a population. Which network is derived depends primarily on the metrics used to determine the relationships. For instance, in a given group of people, any two people A and B could be considered related if A is the parent of B, or if A knows B on a first-name basis, or if A and B ever dined in the same restaurant during June 2004. Relations can be either bidirectional or unidirectional. They can also be weighted: for instance, we could declare that the degree to which A and B are related is the number of times during June 2004 that A and B dined in the same restaurant at the same time. In each case, however, when we talk of a social network, we usually intend for the relation defining the network to indicate the degree to which some kind of meaningful social relationship exists between the members of the network (in the case of non-weighted relations, the degree to which two members are related is either absolute or nonexistent).
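
To make the distinction between weighted and non-weighted relations concrete, the following Python sketch builds a small network from the dining example above. It is only an illustration: the record layout (person, place, time slot), the function names weighted_network and unweighted_network, the threshold parameter, and the sample data are all assumptions introduced here and are not part of the IMSCAN framework.

```python
from collections import Counter
from itertools import combinations


def weighted_network(visits):
    """Return a Counter mapping each unordered pair of people to the
    number of (place, time slot) visits they share.

    `visits` is an iterable of (person, place, slot) tuples; this record
    layout is a hypothetical one chosen for the example.
    """
    # Group the people who were present at each (place, time slot).
    present = {}
    for person, place, slot in visits:
        present.setdefault((place, slot), set()).add(person)

    # Every pair of people present together adds 1 to that pair's weight.
    # Pairs are stored in sorted order, so the relation is bidirectional.
    weights = Counter()
    for people in present.values():
        for a, b in combinations(sorted(people), 2):
            weights[(a, b)] += 1
    return weights


def unweighted_network(weights, threshold=1):
    """Collapse a weighted relation into a non-weighted one: a pair is
    related if and only if its weight is at least `threshold`."""
    return {pair for pair, w in weights.items() if w >= threshold}


# Made-up sample data for three people, A, B, and C.
visits = [
    ("A", "diner", "2004-06-01 evening"),
    ("B", "diner", "2004-06-01 evening"),
    ("C", "diner", "2004-06-01 evening"),
    ("A", "cafe", "2004-06-02 noon"),
    ("B", "cafe", "2004-06-02 noon"),
    ("C", "cafe", "2004-06-03 noon"),
]

weights = weighted_network(visits)
print(weights[("A", "B")])             # A and B dined together twice
print(unweighted_network(weights, 2))  # pairs related at threshold 2
```

In this sketch the weighted relation records how often each pair co-occurred, while the thresholded version keeps only a yes-or-no answer for each pair; choosing a different threshold, or a different kind of record, yields a different network over the same population.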