2016 IEEE International Conference on Big Data (Big Data)
978-1-4673-9005-7/16/$31.00 ©2016 IEEE

Persistent Cascades: Measuring Fundamental Communication Structure in Social Networks

Steven Morse
Operations Research Center, MIT
Draper Laboratory
Cambridge, MA
stmorse@mit.edu

Marta C. González
Dept. Civil and Environmental Eng.
MIT
Cambridge, MA
martag@mit.edu

Natasha Markuzon
Draper Laboratory
Cambridge, MA
nmarkuzon@draper.com

Abstract—We define a new structural property of large-scale communication networks consisting of the persistent patterns of communication among users. We term these patterns “persistent cascades,” and claim they represent a strong estimate of actual information spread. Using metrics of inexact tree matching, we group these cascades into classes which we then argue represent the communication structure of a local network. This differs from existing work in that (1) we are focused on recurring patterns among specific users, not abstract motifs (e.g. the prevalence of triangles or other structures in the graph, regardless of user), and (2) we allow for inexact matching (not necessarily isomorphic graphs) to better account for the noisiness of human communication patterns. We find that analysis of these classes of cascades reveals new insights about information spread and the influence of certain users, based on three large mobile phone record datasets. For example, we find distinct groups of weekend vs. workweek spreaders not evident in the standard aggregated network. Finally, we create the communication network induced by these persistent structures, and we show the effect this has on measurements of centrality.

I. INTRODUCTION

A natural question to ask in the study of communication in social networks is: do social networks exhibit a recurring pattern of information spread? In this paper we propose methods which indicate the answer may be yes.
Specifically, we present a method of extracting what appear to be the underlying communication structures from the “noisy” information available in large-scale datasets.

We focus our attention on mobile phone records, also termed call detail records (CDRs), because they provide a unique opportunity to study the large-scale, unfiltered communication patterns of individuals among their friends. Unfortunately, this breadth of knowledge (in time, space, and demographics) comes at the expense of depth, since we have no information about the purpose or content of communication as we might in social media or email records. Our approach attempts to solve this problem by finding persistent patterns that strongly imply meaningful communication is taking place.

A. Related work

A standard approach to translating raw communication data into a meaningful network is to aggregate user activity over some time period T (e.g. a week or month) into a static graph. For example, we can require that a call be reciprocated before considering two users social contacts (and assigning them an edge), as in [21], and choose T such that it gives a stable representation (see [11]).

An alternative approach is to include temporal knowledge, an interpretation broadly called temporal networks ([9]), which often improves our understanding of structure and community at both the aggregate and the individual level. For example, the authors of [19] observe that the change in a user’s frequent contacts over time adheres to an apparent upper bound, or social capacity, that stays relatively constant for a user even as his/her contacts evolve. The temporal approach seems especially appropriate in the study of information spread, which is by nature causal and time-dependent.

Strong properties of human interaction have emerged by including temporal information. One such property is “burstiness”: people tend to communicate in short, active bursts followed by long periods of inactivity.
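
As a rough illustration of how burstiness might be checked from event metadata alone, the sketch below computes a user's inter-event times and their coefficient of variation (standard deviation divided by mean). For a Poisson process the inter-event times are exponentially distributed and this ratio is close to 1, so values well above 1 suggest bursty behavior. This is a simple diagnostic chosen for illustration, not the method of the cited works; the function names and the synthetic timestamps are assumptions.

```python
import itertools
import random
import statistics


def inter_event_times(timestamps):
    """Gaps (in seconds) between consecutive events of one user."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def coefficient_of_variation(gaps):
    """Population standard deviation of the gaps divided by their mean."""
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


# Hypothetical Poisson-like user: exponential gaps with a one-hour mean.
rng = random.Random(0)
poisson_gaps = [rng.expovariate(1 / 3600) for _ in range(5000)]
poisson_times = list(itertools.accumulate(poisson_gaps))

# Hypothetical bursty user: five calls one minute apart, once per day.
bursty_times = [day * 86400 + k * 60 for day in range(200) for k in range(5)]

cv_poisson = coefficient_of_variation(inter_event_times(poisson_times))
cv_bursty = coefficient_of_variation(inter_event_times(bursty_times))

print(f"Poisson-like user CV: {cv_poisson:.2f}")
print(f"Bursty user CV:       {cv_bursty:.2f}")
```

The coefficient of variation only summarizes the spread of the gaps; it does not by itself establish a heavy-tailed distribution, which would require inspecting the full inter-event time distribution.
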
The tendency toward non-Poissonian, heavy-tailed inter-event communication times has been observed in many contexts (for example, [27] studies email virus propagation, [4] mobile phone communication, and [8] both mobile phone and email), and has been shown to slow diffusion dynamics ([7], [24]) except under certain conditions ([18]).

A critical question in the study of information spread in temporal networks is determining what information, if any, is being spread during an observed communication event: is this call/email/tweet random, social, information-related, etc.? In datasets like social media posts or email, the answer is usually obvious from the text content, for example by using Twitter hashtags as in [13], [15], [6]. However, in data like CDRs, where we only have the metadata of each event, a solution is not obvious. The authors of [1] contrast the calling patterns immediately following an emergency (bombing, earthquake) with the rest of the call events, and find systematic differences in the timing and spread of information. The implication is that we can be more confident that “real” information spread is occurring following an emergency, and therefore the contrast between the patterns of this spread and those we infer through a standard aggregated approach indicates the latter is an inaccurate estimate.

This type of “cascading” information spread (i.e., a single user initiating a call to a few contacts, who then call several more, and so on) is of great interest in answering our question of the communication event’s purpose, since (broadly) a cascade implies non-random, or causal, action (see