2016 IEEE International Conference on Big Data (Big Data)
978-1-4673-9005-7/16/$31.00 ©2016 IEEE
Persistent Cascades: Measuring Fundamental Communication
Structure in Social Networks
Steven Morse
Operations Research Center, MIT
Draper Laboratory
Cambridge, MA
stmorse@mit.edu
Marta C. González
Dept. Civil and Environmental Eng.
MIT
Cambridge, MA
martag@mit.edu
Natasha Markuzon
Draper Laboratory
Cambridge, MA
nmarkuzon@draper.com
Abstract—We define a new structural property of large-scale
communication networks: the persistent patterns of
communication among users. We term these patterns “persistent
cascades,” and claim they represent a strong estimate of actual
information spread. Using metrics of inexact tree matching, we
group these cascades into classes which we then argue represent
the communication structure of a local network. This differs from
existing work in that (1) we are focused on recurring patterns
among specific users, not abstract motifs (e.g. the prevalence
of triangles or other structures in the graph, regardless of
user), and (2) we allow for inexact matching (not necessarily
isomorphic graphs) to better account for the noisiness of human
communication patterns. Applying this analysis to three large mobile
phone record datasets, we find that these classes of cascades reveal
new insights about information spread and the influence of certain
users. For example, we find distinct groups of weekend
vs. workweek spreaders not evident in the standard aggregated
network. Finally, we create the communication network induced
by these persistent structures, and we show the effect this has
on measurements of centrality.
I. INTRODUCTION
A natural question to ask in the study of communication
in social networks is: do social networks exhibit a recurring
pattern of information spread? In this paper we propose
methods which indicate the answer may be yes. Specifically,
we present a method of extracting what appear to be the
underlying communication structures from the “noisy” information
available in large-scale datasets.
We focus our attention on mobile phone records, also termed
call detail records (CDRs), because they provide a unique
opportunity to study the large-scale, unfiltered communication
patterns of individuals among their friends. Unfortunately, this
breadth of knowledge (in time, space, and demographics)
comes at the expense of depth, since we have no information
about the purpose or content of communication, as we might
have in social media or email records. Our approach attempts to
solve this problem by finding persistent patterns that strongly
imply meaningful communication is taking place.
A. Related work
A standard approach to translate raw communication data
into a meaningful network is to aggregate user activity over
some time period T (e.g. a week or month) into a static
graph. For example, we can consider two users social contacts
(and assign them an edge) only if a call between them is
reciprocated, as in [21], and choose T so that the resulting
graph gives a stable representation of the network (see [11]).
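The aggregation step described above can be sketched as follows. This is a minimal illustration, not the procedure used in [21] or [11]: the function name `reciprocated_graph`, the `(caller, callee, timestamp)` event format, and the toy data are all assumptions made for the example.

```python
def reciprocated_graph(events, t_start, t_end):
    """Aggregate call events in [t_start, t_end) into a static graph.

    `events` is an iterable of (caller, callee, timestamp) tuples
    (a hypothetical format). An undirected edge {u, v} is kept only
    if u called v AND v called u within the window.
    """
    directed = set()
    for caller, callee, ts in events:
        # Ignore self-calls and events outside the aggregation window T.
        if t_start <= ts < t_end and caller != callee:
            directed.add((caller, callee))
    # Keep a pair only when the reverse direction was also observed.
    return {frozenset((u, v)) for (u, v) in directed if (v, u) in directed}


# Toy data: a<->b is reciprocated early; d only calls c back at t=50.
events = [
    ("a", "b", 1), ("b", "a", 2),
    ("a", "c", 3),
    ("c", "d", 4), ("d", "c", 50),
]
short_window = reciprocated_graph(events, 0, 10)
long_window = reciprocated_graph(events, 0, 100)
```

With the short window only the a-b pair survives; widening the window also admits c-d, which shows how the choice of T changes the resulting graph.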
An alternative approach is to include temporal knowledge,
an interpretation broadly called temporal networks ([9]), which
often improves our understanding of structure and community
both at an aggregate and individual level. For example, the
authors of [19] observe that the change in a user’s frequent
contacts over time adheres to an apparent upper bound, or social
capacity, that stays relatively constant for a user even as
his/her contacts evolve.
The temporal approach seems especially appropriate in
the study of information spread, which is by nature causal
and time-dependent. Strong properties of human interaction
have emerged by including temporal information. One such
property is “burstiness”: people tend to
communicate in short, active bursts followed by long periods
of inactivity. The tendency toward non-Poissonian, heavy-tailed
inter-event communication times has been observed in many
contexts (for example, [27] studies email virus propagation,
[4] mobile phone communication, and [8] both mobile phone
and email), and has been shown to slow diffusion dynamics ([7], [24])
except under certain conditions ([18]).
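To make the notion of burstiness concrete, the sketch below computes inter-event times for one user and summarizes them with the coefficient B = (sigma - mu) / (sigma + mu), where mu and sigma are the mean and standard deviation of the inter-event times. This coefficient is a commonly used summary statistic and is an assumption of this example, not a measure taken from this paper; the function names and toy timestamps are likewise hypothetical. B is -1 for perfectly regular events, near 0 for a Poisson process, and approaches 1 for highly bursty sequences.

```python
from statistics import mean, pstdev


def inter_event_times(timestamps):
    """Return gaps between consecutive events, after sorting by time."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def burstiness(timestamps):
    """Burstiness coefficient (sigma - mu) / (sigma + mu) of the gaps."""
    gaps = inter_event_times(timestamps)
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu = mean(gaps)
    sigma = pstdev(gaps)  # population standard deviation of the gaps
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


# Toy data: evenly spaced calls vs. short bursts separated by long gaps.
regular = [0, 10, 20, 30, 40]
bursty = [0, 1, 2, 3, 100, 101, 102, 103, 500]

b_regular = burstiness(regular)  # -1.0: all gaps are identical
b_bursty = burstiness(bursty)    # positive: gaps vary far more than the mean
```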
A critical problem in the study of information spread in
temporal networks is determining what information, if any, is
being spread during an observed communication event: is a given
call, email, or tweet random, social, or information-related? In
datasets like social media posts or email the answer is usually
obvious from the text content, for example by using Twitter
hashtags as in [13], [15], [6]. However, in data like CDRs, where we
only have the metadata of each event, a solution is not obvious.
The authors of [1] contrast the calling patterns immediately following
an emergency (bombing, earthquake) with the rest of the call
events, and find systematic differences in the timing and spread
of information. The implication is that we can be more confident
that “real” information spread is occurring after an emergency;
the contrast between these patterns and those we infer through
a standard aggregated approach therefore suggests that the
latter is an inaccurate estimate.
This type of “cascading” information spread (i.e., a
single user initiating a call to a few contacts, who then call
several more, and so on) is of great interest in answering
our question of the communication event’s purpose, since,
broadly, a cascade implies non-random, or causal, action (see