Behavioral Clusters in Dynamic Graphs

James Fairbanks*, Ramakrishnan Kannan, Haesun Park, David A. Bader

School of Computational Science and Engineering, Georgia Institute of Technology

Abstract

This paper contributes a method for combining sparse parallel graph algorithms with dense parallel linear algebra algorithms in order to understand dynamic graphs, including the temporal behavior of vertices. Our method is the first to cluster vertices in a dynamic graph based on arbitrary temporal behaviors. In order to successfully implement this method, we develop a feature-based pipeline for dynamic graphs and apply Nonnegative Matrix Factorization (NMF) to these features. We demonstrate these steps with a sample of the Twitter mentions graph as well as a CAIDA network traffic graph. We contribute and analyze a parallel NMF algorithm, presenting both theoretical and empirical studies of performance. This work can be leveraged by graph/network analysts to understand the temporal behavior cluster structure and segmentation structure of dynamic graphs.

Keywords: dynamic graph analysis, streaming, matrix factorization, nonnegative matrix factorization (NMF), behavioral clusters, low rank approximation

1. Introduction

There are many domains of data analysis that can be modeled with the graph abstraction. In particular, we are interested in social networks and internet connection networks. These networks are collections of interactions occurring in complex patterns. Analyzing these patterns is essential to leveraging the information contained in these networks. Because the most important networks are the networks that are in heavy use right now, methods to understand temporal patterns in dynamic networks are important. The availability of big data has driven an adoption of large scale statistical techniques, both classical and modern. These techniques are not immediately applicable to graph data, and this leaves analysts separated from their familiar software tools.
In order to connect graph analysis and statistical reasoning, we introduce vertex features which can be calculated efficiently and then analyzed using familiar large scale statistical software tools. This connection is bidirectional because statistical analysis of vertex features informs the computation of additional features. The observed difficulty of writing scalable parallel graph algorithms for scale-free and irregular graphs advises against writing inferential and mathematical code to analyze the graphs directly. In this paper we address this gap by first applying non-inferential graph code to generate vectorial data that is statistically well behaved, then applying a state-of-the-art vectorial technique to this data, which provides insight into the original graph. A representation of this framework is presented in Figure 1.

* Corresponding author.
Email addresses: james.fairbanks@gatech.edu (James Fairbanks), rkannan@gatech.edu (Ramakrishnan Kannan), hpark@cc.gatech.edu (Haesun Park), bader@cc.gatech.edu (David A. Bader)
Submitted to Parallel Computing, October 3, 2014.
© 2015. This manuscript version is made available under the Elsevier user license http://www.elsevier.com/open-access/userlicense/1.0/

In the massive streaming data analytics model [11], we view the graph of network events as an unending stream of new edge updates. For each interval of time, we have the static graph, which represents the previous state of the network, and a sequence of edge updates that represent the events since the previous