A New Algorithmic Model for Graph Analysis of Streaming Data

Chunxing Yin
Georgia Institute of Technology
cyin9@gatech.edu

Jason Riedy
Georgia Institute of Technology
jason.riedy@cc.gatech.edu

David A. Bader
Georgia Institute of Technology
bader@cc.gatech.edu

Abstract

The constant and massive influx of new data into analysis systems needs to be addressed without assuming we can pause the onslaught. Here we consider one aspect: non-stop graph analysis of streaming data. We formalize a new and practical algorithm model that includes both single-run analysis as well as efficiently updating analysis results only around changed data. In our model, a massive graph undergoes changes from an input stream of edge insertions and removals. These changes occur concurrently with analysis. Algorithms do not pause or stop the input stream. Assuming basic data access safety, we consider an algorithm valid for our model if the output is correct for a graph consisting of the initial graph and some implicit subset of concurrent changes. Our technical contributions include 1. the first formal model for graph analysis with concurrent changes, 2. properties of the model including how our model is the strongest possible without point-in-time graph views, 3. demonstrations of our model on connected components and PageRank, and 4. an extension to updating results incrementally.

CCS Concepts • Theory of computation → Dynamic graph algorithms; • Information systems → Network data models; Data streams;

ACM Reference Format:
Chunxing Yin, Jason Riedy, and David A. Bader. 2018. A New Algorithmic Model for Graph Analysis of Streaming Data. In Proceedings of 14th International Workshop on Mining and Learning with Graphs (MLG’18). ACM, New York, NY, USA, 8 pages. https://doi.org/10.nnnn/nnnnnnn.nnnnnnn

1 Introduction

Applications in fields like computer network security and social media analyze an ever-changing environment.
The data is rich in relationships and lends itself to graph analysis. Network security applications analyze nearly one million events per second [21] to shut down threats immediately. Social networks use the relationships in over 140 thousand “tweets” per second [20] to over 510 thousand comments per second [23] to find the best advertisements for one’s current needs or desires.

MLG’18, 20 August, 2018, London, United Kingdom
2018. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.nnnn/nnnnnnn.nnnnnnn

There are many computational models and software frameworks that address updating graph analysis results given a starting result and point-in-time or snapshot views of the changing graph; see Section 3 and our STINGER [8] framework. To our knowledge, none of these address computing that initial, starting analysis result without stopping the world to provide an initial snapshot view. On graphs with billions of vertices and tens to hundreds of billions of edges, providing a snapshot view for the initialization imposes a large performance cost on the entire analysis system, affecting not only the initializing query but also all concurrently running analyses. For example, STINGER can ingest ten million graph updates per second [8]. The updating kernels peak at hundreds of thousands [7] to around a million [13] updates per second. Initializing those kernels requires time proportional to the graph size [14] and not the update size. Together, these limit the peak performance of the system as a whole.

We address the initialization problem by presenting the first formal model for graph analysis on streaming data in which algorithms run concurrently with graph updates (Section 4). Our model also applies to updating analysis results (Section 8). Analysis updates occur concurrently with changes, permitting many more simultaneous analysis clients on a single massive graph. This is an extreme model for extreme rates that assumes only memory consistency.
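The validity notion stated in the abstract (output correct for the initial graph plus some implicit subset of concurrent changes) can be illustrated on a toy traversal. The following is a minimal sketch under several assumptions, not the formal definition of Section 4: it is single-threaded, it handles insertions only, and it imitates concurrency with a hypothetical `schedule` that injects edges between vertex visits. All function names (`reachable`, `witnesses`, `add_edges`, `copy_graph`) are illustrative, and the brute-force subset search is exponential, meant only for tiny examples.

```python
from collections import deque
from itertools import combinations


def add_edges(adj, edges):
    """Insert undirected edges into the adjacency dict `adj` in place."""
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)


def copy_graph(adj):
    """Copy the adjacency dict so the original is never mutated."""
    return {v: set(ns) for v, ns in adj.items()}


def reachable(adj, source, schedule=None):
    """BFS from `source`; returns the set of reached vertices.

    `schedule` (hypothetical) maps the number of vertices visited so far to
    edges inserted into `adj` at that moment, imitating a stream that keeps
    changing the graph while the traversal runs.
    """
    schedule = schedule or {}
    seen, queue, visited = {source}, deque([source]), 0
    while queue:
        add_edges(adj, schedule.get(visited, ()))
        v = queue.popleft()
        visited += 1
        for w in sorted(adj.get(v, ())):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def witnesses(initial, source, changes, observed):
    """All subsets of `changes` whose static BFS result equals `observed`."""
    found = []
    for k in range(len(changes) + 1):
        for subset in combinations(changes, k):
            g = copy_graph(initial)
            add_edges(g, subset)
            if reachable(g, source) == observed:
                found.append(subset)
    return found


# Path 0-1-2 plus two isolated vertices; two edges arrive mid-traversal.
initial = {0: {1}, 1: {0, 2}, 2: {1}, 3: set(), 4: set()}
stream = {1: [(0, 3)], 2: [(2, 4)]}
changes = [(0, 3), (2, 4)]

observed = reachable(copy_graph(initial), 0, stream)
print(sorted(observed))                          # [0, 1, 2, 4]
print(witnesses(initial, 0, changes, observed))  # [((2, 4),)]
```

In this run, edge (0, 3) arrives after vertex 0 has already been expanded, so the traversal never sees it, while edge (2, 4) arrives in time to be followed. The output matches neither the starting graph nor the final graph, yet it is exactly the reachable set of the initial graph plus the subset {(2, 4)}, which is the sense in which such a result counts as valid.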
We are interested in possibilities without fine-grained locking or versioning to enable far more simultaneous applications referencing the same massive graph store.

Not all algorithms are appropriate for our execution model (Section 2), but some core algorithms like breadth-first search work if we consider the algorithms as traversing a graph that consists of the starting graph plus some implicit subset of the concurrent graph changes. We consider this result valid for our model (not incorrect!) as defined formally in Section 4.

This paper’s contributions are as follows:

• We provide a formal algorithmic model for applying graph analysis algorithms to graphs being updated concurrently from a live stream (Section 4). Algorithms considered valid must not require large-scale copying of the graph or pausing the data stream.
• We prove multiple properties of our model. Invalid algorithms can be demonstrated with a single change (Corollary 5.5), and algorithms that produce subgraphs of their inputs (e.g. tree construction) cannot be proven