Queryable Compression on Streaming Social Networks

Michael Nelson & Sridhar Radhakrishnan
School of Computer Science, University of Oklahoma, Norman, OK, USA
{Michael.A.Nelson-1, sridhar}@ou.edu

Amlan Chatterjee
Department of Computer Science, California State University Dominguez Hills, Carson, CA, USA
achatterjee@csudh.edu

Chandra N. Sekharan
Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
chandra@cs.luc.edu

Abstract—The social networks of today are a set of massive, dynamically changing graph structures. Each of these graphs contains a set of nodes (individuals) and a set of edges among the nodes (relationships). The choice of representation of a graph determines what information is easy to obtain from it. However, many social network graphs are so large that even their basic representations (e.g. adjacency lists) do not fit in main memory. Hence an ongoing field of study has focused on designing compressed representations of graphs that facilitate certain query functions. This work is based on representing dynamic social networks that we call streaming graphs, where edges stream into our compressed representation. The crux of this work is the use of a novel data structure for streaming graphs, based on an indexed array of compressed binary trees, that builds the graph directly without using any temporary storage structures. We provide fast access methods for edge existence (does an edge exist between two nodes?), neighbor queries (list a node's neighbors), and streaming operations (add/remove nodes/edges). We test our algorithms on public, anonymized, massive graphs such as Friendster, LiveJournal, Pokec, Twitter, and others. Our empirical evaluation is based on several parameters, including time to compress, memory required by the compression algorithm, size of the compressed graph, and time to execute queries.
Our experimental results show that our current approach outperforms previous approaches in key respects such as compression time, compression memory, compression ratio, and query execution time, and is hence the best overall approach to date.

Index Terms—Graph Compression, Binary Tree, Online Social Networks, Streaming Graphs

I. INTRODUCTION

A social network can be represented as a graph G = (V, E), where V is a set of nodes (individuals) and E is a set of edges (relationships). Most social networks, like Facebook, are undirected, meaning that a relationship is automatically reciprocated. In contrast, a social network like Twitter is directed, as captured by its concept of 'following'. Knowledge learned from these graphs is clearly beneficial, as it may help to better coordinate events, suggest friends, target advertising, and recommend games.

Social networks are ever growing. For example, from December 2014 to March 2015, the number of daily active Facebook users grew from 890 million to 963 million [1]. Social networks are not limited to the number of people in the world, since entities such as companies and communities may create accounts as new nodes. Clearly, such large, streaming graphs present a challenge to social network analysis.

Many different queries may be run on social networks. When developing a queryable compression technique, the compressed structure is usually designed to be efficient for a specific set of queries [7]. The most popular of these are arguably community operations and the reachability query. The algorithms used to answer these queries make heavy use of the neighbor enumeration and edge existence operations. The same is true for many other classes of problems, including network pattern mining and friend suggestion.

Consider our Friendster snapshot with n = 65608366 nodes and m = 1806067135 edges. Using a boolean adjacency matrix representation, we get a size of 65608366² bits ≈ 538 TB.
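The adjacency-matrix figure quoted above is easy to verify with a few lines of arithmetic. The script below is only a sanity check of the quoted number, not part of the compression scheme itself:

```python
# Sanity-check the boolean adjacency-matrix size for the Friendster snapshot.
n = 65608366                      # nodes in the Friendster snapshot
matrix_bits = n * n               # one bit per (u, v) pair
matrix_tb = matrix_bits / 8 / 10**12
print(f"boolean adjacency matrix: {matrix_tb:.0f} TB")  # roughly 538 TB
```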
Assuming 64-bit pointers and an adjacency list representation, the memory needed can be estimated at about 41 gigabytes, which exceeds the typical RAM size of most computers. Queries such as neighbor enumeration and edge existence can be time-consuming in such high-memory environments. These queries have time complexities of O(n) and O(1), respectively, on the adjacency matrix, and O(σ(n)) on the adjacency lists, where σ(n) is the maximum degree of the graph. Moreover, if the structure does not fit in memory, it must make access calls to disk, which incur a high time penalty. Given this, our goal is to compress the graph to a size that fits in main memory while also providing mechanisms to perform neighbor and edge queries directly on the compressed structure. It is worth pointing out that our compression techniques can easily be extended to graphs represented in distributed memory.

Most raw, uncompressed graphs are downloaded from various sources as plain text files. These files are merely the graph in edge-list form; that is, each line consists of two numbers, u and v, separated by a space. A common requirement for most compression algorithms is an intermediate structure, such as an adjacency list, that is built from this edge list and used to efficiently construct the final compressed structure [10]. Since we build our compression incrementally, we do not require such an intermediate structure. To save space, the original edge-list text files are stored with common compression programs such as gzip. For large graphs like our Friendster graph, the intermediate structure alone requires at least 41 GB of memory just in the preprocessing stage.
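To illustrate how edges can stream from such an edge-list file directly into an in-memory structure, the sketch below uses a plain dictionary of sets as a simplified stand-in; it is not the indexed array of compressed binary trees described in this paper, only an expository example of the streaming build and of the edge-existence and neighbor-enumeration queries:

```python
# Illustrative sketch only: stream edges from edge-list lines ("u v" per line)
# directly into an in-memory adjacency structure, with no intermediate
# representation. A dict of sets stands in for the compressed structure.
from collections import defaultdict

def build_from_stream(lines, directed=False):
    adj = defaultdict(set)
    for line in lines:
        u, v = map(int, line.split())   # each line: two node ids, space-separated
        adj[u].add(v)
        if not directed:
            adj[v].add(u)               # undirected: store both directions
    return adj

def has_edge(adj, u, v):
    """Edge-existence query: does an edge exist between u and v?"""
    return v in adj[u]

def neighbors(adj, u):
    """Neighbor-enumeration query: list u's neighbors."""
    return sorted(adj[u])

g = build_from_stream(["1 2", "1 3", "2 3"])
print(has_edge(g, 2, 1))    # True (undirected, so 1-2 implies 2-1)
print(neighbors(g, 1))      # [2, 3]
```

A dict of sets answers both queries quickly but pays the full pointer overhead discussed above; the paper's contribution is retaining this query interface at a much smaller memory footprint.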