Intelligent Data Analysis 22 (2018) 261–295
DOI 10.3233/IDA-163319
IOS Press

List sampling for large graphs

Muhammad Irfan Yousuf (a,b) and Suhyun Kim (a,b,*)
(a) Human Computer Interaction and Robotics, University of Science and Technology, Daejeon, Korea
(b) Imaging Media Research Center, Korea Institute of Science and Technology, Seoul, Korea

Abstract. Real world graphs are massive in size and often prohibitively expensive to analyze. One possible solution is sampling: extracting a small subgraph that faithfully represents the original graph. Prior research has developed several sampling methods, but the samples they produce fail to match important properties of the original graph and preserve its topology poorly. We observed that the existing methods do not explore the neighborhood of sampled nodes fairly and hence yield suboptimal samples. In this paper, we introduce a novel approach in which we keep a list of candidate nodes that is populated with all the neighbors of the nodes sampled so far. With this approach, we can balance the depth and breadth of graph exploration to produce better samples. We evaluate the effectiveness of our approach on several real world datasets and show that it surpasses the existing state-of-the-art approaches in maintaining the properties of the original graph and retaining its structure. We also calculate the Kolmogorov-Smirnov Distance and the Jensen-Shannon Distance for a quantitative evaluation of our approach.

Keywords: Graph sampling, big graphs, social network analysis

1. Introduction

Networks are all around us: technological networks (e.g., the Internet, telephone networks), social networks (e.g., Facebook, Twitter), information networks (e.g., the World Wide Web, citation networks), biological networks (e.g., biochemical networks, food webs) and many more.
Network graphs, or simply graphs, are a formal representation of these networks and give us a structural model with which to analyze and understand a network's properties. By analyzing graphs, we extract valuable information that can inform both business-oriented and technology-oriented decisions. Graph analysis would be simple if graphs were small, but real world graphs are too massive to manage and analyze efficiently. They can have billions of nodes and edges, and fully analyzing them demands so much memory, computational power and processing time that studying such enormous graphs becomes prohibitively expensive.

(*) Corresponding author: Suhyun Kim, Imaging Media Research Center, Korea Institute of Science and Technology, Seoul 02792, Korea. Tel.: +82 2 958 5114; Fax: +82 2 958 5769; E-mail: suhyun_kim@kist.re.kr.
1088-467X/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved

Sampling a small subgraph from a large graph is one possible solution, provided that the subgraph is a good representation of the large graph. The sampled subgraph should maintain the properties of the graph (e.g., degree, clustering coefficient, path length) and match the distributions of these properties as measured in the original graph. Given a large graph $G = (V, E)$, a sampling method selects a subset of nodes ($V_s \subset V$) and edges ($E_s \subset E$) to form a subgraph $G_s = (V_s, E_s)$. In accordance with previous work [1,2], we assume that a good representative subgraph $G_s$ of graph $G$ is one