GRAFT: An Efficient Graphlet Counting Method for Large Graph Analysis

Mahmudur Rahman, Mansurul Alam Bhuiyan, and Mohammad Al Hasan

Abstract: The majority of existing work on network analysis studies properties that are related to the global topology of a network. Examples of such properties include diameter, power-law exponent, and the spectra of the graph Laplacian. Such work enhances our understanding of real-life networks and enables us to generate synthetic graphs with real-life graph properties. However, many existing problems on networks require the study of the local topological structures of a network, which have not received the attention they deserve. In this work, we use graphlet frequency distribution (GFD) as an analysis tool for understanding the variance of local topological structure in a network; we also show that it can help in comparing and characterizing real-life networks. The main bottleneck in obtaining GFD is the excessive computation cost of obtaining the frequency of each graphlet in a large network. To overcome this, we propose a simple yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to five vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 with a negligible counting error, which is, on average, less than 5 percent.

Index Terms: Graph mining, graphlet counting, GFD

1 INTRODUCTION

Structural analysis of networks is an important research task that has received due attention from researchers in various disciplines, such as social sciences [1], system sciences [2], and bioinformatics [3]. Such analyses have led to the discovery of various non-random properties in large, real-life networks; examples include scale-freeness [4], small diameter [5], graph densification with shrinking diameter [6], and spectral analysis [7].
Various graph generation models have also been devised for generating synthetic graphs with properties similar to those of real-life graphs. However, the majority of existing work considers only the global properties of a network, although some recent works [3], [8] have shown that the concentration of local topological structure is also important for network comparison and modeling.

In a large network, a sketch of the local structure can be obtained by collecting the topological context in which each of the nodes resides. However, finding this information is typically an expensive task, and the cost grows exponentially with the size of the local context. So, it is not surprising that only a few works have studied local topological structures extensively. Most notable among these is probably a series of works [3], [8] by N. Pržulj's group. In these works, the authors first find the position of each node in a graphlet-space (see footnote 1), which they use for solving various tasks that are related to biological networks. These tasks include comparing the structures of different biological networks [3], characterizing biological networks using graphlet degree distribution [3], and obtaining a structural-to-functional mapping for biological networks [8]. However, these works consider small networks, and the analysis methods proposed in them do not scale to large networks.

In this work, we consider the task of counting graphlets in a large network. The main motivation of our work is building a fingerprint, called graphlet frequency distribution (GFD). GFD is a vector that can be used to compare the frequencies of various graphlets when analyzing a large graph. Real-life networks are sparse, and in such networks the frequencies of larger-sized graphlets shrink in exponential proportion; hence, GFD uses a logarithmic scale for the frequency comparison so that the contribution of larger-sized graphlets is fairly accounted for.
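To make the idea of a log-scaled frequency vector concrete, the following is a minimal sketch and not the GRAFT algorithm. It counts the two 3-vertex graphlets (path and triangle) exactly by brute force on a toy graph and then builds a GFD-like vector. The function names `count_3_graphlets` and `gfd`, the toy edge list, and the choice to normalize each count by the total before taking the base-10 logarithm are all assumptions made for illustration; the text above states only that GFD uses a logarithmic scale.

```python
import math
from itertools import combinations


def count_3_graphlets(edges):
    """Exactly count induced 3-vertex graphlets (hypothetical helper).

    Brute force over all vertex triples; only suitable for tiny graphs.
    """
    # Build an undirected adjacency map: vertex -> set of neighbors.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):
        # Number of edges among the three chosen vertices.
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 2:
            counts["path"] += 1      # two edges on three vertices: induced path
        elif k == 3:
            counts["triangle"] += 1  # three edges: triangle
        # k < 2 means the induced subgraph is disconnected, so not a graphlet.
    return counts


def gfd(counts):
    """Log-scaled relative frequencies (assumed normalization: count / total).

    Graphlets with zero count are omitted because log(0) is undefined.
    """
    total = sum(counts.values())
    return {name: math.log10(c / total) for name, c in counts.items() if c > 0}


if __name__ == "__main__":
    # Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
    toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    counts = count_3_graphlets(toy_edges)
    print(counts)
    print(gfd(counts))
```

The brute-force counter examines every vertex triple, so its cost grows cubically with the number of vertices even for 3-vertex graphlets; this is the kind of cost that motivates an approximate method for large graphs.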
In constructing GFD, we limit the counting task to graphlets that have up to five vertices (shown in Fig. 1). For a justification of the restriction on graphlet size, we refer the reader to Table 1; data from this table show that the number of possible (undirected) graphlets grows exponentially (see footnote 2) with the number of vertices. The table also shows that there are 112 graphlets with six vertices in comparison to 21 graphlets with five vertices. With that many choices, the frequencies of all (except the line graphlet and a few tree

Footnotes:

1. A graphlet is a small, connected, non-isomorphic induced subgraph of the given network. We provide a formal definition of graphlet in a subsequent section. Note also that in [9] the term "graphlet" is used to describe a wavelet decomposition of graphs; our work is not related to that definition.

2. This growth is larger than the growth of a Fibonacci sequence.

The authors are with the Department of Computer and Info. Science, Indiana University–Purdue University (IUPUI), 723 W. Michigan St., Indianapolis, IN 46202. E-mail: {mmrahman, mbhuiyan}@iupui.edu, alhasan@cs.iupui.edu.

Manuscript received 26 May 2013; revised 15 Nov. 2013; accepted 11 Dec. 2013. Date of publication 8 Jan. 2014; date of current version 29 Aug. 2014. Recommended for acceptance by A. Singh. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2013.2297929

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 10, OCTOBER 2014, p. 2466. 1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.