ExactSim: Benchmarking Single-Source SimRank Algorithms with High-Precision Ground Truths

Hanzhi Wang · Zhewei Wei · Yu Liu · Ye Yuan · Xiaoyong Du · Ji-Rong Wen

Zhewei Wei is the corresponding author.

Hanzhi Wang
School of Information, Renmin University of China, China
E-mail: hanzhi_wang@ruc.edu.cn

Zhewei Wei
Gaoling School of Artificial Intelligence, Renmin University of China, China
E-mail: zhewei@ruc.edu.cn

Yu Liu
Wangxuan Institute of Computer Technology, Peking University, China
E-mail: dokiliu@pku.edu.cn

Ye Yuan
School of Computer Science and Technology, Beijing Institute of Technology, China
E-mail: yuan-ye@bit.edu.cn

Xiaoyong Du
MOE Key Lab DEKE, Renmin University of China, China
E-mail: duyong@ruc.edu.cn

Ji-Rong Wen
Beijing Key Lab of Big Data Management and Analysis Method, Renmin University of China, China
E-mail: jrwen@ruc.edu.cn

Abstract SimRank is a popular measure for evaluating node-to-node similarities based on graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, the Power Method, is computationally infeasible on graphs with more than $10^6$ nodes. Consequently, no existing work has evaluated the actual accuracy of various single-source and top-k SimRank algorithms on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs. This algorithm produces ground truths with precision up to 7 decimal places with high probability.
With the ground truths computed by ExactSim, we present the first experimental study of the accuracy/cost trade-offs of existing approximate SimRank algorithms on large real-world graphs and synthetic graphs. Finally, we use the ground truths to explore various properties of SimRank distributions on large graphs.

Keywords SimRank, Single-Source, Exact computation, Ground truths, Power-Law, Benchmarks

1 Introduction

Computing link-based similarity is an overarching problem in graph analysis and mining. Among the existing similarity measures [31, 40, 49, 48], SimRank has emerged as a popular metric for assessing structural similarities between nodes in a graph. SimRank was introduced by Jeh and Widom [13] to formalize the intuition that “two pages are similar if they are referenced by similar pages.” Given a directed graph $G = (V, E)$ with $n$ nodes $\{v_1, \ldots, v_n\}$ and $m$ edges, the SimRank matrix $S$ defines the similarity between any two nodes $v_i$ and $v_j$ as follows:

$$
S(i, j) =
\begin{cases}
1, & \text{for } i = j; \\
\displaystyle\sum_{v_{i'} \in I(v_i)} \sum_{v_{j'} \in I(v_j)} \frac{c \cdot S(i', j')}{d_{in}(v_i) \cdot d_{in}(v_j)}, & \text{for } i \neq j.
\end{cases}
\tag{1}
$$

Here, $c$ is a decay factor typically set to 0.6 or 0.8 [13, 26]. $I(v_i)$ denotes the set of in-neighbors of $v_i$, and $d_{in}(v_i)$ denotes the in-degree of $v_i$. SimRank aggregates similarities of multi-hop neighbors of $v_i$ and $v_j$ to