An RDF Dataset Generator for the Social Network Benchmark with Real-World Coherence

Mirko Spasić 1,2, Milos Jovanovik 1,3, and Arnau Prat-Pérez 4

1 OpenLink Software, UK
2 Faculty of Mathematics, University of Belgrade, Serbia
3 Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Macedonia
4 Sparsity Technologies, Spain
{mspasic, mjovanovik}@openlinksw.com, arnau@sparsity-technologies.com

Abstract. Synthetic datasets used in benchmarking need to mimic all characteristics of real-world datasets in order to provide realistic benchmarking results. Synthetic RDF datasets usually show a significant discrepancy in the level of structuredness compared to real-world RDF datasets. This structural difference is important, as it directly affects storage, indexing and querying. In this paper, we show that the synthetic RDF dataset used in the Social Network Benchmark is characterized by high structuredness, and we therefore introduce modifications to the data generator so that it produces an RDF dataset with real-world structuredness.

Keywords: Data Generation, Social Network Benchmark, Synthetic Data, Linked Data, Big Data, RDF, Benchmarks.

1 Introduction

Linked Data and RDF benchmarking require the use of real-world or synthetic RDF datasets [2]. For better benchmarking, synthetic RDF datasets need to comply with the general characteristics observed in their real-world counterparts, such as the schema, structure, size and distributions. As the authors of [3] show, synthetic RDF datasets used in benchmarking exhibit a significant structural difference from real datasets: real-world RDF datasets are less coherent, i.e. they have a lower degree of structuredness than synthetic RDF datasets. This structural difference is important, as it has direct consequences for the storage of the data, as well as for the ways the data are indexed and queried.
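To make the notion of structuredness concrete, the following is a minimal sketch of a coherence computation over a small in-memory set of triples. It assumes the coverage and weighted-coherence definitions of [3] as we understand them: the coverage of a type is the fraction of (instance, property) slots that are actually filled, and the coherence of a dataset is the sum of the type coverages, each weighted by the number of properties plus the number of instances of the type. The function name `coherence`, the `rdf:type` string constant and the sample data are illustrative assumptions; this is not the Virtuoso procedure described in this paper.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed shorthand for the rdf:type predicate


def coherence(triples):
    """Return (coverage per type, weighted coherence) for (s, p, o) triples."""
    triples = list(triples)

    # Map every typed subject to the set of its types.
    types_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of[s].add(o)

    # type -> set of instances of that type
    instances = defaultdict(set)
    for s, ts in types_of.items():
        for t in ts:
            instances[t].add(s)

    # type -> property -> instances that set the property at least once
    setters = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for t in types_of.get(s, ()):
            setters[t][p].add(s)

    coverage, size = {}, {}
    for t, insts in instances.items():
        props = setters[t]
        if not props:
            continue  # a type without properties is skipped in this sketch
        filled = sum(len(v) for v in props.values())
        coverage[t] = filled / (len(props) * len(insts))
        size[t] = len(props) + len(insts)

    total = sum(size.values())
    ch = sum(size[t] / total * coverage[t] for t in coverage) if total else 0.0
    return coverage, ch


sample = [
    ("a", RDF_TYPE, "Person"), ("b", RDF_TYPE, "Person"),
    ("a", "name", "A"), ("b", "name", "B"),
    ("a", "email", "a@example.org"),  # only one of two persons has an email
]
cov, ch = coherence(sample)
print(cov, ch)
```

In the sample, the type `Person` has two instances and two properties, and three of the four possible slots are filled, so its coverage is 0.75. A dataset in which every instance sets every property of its type would reach a coherence of 1, which is the situation typical of highly structured synthetic data.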
In order to create the basis for more realistic RDF benchmarking, we modify the existing RDF data generator for the Social Network Benchmark so that the resulting synthetic dataset follows the structuredness observed in real-world RDF datasets. Additionally, we implement the coherence measurement for RDF datasets as a Virtuoso procedure, to simplify the process of measuring the structuredness of any given RDF dataset by using the RDF graph stored in the quad store, instead of using RDF files.

* The presented work was funded by the H2020 project HOBBIT (#688227).