LEMMING – Example-based Mimicking of Knowledge Graphs

Michael Röder
DICE group, Department of Computer Science, Paderborn University, Germany; Institute for Applied Informatics, Leipzig, Germany

Pham Thuy Sy Nguyen, Felix Conrads and Ana Alexandra Morim da Silva
Department of Computer Science, Paderborn University, Germany

Axel-Cyrille Ngonga Ngomo
DICE group, Department of Computer Science, Paderborn University, Germany; Institute for Applied Informatics, Leipzig, Germany

Abstract—The size of knowledge graphs used in real applications grows constantly. Predicting the performance of storage solutions for knowledge graphs w.r.t. their query performance is hence of central importance for the practical use of said storage solutions. We address this challenge by learning graph invariants of a given graph. We then use these invariants to fuel a stochastic generation model that is able to generate graphs of arbitrary sizes similar to the input graph. We evaluate our graph generator, dubbed LEMMING, by comparing the performance of storage solutions on synthetic and real versions of three datasets of up to 3.4 × 10^6 triples. Our results suggest that the performance of storage solutions on synthetic data generated by LEMMING reflects their performance on real data in most of the cases. The source code of LEMMING is available at https://github.com/dice-group/Lemming.

I. INTRODUCTION

The size of the knowledge graphs (KGs) used in real applications grows constantly. For example, the Google Knowledge Graph grew from 3.5 × 10^9 facts to 18 × 10^9 facts in 7 months. The authors of [1] point out that comparable growth rates can be observed in the KGs of other large companies. The same phenomenon is also present in open datasets. For example, DBpedia crossed the mark of 23 × 10^9 triples in 2017^1 while it began with 0.1 × 10^9 triples in 2007 [2]. The ranking of corresponding storage solutions w.r.t.
their runtime performance (measured in query mixes per hour, QMpH for short) has been observed to change with the size of the KGs [3], [4]. For example, the authors of [3] report the performance of four triple stores on different versions of an RDF KG ranging from 10^5 to 10^6 triples. While BlazeGraph Free version 8.5 achieves the second-best performance on the smallest version of the KG, it achieves the worst performance on the subsequent version of the KG, which is merely five times larger. Similar insights can be derived from [4], where Jena TDB version 2.3.0 is ranked first among three triple stores and achieves the best QMpH performance on a 10% fragment of DBpedia version 2016-10, but is outperformed by Virtuoso version 7.0.0 on the full version of the same dataset and even achieves the worst QMpH performance in some high-load settings with 16 concurrent queries.

^1 https://wiki.dbpedia.org/develop/datasets/dbpedia-version-2016-10

Given that the performance of triple stores changes across dataset versions, there is a need to predict the future performance of storage solutions given existing versions of a dataset. Such a prediction can facilitate the deployment of reliable KG infrastructures, the timely acquisition and alteration of software components, and the maintenance of quality-of-service requirements, e.g., in terms of a minimal QMpH performance. In this work, we hence address the challenge of predicting the performance and ranking of storage solutions on a (future) version of a knowledge graph of size k (measured in numbers of nodes) given previous versions of the same dataset. This task differs from that addressed by current RDF generators, which assume a particular dataset or ontology (e.g., universities) and generate data based thereupon [5], [6], [7], [8], [9], [10].
We formalize the problem as follows: Given versions K_1, ..., K_ν of a KG (e.g., WikiData, DBpedia, MusicBrainz), we aim to learn a synthetic dataset generator K which allows the prediction of the performance and ranking of storage solutions (w.r.t. their performance measured in QMpH and in queries per second, QpS for short) on a version of K of size k. We use K(k) to denote the KG of size k generated by K. To learn K, we use training data in the form of K = {K_1, ..., K_ν} to learn graph-specific invariants, which we define as functions g with a low variance on K and a high variance on other sets of graphs disjoint from K. We learn these functions using a refinement operator ρ for arithmetic functions. We show that our operator is finite, redundant and complete. Our experiments show that our approach is able to learn generators K which generate datasets with which the ranking on real datasets can be approximated with a root mean squared error on ranks under 0.15.

II. RELATED WORK

The generation of synthetic graphs that can mimic real-world graphs is an important field of research. Starting from the Erdős–Rényi model [11], several further models have been developed. The Watts–Strogatz model [12] is able to create random graphs with small-world properties. The Barabási–Albert model [13] is able to create scale-free graphs similar to the link graph of the World Wide Web. However, all these
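To make the three classic models concrete, the following is a minimal sketch (not part of LEMMING itself) that generates a graph with each model using the networkx library and computes a simple graph statistic, the average degree, over the results. The choice of networkx, the parameter values, and the use of average degree as an example invariant are assumptions for illustration only; LEMMING learns more expressive invariants via its refinement operator over arithmetic functions.

```python
import networkx as nx

n = 1000  # hypothetical number of nodes

# One instance of each random-graph model discussed above.
graphs = {
    "Erdos-Renyi": nx.erdos_renyi_graph(n, p=0.01, seed=42),
    "Watts-Strogatz": nx.watts_strogatz_graph(n, k=10, p=0.1, seed=42),
    "Barabasi-Albert": nx.barabasi_albert_graph(n, m=5, seed=42),
}

def avg_degree(g):
    # A very simple graph statistic; a learned invariant in the sense of
    # the paper would be an arithmetic function of such statistics that
    # varies little across the versions K_1, ..., K_ν of one KG.
    return sum(d for _, d in g.degree()) / g.number_of_nodes()

for name, g in graphs.items():
    print(f"{name}: {g.number_of_nodes()} nodes, "
          f"average degree {avg_degree(g):.2f}")
```

All three models are parameterized so that the expected average degree is close to 10; what distinguishes them is the degree distribution and clustering structure, which is why none of them alone suffices to mimic an arbitrary input KG.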