QLever: A Query Engine for Efficient SPARQL+Text Search

Hannah Bast
University of Freiburg
79110 Freiburg, Germany
bast@cs.uni-freiburg.de

Björn Buchhold
University of Freiburg
79110 Freiburg, Germany
buchhold@cs.uni-freiburg.de

ABSTRACT
We present QLever, a query engine for efficient combined search on a knowledge base and a text corpus, in which named entities from the knowledge base have been identified (that is, recognized and disambiguated). The query language is SPARQL extended by two QLever-specific predicates, ql:contains-entity and ql:contains-word, which can express the occurrence of an entity or word (the object of the predicate) in a text record (the subject of the predicate). We evaluate QLever on two large datasets, including FACC (the ClueWeb12 corpus linked to Freebase). We compare against three state-of-the-art query engines for knowledge bases with varying support for text search: RDF-3X, Virtuoso, and Broccoli. Query times are competitive and often faster on the pure SPARQL queries, and several orders of magnitude faster on the SPARQL+Text queries. Index size is larger for pure SPARQL queries, but smaller for SPARQL+Text queries.

CCS CONCEPTS
Information systems → Database query processing; Query planning; Search engine indexing; Retrieval efficiency

KEYWORDS
SPARQL+Text; Efficiency; Indexing

1 INTRODUCTION
This paper is about efficient search in a knowledge base combined with text. For the purpose of this paper, a knowledge base is a collection of subject-predicate-object triples, where consistent identifiers are used for the same entities. For example, here are three triples from Freebase¹, the world's largest open general-purpose knowledge base, which we also use in our experiments:

    <Neil Armstrong> <is-a> <Astronaut>
    <Neil Armstrong> <nationality> <American>
    <Neil Armstrong> <books-written> "First on the moon"

A knowledge base enables queries that express the search intent precisely.
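As a rough illustration of the triple model described above, the following Python sketch stores the three example triples and looks them up by pattern. The names `Triple`, `KB`, and `match` are hypothetical and chosen only for this illustration; the linear scan is an assumption made for brevity and says nothing about how QLever or any other engine indexes its data.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One subject-predicate-object statement of a knowledge base."""
    subject: str
    predicate: str
    obj: str


# The three example triples from the text (Freebase Easy style names).
KB = [
    Triple("<Neil Armstrong>", "<is-a>", "<Astronaut>"),
    Triple("<Neil Armstrong>", "<nationality>", "<American>"),
    Triple("<Neil Armstrong>", "<books-written>", '"First on the moon"'),
]


def match(
    kb: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given fields.

    A field left as None acts as a wildcard, much like a variable
    in a query pattern. This is a plain linear scan (illustration only).
    """
    for t in kb:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


if __name__ == "__main__":
    # All entities that are astronauts.
    for t in match(KB, predicate="<is-a>", obj="<Astronaut>"):
        print(t.subject)
```

Because the same identifier `<Neil Armstrong>` is used in all three triples, a lookup by subject returns every fact known about that entity, which is what makes precise queries possible.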
For example, using SPARQL (the de facto standard query language for knowledge bases), we can easily search for all astronauts and their nationalities as follows:

    SELECT ?x ?y WHERE {
      ?x <is-a> <Astronaut> .
      ?x <nationality> ?y
    }
    ORDER BY ASC(?x)
    LIMIT 100

The result is a flat list of tuples ?x ?y, where ?x is an astronaut and ?y their nationality. The ORDER BY ASC(?x) clause causes the results to be listed in ascending (lexicographic) order. The LIMIT 100 clause limits the result to the first 100 tuples. Note that if an astronaut has k nationalities, they would contribute k tuples to the result. Also note that the triples in the body of the query can contain variables which are not specified as an argument of the SELECT operator and which are hence not shown in the result.

¹ In our examples, we actually use Freebase Easy [4], a sanitized version of Freebase with human-readable entity names. In the original Freebase, entity identifiers are alphanumeric, and human-readable names are available via an explicit name predicate.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM'17, November 6–10, 2017, Singapore.
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4918-5/17/11...$15.00
DOI: https://doi.org/10.1145/3132847.3132921

Keyword search in object strings. Knowledge bases can have arbitrary string literals as objects.
See the third triple in the example above, where the object names the title of a book. SPARQL allows regular expression matches for such literals. Commercial SPARQL engines also offer keyword search in literals. For Virtuoso (described in Section 2), this is realized via a special predicate bif:contains, where the bif prefix stands for built-in function. For example, the following query searches for astronauts who have written a book with the words first and moon in the title:

    SELECT ?x ?y WHERE {
      ?x <is-a> <Astronaut> .
      ?x <books-written> ?w .
      ?w bif:contains "first AND moon"
    }

Fully combined SPARQL+Text search. For our query engine, we consider the following deeper integration of a knowledge base with text. We assume that the text is given as a separate corpus, and that named entity recognition and disambiguation (of the entities from the knowledge base in the text) has been performed. That is, each mention of an entity of the knowledge base in the text has been annotated with the unique ID of that entity in the knowledge base. For example, the well-known FACC [13] dataset (which we also use in our experiments in Section 5) provides such an annotation of the ClueWeb12 corpus with the entities from Freebase. Here is an example sentence from ClueWeb12 with one recognized entity from Freebase (note how the entity is not necessarily referred to with its full name in the text):

    On July 20, 1969, Armstrong<Neil Armstrong> became the first human being to walk on the moon

With a knowledge base and a text corpus linked in this way, queries of the following kind are possible: