QLever: A Query Engine for Efficient SPARQL+Text Search
Hannah Bast
University of Freiburg
79110 Freiburg, Germany
bast@cs.uni-freiburg.de
Björn Buchhold
University of Freiburg
79110 Freiburg, Germany
buchhold@cs.uni-freiburg.de
ABSTRACT
We present QLever, a query engine for efficient combined search on
a knowledge base and a text corpus, in which named entities from
the knowledge base have been identified (that is, recognized and
disambiguated). The query language is SPARQL extended by two
QLever-specific predicates ql:contains-entity and ql:contains-word,
which can express the occurrence of an entity or word (the object of
the predicate) in a text record (the subject of the predicate). We eval-
uate QLever on two large datasets, including FACC (the ClueWeb12
corpus linked to Freebase). We compare against three state-of-the-
art query engines for knowledge bases with varying support for text
search: RDF-3X, Virtuoso, Broccoli. Query times are competitive
and often faster on the pure SPARQL queries, and several orders of
magnitude faster on the SPARQL+Text queries. Index size is larger
for pure SPARQL queries, but smaller for SPARQL+Text queries.
CCS CONCEPTS
• Information systems → Database query processing; Query
planning; Search engine indexing; Retrieval efficiency;
KEYWORDS
SPARQL+Text; Efficiency; Indexing
1 INTRODUCTION
This paper is about efficient search in a knowledge base combined
with text. For the purpose of this paper, a knowledge base is a
collection of subject-predicate-object triples, where consistent iden-
tifiers are used for the same entities. For example, here are three
triples from Freebase¹, the world’s largest open general-purpose
knowledge base, which we also use in our experiments:
<Neil Armstrong> <is-a> <Astronaut>
<Neil Armstrong> <nationality> <American>
<Neil Armstrong> <books-written> "First on the moon"
A knowledge base enables queries that express the search intent
precisely. For example, using SPARQL (the de facto standard query
¹ In our examples, we actually use Freebase Easy [4], a sanitized version of Freebase
with human-readable entity names. In the original Freebase, entity identifiers are
alphanumeric, and human-readable names are available via an explicit name predicate.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
CIKM’17, November 6–10, 2017, Singapore.
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4918-5/17/11. . . $15.00
DOI: https://doi.org/10.1145/3132847.3132921
language for knowledge bases), we can easily search for all astro-
nauts and their nationalities as follows:
SELECT ?x ?y WHERE {
?x <is-a> <Astronaut> .
?x <nationality> ?y
} ORDER BY ASC(?x) LIMIT 100
The result is a flat list of tuples ?x ?y, where ?x is an astronaut and ?y
their nationality. The ORDER BY ASC(?x) clause causes the results
to be listed in ascending (lexicographic) order. The LIMIT 100 clause
limits the result to the first 100 tuples. Note that if an astronaut has
k nationalities, they would contribute k tuples to the result. Also
note that the triples in the body of the query can contain variables
which are not specified as an argument of the SELECT operator
and which are hence not shown in the result.
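The semantics just described can be illustrated with a small in-memory sketch. The Python code below is not how QLever or any SPARQL engine evaluates queries; it is a naive join over a hand-made triple list. Only the Neil Armstrong triples come from the example above. The other triples, the placeholder entity <Example Person>, and the function name astronauts_with_nationality are assumptions made for illustration.

```python
# Toy knowledge base. Only the Neil Armstrong triples mirror the text;
# all other triples are illustrative assumptions.
TRIPLES = [
    ("<Neil Armstrong>", "<is-a>", "<Astronaut>"),
    ("<Neil Armstrong>", "<nationality>", "<American>"),
    ("<Neil Armstrong>", "<books-written>", '"First on the moon"'),
    ("<Buzz Aldrin>", "<is-a>", "<Astronaut>"),
    ("<Buzz Aldrin>", "<nationality>", "<American>"),
    ("<Example Person>", "<is-a>", "<Astronaut>"),
    ("<Example Person>", "<nationality>", "<Nationality A>"),
    ("<Example Person>", "<nationality>", "<Nationality B>"),
]


def astronauts_with_nationality(triples, limit=100):
    """Naive evaluation of the two-triple astronaut query (hypothetical helper)."""
    # ?x <is-a> <Astronaut>
    astronauts = {s for s, p, o in triples if p == "<is-a>" and o == "<Astronaut>"}
    # Join with ?x <nationality> ?y: one result tuple per matching triple.
    rows = [(s, o) for s, p, o in triples if p == "<nationality>" and s in astronauts]
    # ORDER BY ASC(?x): ascending lexicographic order on the first column.
    rows.sort(key=lambda row: row[0])
    # LIMIT: keep only the first `limit` tuples.
    return rows[:limit]


if __name__ == "__main__":
    for x, y in astronauts_with_nationality(TRIPLES):
        print(x, y)
```

In this toy data, <Example Person> has two nationalities and therefore contributes two tuples to the result, which is the k-nationalities behaviour noted above.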
Keyword search in object strings. Knowledge bases can have
arbitrary string literals as objects. See the third triple in the ex-
ample above, where the object names the title of a book. SPARQL
allows regular expression matches for such literals. Commercial
SPARQL engines also offer keyword search in literals. For Virtu-
oso (described in Section 2), this is realized via a special predicate
bif:contains, where the bif prefix stands for built-in function. For
example, the following query searches for astronauts who have
written a book with the words first and moon in the title:
SELECT ?x ?w WHERE {
?x <is-a> <Astronaut> .
?x <books-written> ?w .
?w bif:contains "first AND moon"
}
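As a rough illustration of what such a keyword filter on literals computes, consider the following Python sketch. It does not reproduce Virtuoso's implementation of bif:contains. It simply tokenizes each literal and checks that all query words occur. The tokenization rule and the names contains_all_words and astronauts_with_title_words are assumptions, and the second astronaut and book title are invented for illustration.

```python
import re

# Toy triples; the second book title is an invented example.
TRIPLES = [
    ("<Neil Armstrong>", "<is-a>", "<Astronaut>"),
    ("<Neil Armstrong>", "<books-written>", '"First on the moon"'),
    ("<Example Person>", "<is-a>", "<Astronaut>"),
    ("<Example Person>", "<books-written>", '"A first flight"'),
]


def contains_all_words(literal, words):
    """True if every query word occurs as a token of the literal (case-insensitive)."""
    tokens = set(re.findall(r"\w+", literal.lower()))
    return all(word.lower() in tokens for word in words)


def astronauts_with_title_words(triples, words):
    """Astronauts and book titles whose title contains all given words."""
    astronauts = {s for s, p, o in triples if p == "<is-a>" and o == "<Astronaut>"}
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == "<books-written>" and s in astronauts and contains_all_words(o, words)
    )


if __name__ == "__main__":
    print(astronauts_with_title_words(TRIPLES, ["first", "moon"]))
```

Note that the keyword match here is confined to the literal attached to a single triple; it says nothing about text outside the knowledge base.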
Fully combined SPARQL+Text search. For our query engine,
we consider the following deeper integration of a knowledge base
with text. We assume that the text is given as a separate corpus, and
that named entity recognition and disambiguation (of the entities
from the knowledge base in the text) has been performed. That is,
each mention of an entity of the knowledge base in the text has
been annotated with the unique ID of that entity in the knowledge
base. For example, the well-known FACC [13] dataset (which we
also use in our experiments in Section 5) provides such an annota-
tion of the ClueWeb12 corpus with the entities from Freebase. Here
is an example sentence from ClueWeb12 with one recognized entity
from Freebase (note how the entity is not necessarily referred to
with its full name in the text):
On July 20, 1969, Armstrong<Neil Armstrong> became the first human
being to walk on the moon
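One way to picture such an annotated sentence is as a text record that carries both its words and the IDs of the entities mentioned in it. The Python sketch below is only a mental model under that assumption, not QLever's index layout. The class TextRecord and its methods are hypothetical names, chosen to mirror the ql:contains-word and ql:contains-entity predicates mentioned in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """A text passage plus the knowledge-base entities annotated in it (assumed model)."""

    text: str
    entities: frozenset  # IDs of knowledge-base entities mentioned in the text

    def contains_word(self, word):
        # Case-insensitive match on whitespace-separated tokens,
        # ignoring trailing commas and periods.
        tokens = {token.strip(".,").lower() for token in self.text.split()}
        return word.lower() in tokens

    def contains_entity(self, entity_id):
        # Entity matches use the annotation, not the surface form in the text.
        return entity_id in self.entities


record = TextRecord(
    text="On July 20, 1969, Armstrong became the first human being to walk on the moon",
    entities=frozenset({"<Neil Armstrong>"}),
)

if __name__ == "__main__":
    print(record.contains_word("moon"))
    print(record.contains_entity("<Neil Armstrong>"))
```

The record matches the entity <Neil Armstrong> although the word "Neil" never occurs in the text, which is the point of annotating mentions with entity IDs.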
With a knowledge base and a text corpus linked in this way, queries
of the following kind are possible: