Benchmarking OWL Reasoners

Jürgen Bock, FZI Research Center, Karlsruhe, Germany, bock@fzi.de
Peter Haase, Institute AIFB, University of Karlsruhe, 76131 Karlsruhe, Germany, haase@aifb.uni-karlsruhe.de
Qiu Ji, Institute AIFB, University of Karlsruhe, 76131 Karlsruhe, Germany, qiji@aifb.uni-karlsruhe.de
Raphael Volz, FZI Research Center, Karlsruhe, Germany, volz@fzi.de

ABSTRACT

The growing popularity of semantic applications makes scalability of ontology reasoning tasks increasingly important. In this work, we first analyze the ontology landscape on the web and identify typical clusters of expressivity. Second, we benchmark current ontology reasoners, using representative ontologies for each cluster and a comprehensive set of queries. We point out the applicability of specific reasoners to certain expressivity clusters.

1. INTRODUCTION

Semantic applications based on ontologies have become increasingly important in recent years. Yet, scalability remains one of the major obstacles to leveraging the full power of ontologies in practical applications. Reasoning with OWL ontologies has high worst-case complexity, but many large-scale applications use only fragments of OWL that are rather shallow in logical terms and do not require sophisticated reasoning algorithms. Surveying the landscape of existing ontologies, we observe a broad spectrum of ontologies that differ in terms of size, complexity, and their ratio between terminological and factual assertions. Keet and Rodríguez [8] point out that there is a high demand for either very expressive ontology languages to represent rather complete knowledge, or less expressive ontology languages, which are more tractable w.r.t. reasoning and other computational tasks. This issue is often called the computational cliff. It leads to the problem that often only small fragments of ontology languages are exploited, while the reasoners in use are optimized for a much larger set of features, in particular features that are not actually used in the fragment. Today, a number of different reasoners are available that are based on quite different design decisions in addressing the tradeoff between complexity and expressiveness on the one hand and scalability on the other hand: Classical description logic reasoners based on tableaux algorithms are able to classify large, expressive ontologies, as often found for example in the bio-medical domain, but they often provide limited support for dealing with large numbers of instances. Database-like reasoners that materialize inferred knowledge upfront are able to handle large amounts of assertional facts, but are in principle limited in terms of the logic they are able to support. Deciding on an appropriate reasoner for a given application task is far from trivial. In order to support such decisions, comparisons of reasoners based on benchmarks are required.
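To make the two contrasting kinds of reasoning tasks concrete, the following minimal sketch shows how they are typically invoked programmatically. It assumes the OWL API together with an OWLReasonerFactory implementation such as HermiT; the ontology location and class IRI are placeholders for illustration only and are not part of our benchmark setup.

    import org.semanticweb.HermiT.ReasonerFactory;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.*;

    public class ReasoningTaskSketch {
        public static void main(String[] args) throws OWLOntologyCreationException {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            // Placeholder document IRI; any OWL ontology file can be substituted here.
            OWLOntology ontology = manager.loadOntologyFromOntologyDocument(
                    IRI.create("file:example-ontology.owl"));
            OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);

            // Terminological task: classify the ontology, i.e. compute the class hierarchy.
            reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);

            // Assertional task: retrieve all instances of a (placeholder) class,
            // the kind of request that stresses reasoners on large ABoxes.
            OWLClass c = manager.getOWLDataFactory().getOWLClass(
                    IRI.create("http://example.org/onto#SomeClass"));
            NodeSet<OWLNamedIndividual> instances = reasoner.getInstances(c, false);
            instances.getFlattened().forEach(i -> System.out.println(i.getIRI()));
        }
    }

Tableaux-based systems typically spend their effort on the first kind of call for expressive terminologies, whereas materializing, database-like systems are geared towards answering the second kind of request over large sets of assertional facts.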
While a number of performance evaluations of OWL reasoners have already been performed in the past, all of them so far targeted only special-purpose tasks, e.g. focusing either on classical description logic reasoning tasks [2] or on answering conjunctive queries over large knowledge bases based on rather inexpressive ontologies [6]. In our work we aim to go a step further and intend to provide guidance for selecting the appropriate reasoner for a given application scenario. In order to do so, we provide a survey of the ontology landscape, discuss typical reasoning tasks, and define a comprehensive benchmark. Based on the benchmark results we identify which reasoners are most adequate for which classes of ontologies and corresponding reasoning tasks.

The paper is organized as follows: In Section 2 we discuss related work on benchmarking OWL reasoners. Based on an overview of the ontology landscape and relevant language fragments provided in Section 3, we define our benchmark in Section 4. In Section 5, we give a description of the reasoners selected for our benchmark. In Section 6 we report on the experiments performed. We conclude with an outlook on future work in Section 7.

2. RELATED WORK

With the availability of practical reasoners for OWL, a number of benchmarks for evaluating and comparing OWL reasoners have been proposed. The first one – the Lehigh University Benchmark (LUBM) – was proposed by Guo et al. in [6]. LUBM concentrates on the reasoning task of answering conjunctive queries over an OWL Lite ontology with an ABox of varying size. It was later pointed out that – while the ontology itself is in OWL Lite – answering the proposed queries does not require OWL Lite reasoning, but can instead be performed by realizing the ABox, i.e. computing the most specific concept(s) that each individual is an in-