Cohesive Keyword Search on Tree Data

Aggeliki Dimitriou
School of Electrical and Computer Engineering
National Technical University of Athens, Greece
angela@dblab.ntua.gr

Ananya Dass
Department of Computer Science
New Jersey Institute of Technology, USA
ad292@njit.edu

Dimitri Theodoratos
Department of Computer Science
New Jersey Institute of Technology, USA
dth@njit.edu

Yannis Vassiliou
School of Electrical and Computer Engineering
National Technical University of Athens, Greece
yv@cs.ntua.gr

ABSTRACT

Keyword search is the most popular querying technique on semistructured data. Keyword queries are simple and convenient. However, as a consequence of their imprecision, there is usually a huge number of candidate results, of which only very few match the user's intent. Unfortunately, the existing semantics for keyword queries are ad hoc and generally fail to "guess" the user intent. Therefore, the quality of their answers is poor and the existing algorithms do not scale satisfactorily.

In this paper, we introduce the novel concept of cohesive keyword queries for tree data. Intuitively, a cohesiveness relationship on keywords indicates that they should form a cohesive whole in a query result. Cohesive keyword queries allow term nesting and keyword repetition. They bridge the gap between flat keyword queries and structured queries. Although more expressive, they are as simple as flat keyword queries and do not require any schema knowledge. We provide formal semantics for cohesive keyword queries and rank query results based on the proximity of the keyword instances. We design a stack-based algorithm which efficiently evaluates cohesive keyword queries. Our experiments demonstrate that our approach outperforms previous filtering semantics in quality and that our algorithm scales smoothly on queries of even 20 keywords on large datasets.

1. INTRODUCTION

Keyword search has been by far the most popular technique for retrieving data on the web.
The success of keyword search relies on two facts. First, the user does not need to master a complex structured query language (e.g., SQL, XQuery, SPARQL). This is particularly important since the vast majority of people who retrieve information from the web do not have expertise in databases. Second, the user does not need to have knowledge of the schema of the data sources over which the query is evaluated. In practice, on the web, the user might not even be aware of the data sources which contribute results to her query. The same keyword query can be used to extract data from multiple data sources that have different schemas and possibly adopt different data models (e.g., relational, tree, graph, flat documents).

There is a price to pay for the simplicity, convenience and flexibility of keyword search. Keyword queries are imprecise in specifying the query answer. They lack expressive power compared to structured query languages. As a consequence, there is usually a very large number of candidate results, of which very few are relevant to the user intent. This weakness gives rise to at least two major problems: (a) because of the large number of candidate results, previous algorithms for keyword search are of high complexity and cannot scale satisfactorily when the number of keywords and the size of the input dataset increase, and (b) correctly identifying the relevant results among the plethora of candidate results becomes a very difficult task. Indeed, it is practically impossible for a search system to "guess" the user intent from a keyword query and the structure of the data source.

© 2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2016.15

The focus of this work is on keyword search over web data which are represented as trees. Currently, huge amounts of data are exported and exchanged in tree-structured form [31, 33]. Trees can conveniently represent data that are semistructured, incomplete and irregular, as is usually the case with data on the web. Multiple approaches assign semantics to keyword queries on data trees by exploiting structural and semantic features of the data in order to automatically filter out irrelevant results. Examples include Smallest LCA [18, 40, 37, 10], Exclusive LCA [16, 41, 42], Valuable LCA [11, 20], Meaningful LCA [24, 38], MaxMatch [28], Compact Valuable LCA [20] and Schema level SLCA [15]. A survey of some of these approaches can be found in [29]. Although filtering approaches are intuitively reasonable, they are largely ad hoc and are frequently violated in practice, resulting in low precision and/or recall.

A better technique is adopted by other approaches, which rank the candidate results in descending order of their estimated relevance. Given that users are typically interested in a small number of query results, some of these approaches combine ranking with top-k algorithms for keyword search. The ranking is performed using a scoring function and is usually based on IR-style metrics for flat documents (e.g., tf*idf or PageRank) adapted to the tree-structured form of the data [16, 11, 3, 5, 21, 10, 38, 23, 29, 32]. Nevertheless,