Effective Keyword Search over Relational Databases
Considering keywords proximity and keywords N-grams

Sina Fakhraee
Dept. of Computer Science, Wayne State University, Detroit, MI 48202, USA
Fakhraee@wayne.edu

Farshad Fotouhi
Dept. of Computer Science, Wayne State University, Detroit, MI 48202, USA
Fotouhi@wayne.edu

Abstract: The amount of text data stored in relational databases is massive and growing fast. This makes it increasingly important for non-technical users to be able to search such information with a simple keyword search. Researchers have studied and addressed some of the issues with the efficiency and effectiveness of answering keyword queries in relational databases. In this paper we summarize the factors affecting the effectiveness of keyword search in relational databases that were studied in earlier work. We also identify two other important factors, namely keyword proximity and keyword N-grams, that can further improve search effectiveness when incorporated into the existing state-of-the-art information retrieval relevance ranking strategies for relational databases. Our experiments show that applying these two factors to the current ranking functions improves the effectiveness of keyword search in relational databases.

I. INTRODUCTION

The amount of text data stored in relational databases is massive and growing fast. This makes it increasingly important for non-technical users to be able to search for information with a simple keyword search, just as they would search for text documents on the web. Keyword search over relational databases (KSRDBs) enables ordinary users to query relational databases by simply submitting keywords, without knowing SQL or the underlying structure of the data.
Answering keyword queries over relational databases is a very challenging task, since good answers must be assembled by joining tuples from multiple relations across the database. Making keyword search effective is even more of a challenge because, unlike text databases, relational databases have a much richer structure. In a relational database the search keywords are usually not found in a single text attribute; they can appear in different text attributes of different relations, each with a different degree of relevance to the search keyword(s).

The earliest works in this area try to capture all the inter-connected tuples (i.e., records from different relations joined on their primary-foreign keys) containing the exact keywords and then rank the results purely by the distance of the keyword-containing tuples from one another. This approach is not very efficient, since it computes all connected results even though most users are interested only in the top-k search results. It is also not very effective, because factors other than the distance between keyword-containing tuples should be taken into account when ranking the answers.

Modern relational databases have incorporated state-of-the-art information retrieval (IR) relevance ranking techniques at the attribute level. Recent works in KSRDBs have exploited this functionality to answer keyword queries by identifying all database tuples that have a non-zero score for a given keyword search. Once these tuples are found, the top-k keyword-containing tuples from each relation are joined, where joinable, via their primary-foreign key relationships, and the joined results that collectively contain the search keywords are presented to the user as the search results.
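The IR-style procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the algorithm of any cited system: the two toy relations (AUTHORS, PAPERS), the function names, the use of a raw keyword count in place of the RDBMS attribute-level IR score, and the summing of per-tuple scores for a joined result are all hypothetical choices made for the example.

```python
from itertools import product

# Hypothetical toy relations; the schema and rows are illustrative only.
AUTHORS = [
    {"aid": 1, "name": "Ada Lovelace"},
    {"aid": 2, "name": "Alan Turing"},
]
PAPERS = [
    {"pid": 10, "aid": 1, "title": "Notes on the analytical engine"},
    {"pid": 11, "aid": 2, "title": "Computing machinery and intelligence"},
    {"pid": 12, "aid": 2, "title": "On computable numbers"},
]


def tokens(text):
    """Lower-case whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def score(text, keywords):
    """Stand-in for the attribute-level IR score: number of keyword occurrences."""
    words = tokens(text)
    return sum(words.count(k) for k in keywords)


def top_k(rows, attr, keywords, k):
    """Return up to k (score, row) pairs with a non-zero score, best first."""
    scored = [(score(r[attr], keywords), r) for r in rows]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def keyword_search(query, k=2):
    """Join the top-k tuples of each relation and keep results covering all keywords."""
    keywords = tokens(query)
    top_authors = top_k(AUTHORS, "name", keywords, k)
    top_papers = top_k(PAPERS, "title", keywords, k)
    results = []
    for (sa, a), (sp, p) in product(top_authors, top_papers):
        if a["aid"] != p["aid"]:  # primary-foreign key join condition
            continue
        covered = set(tokens(a["name"])) | set(tokens(p["title"]))
        if all(kw in covered for kw in keywords):
            # Summing the two scores is an assumption of this sketch.
            results.append((sa + sp, a["name"], p["title"]))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


print(keyword_search("turing computable"))
```

Here the query keywords are split across two relations: "turing" matches only an author name and "computable" matches only a paper title, so an answer exists only as a joined pair of tuples. A query such as "lovelace computable" returns nothing, because the matching tuples are not joinable on the key.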
Taking advantage of the IR relevance ranking strategies employed by modern relational database management systems (RDBMSs) has improved both the efficiency and the effectiveness of keyword search in relational databases. In this paper we study and summarize the IR-style techniques used in recent research. Our key contribution is identifying two other important factors that can further improve search effectiveness when incorporated into the IR relevance ranking strategies. These two factors are 1) the query keyword proximity, which is the overall distance of the keywords from one another in the value of the target text attribute, and 2) the N-grams, and in particular the quadgrams, of the query keywords in both the query itself and the text attributes' values. Our experiments show that incorporating these two factors into the existing state-of-the-art ranking function improves the effectiveness of KSRDBs.

The remainder of this paper is organized as follows: Section II gives a brief overview of query result generation and describes the basic concepts and definitions used in the KSRDBs literature. Section III discusses related work and the background of ranking used in KSRDBs. Section IV describes keyword proximity and N-grams and how to incorporate them into the existing ranking function. Section V presents our experimental results. Section VI concludes the paper and gives directions for future work.

II. OVERVIEW OF QUERY RESULT GENERATION

In this section we describe how the results of a keyword query in relational databases are generated. Section A gives the basic concepts and definitions used in the KSRDBs literature. Section B gives an algorithm to generate the results. The definitions and the algorithm given in Sections A and B are adopted from previous works [2,4].