An Evolutionary Approach to the Index Selection Problem Javier Calle, Yago Sáez, Dolores Cuadra Computer Science Department - Carlos III University of Madrid {fcalle,ysaez,dcuadra}@inf.uc3m.es Abstract— In this paper, evolutionary algorithms are explored with the objective of demonstrating that they offer the most efficient and adequate solution to the Index Selection Problem (ISP). The final target is to develop a self-tuning database system requiring little (or no) intervention from experts in physical design. Following the evaluation of the proposal and the discussion of experimental results, conclusions are made regarding the possibilities presented by evolutionary algorithms for future projects. Keywords- Index Selection Problem, Evolutionary Algorithms, Self-Tuning. I. INTRODUCTION One of the principal concerns present when trying to improve the performance of a database instance is that of finding the most appropriate physical design for that database. Within these general concerns, the Index Selection Problem (ISP) can be defined as the search for a particular combination of indexes such that the cost for a given workload in the database is minimized. This problem has been traditionally formalized as the linear combination of 0-1 values on a string of variables, each representing a different candidate index (i.e., 0-1 integer linear programming) [7]. Being NP-hard [9], the implementation of the ISP ought to consider certain restrictions that allow for a solution to be reached in a reasonable amount of time. Almost all proposed solutions to the ISP, for example, first begin with a search for a reduced subset of candidate indexes. Later, many of these proposals focus on the search for heuristics and efficient pruning techniques in order to avoid the exploration of index combinations known, a priori, to be ineffective. Finally, certain authors [14] opt for statistics-based simulations (rather than taking measurements on real environment executions) in order to save time and not affect databases in use. While the algorithms usually studied as potential solutions to the ISP fix the characteristics of the database and the database management system (DBMS), and even set the workload as static, it is nevertheless the case that each of these parameters is, in reality, dynamic. Thus, it would be much more proper to find an algorithm capable from the start of adapting dynamically to any change in these parameters. Furthermore, a number of additional studies can be found focusing on specific system types (e.g., relational systems, OLTP, etc.) or even certain auxiliary structures [4]. The ability, therefore, of an algorithm to find general solutions applicable to distinct systems and given diverse structures would make its use even more recommendable. Given these considerations, evolutionary algorithms present themselves as the most promising solution, insofar as they can perfectly adapt to the definition of the problem and, in addition, can dynamically adapt to variations in the objective function [10] [11]. Other related studies demonstrating the viability of the proposed solution can also be found [18]. It is the principal objective of this study to offer an actual measurement of the goodness and superiority of this proposal when compared with the results obtained by frequently-used tools, as well as expert database administrators. This first evaluation, therefore, is made using a relational database in static conditions (i.e., fixing characteristics of the database, DBMS and workload), looking for any type of auxiliary structure offered by the DBMS. It is the hope of the authors of this study that the demonstrated relative goodness of evolutionary algorithms as a solution to the ISP may ground future studies that are more general (i.e., applicable to distinct systems) and focus on dynamic conditions. II. RELATED WORK AND PROPOSAL Let C be the set of candidate indexes for a database with a cost function f: W x DB i x P(C) R providing a real value for the execution of workload (W) on a database with a given state (DB i ) and a particular physical design which, for reasons of brevity, is here restricted to the selection of an index combination from the power set of C that is k P(C). The ISP can be now defined as the search for k which minimizes the cost. One of the first problems to be addressed when selecting the correct indexes for a particular database schema is to decide the workload for which the indexes will optimize the performance of the database. The workload is a significantly large set of updating and query instructions (i.e., sentences) representing the operations that occur in a database. One way of selecting a representative workload is by utilizing the logging capabilities [2] of many DBMSs to capture the trace of queries and modifications made in any of those particular systems. In certain published works [3], for example, the new self-tuning characteristics of Oracle RDBMS are presented. Influencing in large part the selection of a good physical design is the analysis of the most frequent operations with the largest margin for improvement. In the proposal presented here, design is grounded in a set of candidate indexes which will be iteratively optimized during the execution of a predefined set of operations. To select them, Oracle uses the Automatic Workload Repository (AWR) which is updated every hour with operations and statistics collected by the DBMS. The second problem to resolve is the selection of a set of candidate indexes for a given workload. Given that, in many cases, the search space of candidates is often computationally 485 978-1-4577-1124-4/11/$26.00 c 2011 IEEE