Selectivity & Cost Estimates in Query Optimization in Distributed Databases Ridhi Kapoor, Dr.R.S.Virk Department of Computer Science & Engineering, Guru Nanak Dev University Amritsar, Punjab, India E-Mail: 1 ridhikapoor89@gmail.com, 2 tovirk@yahoo.com Abstract—Query optimizers are critical to the efficiency of modern relational database systems. If a query optimizer chooses a poor query execution plan, the performance of the database system in answering the query can be very poor. This study describes that there are numerous alternative ways to execute a query. These are so called execution plans. A component in the database management system called the Query Optimizer decides how to pick an efficient execution plan. For this the optimizer deploys cost-based optimization. Approximate execution costs are calculated for various plans, and one with low cost is chosen. The execution cost is a weighted function of the system resources needed to execute the query. Examples of such system resources are the CPU time or the number of I/O operations. In order to come up with reasonable cost estimates, the optimizer needs to estimate the size of sub-queries. This is important, for instance, when choosing the join order of the relations. To estimate the sizes of sub-queries, the optimizer needs to know the selectivity of the query predicates. Keywords— Query Optimization, Distributed Databases, Cost Based Query Optimizers, Selectivity, Response Time, Total Time. I. INTRODUCTION A DDB query is answered by joining tables. In a distributed database, tables reside on different nodes of a computer network; to join tables, data must be moved between nodes. Consequently, the cost of a distributed query includes a processing cost (the joins) and a transmission cost[1]. In a query, the order of joins is not specified. Distributed query optimization involves finding an efficient order for the required joins. Before moving a large table across the network, it may be possible to reduce its size by restricting it to just those rows that are related to the table to which it will be joined. However, to effect this reduction, another table must be sent across the network and an additional join performed. As the number of tables in a query increases, the number of possible join schedules grows at least exponentially; an exhaustive search for the minimum cost schedule is not feasible. Optimizer needs, the selectivity of a query, i.e., the number of records that qualifies to a query, in order to generate an efficient query execution plan. The query optimizer can generate several execution plans for the same query. To choose the execution plan having the response time close to the optimal, the optimizer is based on a cost model. Distributed Database [2]: - A database that consists of two or more data files located at different sites on a computer network. Query consists of operations on tables. Most commonly performed operations are Select (σ): Returns tuples that satisfy a given predicate Project (π): Returns attributes listed Join (⋈): Returns a filtered cross product of its arguments Set operations: Union, Intersect, and Difference Query Processing:-It is defined as the activities involved in parsing, validating, optimizing and executing a query. The main aim of query processing is to transform a query written in high level language(eg.SQL) into efficient and correct strategy expressed in low level language(implementing low- level language). High level user query -> Query Processor ->low-level data manipulation commands. Figure 1: Cost Based Estimation Query Optimization:-It refers to the process by which the best execution strategy for a given query is found from a set of alternatives. Query optimization is a part of query processing. The main aims of query optimization are to choose a transformation that minimizes resource usage, reduce total execution time of query and also reduce response time of query. Cost Based Query Optimization:-It assigns an estimated "cost" to each possible query plan, and chooses the plan with the smallest cost[1]. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations and CPU requirements, and other factors determined from the dictionary. The set of query plans examined is formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loop join). The search space