Paper 44-28 - 1 - Make Your SAS/ACCESS® Query More Efficient Jianming He, Dinesh Jain, Cheng Wang, Ph.D. Quest Diagnostics, Inc., Lyndhurst, NJ 07071 ABSTRACT This paper discusses how to query from relational databases efficiently by using SAS/ACCESS. It integrates the idea that database developers perform data query into SAS environment. It also shows some tips for SAS programmers doing data extraction and data manipulation. INTRODUCTION For some SAS/ACCESS® beginners, they can write elegant SQL scripts for their tasks. Sometimes to their surprise, the SQL scripts do not work as they expected. The query might run forever without anything coming back. However, the database servers might be in jeopardy since the query occupies a lot of computer resources. The problems could be a table scan, improper index, etc. Extracting large data from relational database (RDB) requires knowledge more than PROC SQL procedure. It needs comprehensive understanding of the RDB. SAS/ACCESS® provides a good tool set for data warehousing. SAS programmers should have sufficient knowledge and experience as database developers do in order to query database efficiently. Though there are some literatures talking about retrieving data from RDB by SAS/ACCESS, most of them skip the part of RDB programming. With a mind of the database developer performing data query, this paper will discuss the data query strategy briefly 1 . It will show some tips that might be useful for a SAS/ACCESS ® programmer by using examples. All examples are for SYBASE environment. Interested readers may easily move them to other relational database environment. DATA QUERY METHODOLOGY A SQL script is not simply joining with several tables. To optimize the performance, it needs good understanding of how queries are handled and how queries are executed inside the database. This section describes the query execution process and techniques to write an efficient query. 1) Query Execution Process inside Database server Once the query is passed to the database server from the SQL Procedure Pass-Through Facility, the database server parses and normalizes the query. The parser ensures that the SQL syntax is correct. Normalization ensures that all the objects referenced in the query exist. Permissions are also checked during this process to ensure that the user has permission to access all tables and columns in the query. Query preprocessing changes some search arguments to an optimized form and adds optimized search arguments and join clauses. If no errors are found, the parsed query is passed to the query optimizer. The query optimizer uses statistics about the tables, indexes and columns named in the query, and predicts the cost of using alternative access methods to resolve a particular query. The output of the optimizer is the query plan – the plan that is least costly in terms of I/O. The query plan is compiled with the 1 For further investigation, readers may refer to any database developer’s handbook. Please see the reference. code that contains the ordered steps to carry out the query, including the access methods (table scan or index scan, type of join to use, join order, and etc.) to access each table. Once the query execution plan is complete, the database server executes the query and returns the data to the SQL Procedure Pass- Through Facility. 2) Working with the optimizer The goal of the optimizer is to select the access method for each table that reduces the total time needed to process a query. The optimizer bases its choice on the statistics available for the tables being queried and on other factors such as cache strategies, cache and I/O size. A major component of optimizer decision- making is the statistics available for the tables, indexes and columns. In some situations, the optimizer may seem to make the incorrect choice of access methods. This may be the result of inaccurate or incomplete information (such as out-of-date statistics). In other cases, additional analysis and the use of special query processing options can determine the source of the problem and provide solutions. The query optimizer uses I/O cost as the measure of query execution cost. The significant costs in query processing are the physical I/O (when pages must be read from disk) and the logical I/O (when pages in cache are read for a query). The two significant outputs from the query plan are the table access method and the order of table access. 3) Table Access method in the optimizer For each table in the query the optimizer tries to determine the best path by choosing from the table scan, the index scan (use clustered index) or covering index scan (use non-clustered index). Other than situations where the table scan is cheaper than the index scan, SAS programmers need to avoid the table scan as much as possible. During the table scan process, the database server reads every data page in the table because no useful indexes are available to help retrieve the data it needs. Every data page access needs disk read, which in turn causes poor response time from the system. It also affects the performance of other queries on the server. To avoid the table scan for the query, SAS programmers can give hint to the database by choosing a proper index inside the SQL query. This feature can be used when multiple indexes are available on the table and the optimizer picks index based on its statistics, which may not be the fastest one for your query. A hint suggests that the optimizer use the index defined in the query instead of index based on its algorithm. 4) Order of Table access defined by optimizer If multiple tables are accessed through the SQL query, the optimizer defines the sequence in which all tables will be accessed. This order is the most important factor for the performance of a query. The table accessed first returns a set of rows based on the condition defined in the “where” clause. The rest of the tables are accessed iteratively based on the results returned by the first query. SAS Programmers need to make sure of two things in this situation. The table accessed first should be a qualifying table for the query and should return less number of rows or maximum SUGI 28 Beginning Tutorials SUGI 28 Applications Development