A Heuristic Approach to Vertical Fragmentation Incorporating Query Information

Hui Ma, Klaus-Dieter Schewe and Markus Kirchberg
Massey University, Information Science Research Centre & Department of Information Systems
Private Bag 11222, Palmerston North, New Zealand

Abstract: Fragmentation, allocation and replication are database distribution design techniques that aim at improving system performance. Of the two fragmentation techniques, vertical fragmentation is often considered more complicated than horizontal fragmentation, because the huge number of alternatives makes it nearly impossible to obtain an optimal solution to the vertical fragmentation problem. Therefore, we can only expect to find a heuristic solution. Fragmentation and allocation are often considered separately, disregarding that they use the same input information to achieve the same objective, i.e. to improve the overall system performance. This paper addresses vertical fragmentation and allocation simultaneously in the context of the relational model. The core of the paper is a heuristic approach to vertical fragmentation, which uses a cost model and is targeted at globally minimising these costs.

I. INTRODUCTION

In the literature many algorithms for vertical fragmentation using attribute affinities have been proposed. Hoffer and Severance [2] measure the affinity between pairs of attributes and try to cluster attributes according to their pairwise affinity by using the bond energy algorithm (BEA). Navathe et al. [8] extend the BEA approach and propose a two-phase approach for vertical partitioning. In the first phase, they use an Attribute Usage Matrix (AUM) to construct the attribute affinity matrix (AAM) on which clustering is performed. In the second phase, estimated cost factors, which reflect the physical environment of fragment storage, are considered to further refine the partitioning schema. Cornell and Yu [3] apply the work of Navathe et al.
[8] to the physical design of relational databases. This approach uses specific physical factors such as the number of attributes, their length and selectivity, and the cardinality of the relation. Navathe and Ra [9] construct a graph-based algorithm to solve the vertical partitioning problem, where the heuristics used include an intuitive objective function that is not explicitly quantified. The algorithm presented in Özsu and Valduriez [10] takes the AAM as input, employs the bond energy algorithm to evaluate the togetherness of a pair of attributes, and then applies an attribute clustering algorithm. Muthuraj et al. [7] propose a formal approach to the n-ary vertical partitioning problem and derive a partition evaluator function that describes the affinity value for clusters of different sizes.

To build an attribute affinity matrix, an Attribute Usage Matrix (AUM) and an Attribute Access Matrix are used. The latter shows where the queries are issued. However, during the process of building the AUM, the site information of the queries is lost. The affinity between a pair of attributes only measures the likelihood that the attributes are accessed together. Only later, at the stage of allocation, are site-specific access measures employed to determine the allocation of fragments. We argue that the site information should already be used at the stage of fragmentation, to meet the data needs of the different sites and to reduce the query processing costs at the same time.

In a nutshell, we incorporate all query information, including the site information, by using a simplified cost model at the stage of vertical fragmentation. In this way, we obtain vertical fragmentation and fragment allocation simultaneously, with low computational complexity and high resulting system performance.

The remainder of this paper is organised as follows.
We start with a brief review of vertical fragmentation, a cost model and a discussion of the impact of vertical fragmentation on query costs in Section 2. In Section 3, we first present an example that shows the problems of some existing approaches to vertical fragmentation; we then introduce a heuristic approach to vertical fragmentation based on a simplified cost model, followed by a discussion with examples. We conclude with a short summary in Section 4.

II. VERTICAL FRAGMENTATION, ALLOCATION AND A COST MODEL

Vertical fragmentation and allocation are distribution design techniques that improve system performance by reducing the total query costs. In this section we first review vertical fragmentation and allocation by presenting formal definitions, followed by a discussion of the impact of vertical fragmentation on optimised query trees. Then we introduce a cost model that can be used to calculate query costs against query trees.
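
Before turning to the formal definitions, the introduction's observation that attribute affinities carry no site information can be illustrated with a minimal sketch. It assumes the usual textbook-style definition, which this paper does not spell out: the affinity of two attributes is the sum, over all queries that use both of them, of the query frequencies accumulated over all sites. The function name, the query and attribute names, and the frequencies are hypothetical and chosen only for illustration.

```python
def affinity_matrix(usage, freq):
    """Build an attribute affinity matrix (assumed textbook-style definition).

    usage[q]   -> set of attributes accessed by query q (the usage matrix)
    freq[q][s] -> frequency with which query q is issued at site s
    Returns a dict mapping (attribute, attribute) to an affinity value.
    """
    attrs = sorted(set().union(*usage.values()))
    aff = {(a, b): 0 for a in attrs for b in attrs}
    for q, used in usage.items():
        # The per-site frequencies are collapsed into one number here,
        # so the information about where q is issued disappears.
        total = sum(freq[q].values())
        for a in used:
            for b in used:
                aff[(a, b)] += total
    return aff


# Hypothetical workload: two queries over three attributes, two sites.
usage = {"q1": {"a1", "a2"}, "q2": {"a2", "a3"}}

# Scenario A: each query is issued at exactly one site.
freq_a = {"q1": {"s1": 10, "s2": 0}, "q2": {"s1": 0, "s2": 10}}
# Scenario B: each query is issued equally often at both sites.
freq_b = {"q1": {"s1": 5, "s2": 5}, "q2": {"s1": 5, "s2": 5}}

aff_a = affinity_matrix(usage, freq_a)
aff_b = affinity_matrix(usage, freq_b)
print(aff_a == aff_b)
```

Under this assumed definition both scenarios produce the same affinity matrix, although in scenario A each query runs at a single site and in scenario B the load is spread evenly. An affinity-based clustering therefore cannot distinguish the two cases, which is why the site-specific frequencies are worth keeping available at the fragmentation stage.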