USING JOIN OPERATIONS AS REDUCERS IN DISTRIBUTED QUERY PROCESSING

Ming-Syan Chen and Philip S. Yu
IBM Thomas J. Watson Research Center
P.O. Box 704, Yorktown Heights, New York 10598

ABSTRACT

Semijoin has traditionally been relied upon for reducing the communication cost required for distributed query processing. However, judiciously applying join operations as reducers can lead to further reduction in the communication cost. In view of this fact, we explore in this paper the approach of using join operations, in addition to semijoins, as reducers in distributed query processing. We first show that the problem of determining a sequence of join operations for a query graph can be transformed to that of finding a set of cuts to that graph, where a cut to a graph is a partition of the nodes in that graph. In light of the mapping, we develop an efficient heuristic algorithm to determine an effective sequence of join reducers for a query. The algorithm, which uses the concept of divide-and-conquer, is shown to have polynomial time complexity. Examples are also given to illustrate our results.

1. Introduction

In a distributed relational database system, the processing of a query involves data transmission among different sites via a computer network. As pointed out in [28], the processing of a distributed query is composed of the following three phases: (1) local processing phase, which involves all local processing such as selections and projections, (2) reduction phase, where a sequence of semijoins is used to reduce the size of relations and thus lessen the total communication cost required, and (3) final processing phase, in which all resulting relations are sent to the site where the final query processing is performed.
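As an illustration of the reduction phase, the following minimal sketch applies a single semijoin as a reducer and compares the amount of data shipped with and without it. The relations `orders` and `customers`, the helper names, and the cost model (counting transmitted attribute values) are assumptions made for this example only; they are not taken from the paper.

```python
def semijoin(r, s, attr):
    """Return the rows of r whose value on attr also appears in s."""
    # Only the projection of s on attr has to be shipped to r's site.
    s_values = {row[attr] for row in s}
    return [row for row in r if row[attr] in s_values]


def size(rows):
    """Hypothetical cost model: number of attribute values transmitted."""
    return sum(len(row) for row in rows)


# Two relations assumed to reside at different sites.
orders = [
    {"cust": 1, "item": "disk"},
    {"cust": 2, "item": "tape"},
    {"cust": 3, "item": "cpu"},
    {"cust": 4, "item": "ram"},
]
customers = [
    {"cust": 1, "city": "NY"},
    {"cust": 3, "city": "LA"},
]

# Without a reducer: ship all of orders to the customers site.
cost_without = size(orders)

# With a semijoin reducer: ship the distinct cust values of customers
# to the orders site, reduce orders there, then ship the reduced relation.
join_values = {row["cust"] for row in customers}
reduced_orders = semijoin(orders, customers, "cust")
cost_with = len(join_values) + size(reduced_orders)

print(cost_without, cost_with)
```

In this toy instance the semijoin pays off because the shipped join values plus the reduced relation are smaller than the original relation; with less selective data the reducer would not be profitable.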
In such a distributed database system, the objective is mainly to reduce the communication cost required for data transmission [5]. In view of this fact, significant research efforts have been focused on the problem of reducing the amount of data transmission required for phases (2) and (3) of distributed query processing [1]-[11] [14]-[18] [20]-[24]. The semijoin operation especially has received considerable attention and been extensively studied in the literature. It has been proved that a tree query can be fully reduced by using semijoins [2], and there has been much research reported in developing optimal semijoin sequences to process a tree query [7] [10]. However, there is no polynomial time algorithm proposed for processing general tree queries. For general query graphs, even with one join attribute, the problem of finding an optimal strategy to minimize the data transmission cost has been proved to be NP-hard [15].

Note that in addition to semijoins, join operations can also be used as reducers in distributed query processing to further reduce the communication cost [8] [17] [19]. Moreover, as shown in [8], the approach of using join operations as reducers in distributed query processing not only can result in more profitable semijoins due to the inclusion of joins as reducers¹, but also may reduce the communication cost further by taking advantage of the removability of pure join attributes². However, while a significant amount of research results are available on the application of semijoins to distributed query processing, very little attention in contrast was paid to characterizing the application of join operations as reducers. To remedy this, we focus in this paper on the study of using join operations, in addition to semijoins, as reducers in distributed query processing. A cut to a graph is a partition of the nodes in that graph.
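The notion of a cut as a partition of the nodes can be sketched on a small tree query graph. The example below assumes the simplest case, a partition into two blocks, and uses hypothetical relation names `R1` to `R4`; the helper functions are illustrative and do not reproduce any definition from the paper.

```python
def is_cut(nodes, part_a, part_b):
    """Check that (part_a, part_b) is a two-block partition of nodes."""
    return (
        bool(part_a)
        and bool(part_b)
        and not (part_a & part_b)            # blocks are disjoint
        and (part_a | part_b) == set(nodes)  # blocks cover every node
    )


def crossing_edges(edges, part_a, part_b):
    """Return the edges that have one endpoint on each side of the cut."""
    return [
        (u, v)
        for (u, v) in edges
        if (u in part_a and v in part_b) or (u in part_b and v in part_a)
    ]


# A tree query graph: nodes are relations, edges are join predicates.
nodes = ["R1", "R2", "R3", "R4"]
edges = [("R1", "R2"), ("R2", "R3"), ("R2", "R4")]

cut = ({"R1", "R2"}, {"R3", "R4"})
print(is_cut(nodes, *cut))
print(crossing_edges(edges, *cut))
```

The crossing edges are the join predicates that connect the two groups of relations separated by the cut.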
We introduce a specific type of cuts, referred to as the complete and feasible (CF) set of cuts (see Section 2 for a formal definition of a CF set of cuts). As will be shown, the problem of determining a sequence of join operations for a query graph can be transformed to that of finding a CF set of cuts to that graph. In light of such a mapping, we develop a polynomial time heuristic algorithm based on the concept of divide-and-conquer to determine an effective sequence of join reducers for tree queries. Note that the concept of the CF set of cuts can also be applied to general queries, which may have cycles in their corresponding query graphs. An algorithm based on the concept of the CF set of cuts to find a sequence of join reducers for general queries can be found in [9]. It is worth mentioning that the original problem of finding a sequence of join reducers seems more naturally tackled by state-space search approaches [18] [22], such as the A* search, which are usually more intractable. This fact justifies the advantage of the mapping we propose.

As mentioned earlier, there are very few works reported using join operations as reducers for distributed query processing, and to the best of our knowledge, none of them has either fully explored the theoretical aspects of, or developed heuristic algorithms for, such an approach. This fact distinguishes our work from others. Note that join reducers do not diminish the significance of semijoins as reducers. They are in fact applied in conjunction with semijoin reducers, which can be derived based on the previous research. One can view the query graph used in finding the CF set of cuts as the resultant query graph after the semijoin reducers have been applied.

CH28951/90/0000/0116$01.00 © 1990 IEEE
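To make the idea of a join reducer from this introduction concrete, the sketch below compares two ways of delivering a two-relation query result to a final site. It assumes that a pure join attribute is one that is needed only to evaluate the join and does not appear in the query output; that reading, the relations `r1` and `r2`, and the cost model (counting transmitted attribute values) are assumptions of this example rather than definitions from the paper.

```python
def natural_join(r, s, attr):
    """Join rows of r and s that agree on attr."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


def project_out(rows, attr):
    """Drop attr from every row (no duplicate elimination in this sketch)."""
    return [{k: v for k, v in row.items() if k != attr} for row in rows]


def size(rows):
    """Hypothetical cost model: number of attribute values transmitted."""
    return sum(len(row) for row in rows)


# r1(A, B) at site 1, r2(B, C) at site 2; the query output is (A, C),
# so B is assumed to be a pure join attribute.
r1 = [{"A": "a1", "B": 1}, {"A": "a2", "B": 2}, {"A": "a3", "B": 9}]
r2 = [
    {"B": 1, "C": "c1"},
    {"B": 2, "C": "c2"},
    {"B": 5, "C": "c3"},
    {"B": 6, "C": "c4"},
]

# Strategy 1: ship both relations unchanged to the final site.
cost_direct = size(r1) + size(r2)

# Strategy 2: ship r1 to site 2, join there, remove B, ship the result.
reduced = project_out(natural_join(r1, r2, "B"), "B")
cost_join_reducer = size(r1) + size(reduced)

print(cost_direct, cost_join_reducer)
```

Here the join acts as a reducer in two ways: non-matching tuples disappear, and the join attribute no longer has to be transmitted once the join has been evaluated. In practice a semijoin step could precede the join, in line with the remark that the two kinds of reducers are applied together.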