Towards Planning the Transformation of Overlays

Young Yoon 1, Nathan Robinson 2, Vinod Muthusamy 3, Sheila McIlraith 2, Hans-Arno Jacobsen 2
1 Samsung Electronics, 2 University of Toronto, 3 IBM T.J. Watson Research Center

I. INTRODUCTION

Overlay networks are prevalent in distributed systems and are used by a wide variety of applications [1]–[4]. The topology of an overlay network is critical to its performance, and there is extensive work on designing overlay topologies that meet a variety of performance goals, such as minimizing path lengths, controlling node degrees, or maintaining redundant paths [5,6]. In a dynamic system, an overlay may become obsolete and require reconfiguration. For example, changes in traffic patterns, physical network characteristics, and application requirements may render an existing topology sub-optimal and necessitate reconfiguration [6].

Fig. 1. Reconfiguring an overlay (panels: old topology, new topology; nodes 0 to 6 with P_1, P_2, S_1, and S_2 attached; legend: message stream, edge to be removed, goal edge).

Fig. 2. Example of SHIFT operation (panels: initial state, goal state; nodes 0, 1, 2; operation SHIFT(2, 1, 0); edge labels r and g).

Suppose we have an overlay, as shown in Figure 1. Brokers (B) are interconnected to route messages from P_1 and P_2 to S_1 and S_2, which are interested in the messages. Assume that S_1 and S_2 demand more timely delivery of messages. This demand may be met by reducing the average length of the paths over which the messages produced by P_1 and P_2 must travel. This reduction can be achieved by reconfiguring the topology: new routing paths B_0-B_4 and B_5-B_6 are established, while B_1-B_2 and B_2-B_3 are disconnected. In this work, we define the incremental topology transformation (ITT) problem, which seeks a plan, i.e., a sequence of incremental overlay reconfiguration steps, that yields the least disruption [7].
Each incremental step is chosen from a set of primitive transformation operations that ensure reliable message delivery. Any such plan must account for the disruption to message delivery caused by routing state updates applied during the overlay reconfiguration. The plan must also carefully coordinate the reconfiguration steps to avoid transient loops and message loss or reordering [8]. Moreover, the solution space of possible plans grows extremely quickly with the size of the overlay, and an exhaustive search is infeasible for typical enterprise overlays with hundreds of nodes [9,10]. Note that this paper differs from existing work on the mechanics of reconfiguring an overlay at runtime [11,12]. Here we are concerned with devising a transformation plan that minimizes disruption during reconfiguration.

II. PROBLEM AND SOLUTION

Fig. 3. A plan with 7 SHIFT steps (panels: Start, S1 to S7, Goal; legend: removable edge, goal edge).

Fig. 4. A shorter plan for the problem in Figure 3 (panels: Start, S1 to S3, Goal; legend: removable edge, goal edge).

Fig. 5. A plan with MOVE operations. The order of executing the two MOVE operations can yield a different number of routing state updates (panel labels: 11 total routing state updates, 5 total routing state updates; legend: message stream, removable edge, goal edge).

An operation suitable for the incremental transformation of a running topology, SHIFT(v_i, v_j, v_k), was first introduced in [8]. It has been proven that there exists a sequence of these operations that transforms any topology of the type we consider into any other topology of the same type. As shown in Figure 2, SHIFT(v_i, v_j, v_k) replaces the edge (v_i, v_j) with the edge (v_i, v_k), where v_i, v_k ∈ N(v_j) and v_i ≠ v_k.
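To make the SHIFT definition concrete, the following is a minimal Python sketch, not the implementation from [8]. It assumes an undirected overlay stored as a dictionary that maps each vertex to its neighbor set; the `Graph` alias, the `shift` function, and the three-vertex path used as the initial state are all hypothetical choices made for illustration. The sketch only rewires edges and does not model routing state updates or message delivery.

```python
from typing import Dict, Set

# Hypothetical representation: vertex -> set of neighbors, i.e. N(v).
Graph = Dict[int, Set[int]]


def shift(graph: Graph, vi: int, vj: int, vk: int) -> Graph:
    """Return a copy of `graph` in which edge (vi, vj) is replaced by (vi, vk)."""
    # Preconditions from the definition: vi and vk are neighbors of vj, and vi != vk.
    if vi == vk:
        raise ValueError("vi and vk must be distinct")
    if vi not in graph[vj] or vk not in graph[vj]:
        raise ValueError("vi and vk must both be neighbors of vj")

    # Copy the adjacency sets so the input graph is left untouched.
    new_graph = {v: set(neighbors) for v, neighbors in graph.items()}

    # Remove the old edge (vi, vj) in both directions.
    new_graph[vi].discard(vj)
    new_graph[vj].discard(vi)

    # Add the new edge (vi, vk) in both directions.
    new_graph[vi].add(vk)
    new_graph[vk].add(vi)
    return new_graph


# Assumed initial state: the path 0 - 1 - 2.
start: Graph = {0: {1}, 1: {0, 2}, 2: {1}}

# SHIFT(2, 1, 0): replace edge (2, 1) with edge (2, 0).
goal = shift(start, 2, 1, 0)
print(goal)
```

Under the assumed initial path, the call leaves vertex 0 connected to both 1 and 2, so the edge count is unchanged and vertex 2 remains reachable. Returning a copy rather than mutating in place is a design choice that keeps each intermediate state available, which is convenient when a sequence of steps is compared as a plan.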
N(v_x) is the set of neighbors of v_x. We introduce another atomic operation, MOVE(v_i, v_j, v_k, v_l), which directly replaces the removable edge (v_i, v_j) with the goal edge (v_k, v_l). A plan may seek to minimize the number of steps or the number of message streams affected by the transformation. Consider the two plans in Figures 3 and 4, which achieve the same transformation, with the latter requiring fewer steps. Alternatively, consider the two plans in Figure 5, which use the same number of MOVE operations. In the first plan, the message stream between v_0 and v_4 is disrupted by a MOVE operation that requires all vertices to update their routing states. In the second plan, the goal edge (v_0, v_4) is established first, which requires fewer total routing state updates.

We have encoded the ITT problem in PDDL [13] but found that state-of-the-art domain-independent planning systems are unable to solve PDDL encodings of our problems effectively. For example, neither LAMA [14] nor PROBE [15] was able to solve problems with more than 50 nodes, where 50% of the edges were changed between the start and goal graphs, within 10 minutes on a machine with 16 GB of memory and an Intel Xeon 3.00 GHz processor. From an in-depth study of the problem, we have identified the following properties, from which we derive a number of