Distributed Approximation Algorithm for Resource Clustering

Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Hubert Larchevêque
Université de Bordeaux, INRIA Bordeaux Sud-Ouest, Laboratoire Bordelais de Recherche en Informatique

Abstract. In this paper, we consider the clustering of resources on large scale platforms. More precisely, we target parallel applications consisting of independent tasks, where each task is to be processed on a different cluster. In this context, each cluster should be large enough to hold and process a task, and the maximal distance between two hosts belonging to the same cluster should be small, in order to minimize the latencies of intra-cluster communications. This corresponds to maximum bin covering with an extra distance constraint. We describe a distributed approximation algorithm that computes a resource clustering for hosts with coordinates in Q in O(log^2 n) steps and O(n log n) messages, where n is the overall number of hosts. We prove that this algorithm provides an approximation ratio of 1/3.

1 Introduction

The past few years have seen the emergence of a new type of high performance computing platform. These highly distributed platforms, such as BOINC [3] or WCG [2], are characterized by their high aggregate computing power and by the dynamism of their topology. Until now, all the applications running on these platforms (SETI@home [4], Folding@home [1], ...) have consisted of a huge number of independent tasks, and all data necessary to process a task must be stored locally on the processing node. The only data exchanges take place between the master node and the slaves, which strongly restricts the set of applications that can be run on these platforms. Two kinds of applications fit this model. The first consists of applications, such as SETI@home, where a huge dataset can be split into arbitrarily small pieces of data that can be processed independently on the participating nodes.
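As an aside on the underlying objective mentioned in the abstract: in plain (metric-free) bin covering, the goal is to form as many groups as possible whose aggregate size reaches a threshold. A minimal sketch of the classical greedy "next-fit" heuristic for this simpler problem follows; it is a standard textbook technique, not the distributed algorithm of this paper, which additionally enforces the distance constraint.

```python
# Illustrative sketch only: greedy next-fit for classical bin covering,
# WITHOUT the distance constraint studied in this paper. Item sizes play
# the role of host memory capacities; a "covered" bin corresponds to a
# cluster whose aggregate memory reaches the required threshold.

def greedy_bin_covering(capacities, threshold):
    """Scan items in the given order, closing a bin as soon as the
    accumulated capacity reaches `threshold`. Returns the number of
    covered bins."""
    covered = 0
    current = 0.0
    for c in capacities:
        current += c
        if current >= threshold:
            covered += 1   # bin is covered; start a fresh one
            current = 0.0
    return covered

# Example: threshold 1.0, five hosts with fractional capacities.
# Hosts {0.6, 0.5} cover one bin, {0.3, 0.8} a second; 0.4 is left over.
print(greedy_bin_covering([0.6, 0.5, 0.3, 0.8, 0.4], 1.0))  # prints 2
```

This greedy scheme is known to cover at least half as many bins as an optimal solution for the unconstrained problem; the difficulty addressed in this paper is doing comparably well, in a distributed fashion, when the items of a bin must also be pairwise close.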
The other applications executed on these large scale distributed platforms correspond to Monte Carlo simulations. In this case, all slaves work on the same data, except for a few parameters that drive the Monte Carlo simulation. This is, for instance, the model corresponding to Folding@home. In this paper, our aim is to extend this last set of applications. More precisely, we consider the case where the set of data needed to perform a task is possibly too large to be stored at a single node. In this case, both processing and storage must be distributed over a small set of nodes that will collaborate to perform the task. The nodes involved in the cluster should have an aggregate memory larger than a given threshold, and they should be close