Enhancing file transfer scheduling and server utilization in data distribution infrastructures

Daniel Higuero, Juan M. Tirado, Florin Isaila, Jesús Carretero
Computer Architecture and Technology Area
Universidad Carlos III de Madrid
Madrid, Spain
{dhiguero, jtirado, florin, jcarrete}@arcos.inf.uc3m.es

Abstract—This paper presents a methodology for efficiently solving the file transfer scheduling problem in a distributed environment. Our solution is based on the relaxation of an objective-based time-indexed formulation of a linear programming problem. The main contributions of this paper are the following. First, we introduce a novel approach to the relaxation of the time-indexed formulation of the transfer scheduling problem in multi-server and multi-user environments. Our solution consists of reducing the complexity of the optimization by transforming it into an approximation problem, whose proximity to the optimal solution can be controlled depending on practical and computational needs. Second, we present a distributed deployment of our methodology, which leverages the inherent parallelism of the divide-and-conquer approach in order to speed up the solving process. Third, we demonstrate that our methodology is able to considerably reduce the schedule length and idle time in a computationally tractable way.

I. INTRODUCTION

Recent years have shown a continuous increase in the volumes of data to be stored and retrieved online. As this evolution is predicted to continue, large-scale data distribution systems face considerable challenges in assuring quality of service while keeping infrastructure costs at profitable levels. Consequently, achieving high resource utilization has become of utmost importance for every large-scale data provider. In this paper we address the problem of finding an efficient data transfer schedule for multi-server and multi-user scenarios.
In particular, we address the problem of transferring a set of files to a set of users, given a certain number of servers, under several constraints such as the maximum available bandwidth. We study how to produce a schedule that delivers all files to all users while reducing the schedule length and maximizing the server utilization. Given the complexity of the scheduling problem, we assume a priori knowledge of the file requests, as is the case, for instance, in publish-subscribe systems. This knowledge may come from an existing subscription database, a forecast of the expected requests, or a combination of both.

There are three main objectives that we target: a) to minimize the schedule length, b) to maximize the file server utilization, and c) to find the schedule in a computationally tractable way. Fulfilling all these objectives is a challenging task, since merely calculating the optimal schedule in a multi-server, multi-user environment can be shown to be NP-complete. In order to address this issue, we seek a practical balance among these objectives.

Our solution is based on the relaxation of an objective-based time-indexed formulation of a linear programming problem. The main contributions of this paper are the following. First, we introduce a novel approach to the relaxation of the time-indexed formulation of the transfer scheduling problem in multi-server and multi-user environments. Our solution consists of reducing the complexity of the optimization by transforming it into an approximation problem, whose proximity to the optimal solution can be controlled depending on practical and computational needs. Second, we present a distributed deployment of our methodology, which leverages the inherent parallelism of the divide-and-conquer approach in order to speed up the solving process. Third, we demonstrate that our methodology is able to considerably reduce the schedule length and idle time in a computationally tractable way.
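To make the first two objectives concrete, the following minimal sketch computes the schedule length and the per-server utilization of a given schedule. It is an illustration only, not the formulation used in this paper: the `Transfer` record, its field names, the discrete time slots, and the assumption that a server handles one transfer at a time are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Transfer(NamedTuple):
    """One scheduled transfer; all field names are illustrative assumptions."""
    server: str
    user: str
    file: str
    start: int  # first time slot occupied (inclusive)
    end: int    # time slot at which the transfer finishes (exclusive)


def schedule_length(schedule):
    """Schedule length (makespan): finishing time of the last transfer."""
    return max((t.end for t in schedule), default=0)


def server_utilization(schedule, servers):
    """Fraction of the schedule length each server spends transferring.

    Assumes a server handles one transfer at a time, so its busy time is
    the plain sum of its transfer durations.
    """
    length = schedule_length(schedule)
    busy = defaultdict(int)
    for t in schedule:
        busy[t.server] += t.end - t.start
    return {s: (busy[s] / length if length else 0.0) for s in servers}


# Hypothetical schedule: two servers, two users, two files.
schedule = [
    Transfer("s1", "u1", "f1", 0, 4),
    Transfer("s1", "u2", "f1", 4, 8),
    Transfer("s2", "u1", "f2", 4, 6),
]
length = schedule_length(schedule)
utilization = server_utilization(schedule, ["s1", "s2"])
```

In this hypothetical schedule, server s2 is idle for most of the schedule length, which is the kind of imbalance that objectives a) and b) aim to reduce.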
The remainder of this paper is organized as follows. Section II describes the baseline model. Section III presents the reformulation of the previous problem as a feasibility model, and describes how to use a distributed architecture to solve the scheduling problem. Section IV shows the evaluation of the system. Section V presents the related work. Finally, Section VI describes future work and conclusions.

II. TRANSFER SCHEDULING PROBLEM

The transfer scheduling problem can be stated as an optimization problem as follows. Given a set of servers S, a set of files F distributed over S, and a set of requests R (a set of tuples <dst, f>, each representing a request from user dst for file f), what is a schedule of optimal length? Figure 1 shows an overview of the transfer scheduling setup. Several versions of this problem may be formulated by adding constraints such as the available server bandwidth, the number of requests served at a time, etc.

Different base problems can be used to model the file transfer scheduling problem: graph coloring, the job shop problem, maximum network flow, etc. In this paper we choose to formulate our base model as an extension of the open job shop problem [1]. This problem consists of n independent jobs to be processed by m parallel machines. Each job i is composed of a set of operations. The operation O_ij of job i is processed by machine j for p_ij time units. All operations of a job have to be processed on all machines. Only one operation can be