Cooperative Scheduling of Bag-of-Tasks Workflows on Hybrid Clouds

Rubing Duan†, Radu Prodan∗
†Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
∗Institute of Computer Science, University of Innsbruck, Austria
Email: radu@dps.uibk.ac.at, duanr@ihpc.a-star.edu.sg

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
DOI: https://doi.org/10.1109/CloudCom.2014.58

Abstract: We address the problem of scheduling a class of large-scale, real-world-inspired applications on hybrid clouds, characterized by a large number of homogeneous and concurrent tasks that are the main sources of bottlenecks but open great potential for optimization. We formulate the scheduling problem as a new sequential cooperative game and propose a communication- and storage-aware multi-objective algorithm that optimizes two user objectives (execution time and economic cost) while fulfilling two constraints (network bandwidth and storage requirements). We present comprehensive experiments using both simulation and real-world applications that demonstrate the efficiency and effectiveness of our approach in terms of algorithm complexity, makespan, cost, system-level efficiency, fairness, and other aspects compared with other related algorithms.

I. INTRODUCTION

Distributed computing systems such as clouds and grids have evolved towards a worldwide infrastructure providing dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
To program such a large and scalable infrastructure, loosely coupled coordination models of legacy software components such as bags-of-tasks (BoT) and workflows have emerged as successful programming paradigms in the scientific community.

One of the most challenging NP-complete problems that researchers try to address is how to schedule large-scale scientific applications on distributed and heterogeneous resources such that certain objective functions are optimized, such as total execution time (hereafter called makespan) in academic Grids or economic cost (hereafter simply cost) in business or market-oriented clouds, while certain execution constraints such as communication cost and storage requirements are considered and fulfilled. From the end-users' perspective, minimizing cost and minimizing execution time are both desirable, whereas from the system's perspective, system-level efficiency and fairness are good motivations, such that applications with a larger amount of computation are allocated more resources. Currently, only a few schemes can deal with both perspectives, that is, optimize user objectives (e.g. makespan, cost) while fulfilling other constraints and providing good efficiency and fairness to all users.

On the other hand, many applications can generate huge data sets in a relatively short time, such as the Large Hadron Collider, which is expected to produce 5 to 6 petabytes of data per year. Such data must be accommodated and efficiently handled through scheduling that respects appropriate bandwidth and storage constraints.
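As a rough illustration of why bandwidth matters at this scale, the following sketch converts a yearly data volume into the average transfer rate needed to keep up with it. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly uniform production rate, neither of which is stated in the text; the function name `sustained_rate_mb_per_s` is hypothetical and introduced only for this example.

```python
# Back-of-the-envelope conversion from yearly data volume to sustained rate.
# Assumptions (not from the paper): decimal units and uniform production.

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years


def sustained_rate_mb_per_s(petabytes_per_year: float) -> float:
    """Average rate in MB/s needed to move the given yearly volume."""
    bytes_per_year = petabytes_per_year * 10**15  # 1 PB = 10**15 bytes
    return bytes_per_year / SECONDS_PER_YEAR / 10**6  # 1 MB = 10**6 bytes


if __name__ == "__main__":
    for pb in (5, 6):
        mb_s = sustained_rate_mb_per_s(pb)
        gbit_s = mb_s * 8 / 1000  # MB/s -> Gbit/s
        print(f"{pb} PB/year ~ {mb_s:.0f} MB/s ~ {gbit_s:.2f} Gbit/s")
```

Under these assumptions, 5 to 6 petabytes per year corresponds to an average of roughly 160 to 190 MB/s, or about 1.3 to 1.5 Gbit/s, sustained around the clock. Real production is bursty, so peak requirements would be higher than this average.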
In this paper, we address these issues by proposing a communication- and storage-aware multi-objective scheduling scheme for an important class of applications characterized by large sets of independent and homogeneous tasks, interconnected through control flow and data flow dependencies, as follows: (1) multi-objective scheduling minimizes the expected execution time and economic cost of applications based on a sequential cooperative game-theoretic algorithm, and (2) communication- and storage-aware scheduling minimizes the makespan and cost of applications while taking into account their bandwidth and storage constraints for transferring the produced data. The main advantages of our game-theoretic algorithm are its faster convergence, achieved by using competitor and environment information to determine the most promising search direction through logical movements, its minimal requirements regarding the problem formulation, and its easy customisation for new objectives. We compare the performance of our approach with six related heuristics and show that, for applications with large BoTs, our algorithm is superior in complexity (orders-of-magnitude improvement), quality of result (optimal in certain known cases), system-level efficiency, and fairness.

The paper is structured as follows. Section II reviews the most relevant related work. Motivated by real-world applications and real heterogeneous computing testbeds, we introduce in Section III the application and the hybrid cloud computing models, followed by the paper's problem definition. Section IV describes the communication- and storage-aware multi-objective algorithm in detail. In Section V, we validate and compare our algorithm against related methods through simulated and real-world experiments in a hybrid cloud environment. Section VI concludes the paper and discusses some future work.

II. RELATED WORK

Several researchers in performance-oriented distributed computing have focused on system-level load balancing [5], [17] or resource allocation [6], [14], aiming to introduce economic and game-theoretic aspects into computational questions. Penmatsa et al. [17] formulated the scheduling problem as a cooperative game where Grid sites try to minimize the expected response time of tasks, while Kwok et al. [14] investigated the impact of selfish behaviors of individual machines by taking into account the non-cooperativeness of machines. Ghosh et al. [8] proposed a strategy that formulates an incomplete information, alternating-offers bargaining