J Grid Computing (2012) 10:521–552
DOI 10.1007/s10723-012-9227-2

A Provenance-based Adaptive Scheduling Heuristic for Parallel Scientific Workflows in Clouds

Daniel de Oliveira · Kary A. C. S. Ocaña · Fernanda Baião · Marta Mattoso

Received: 30 September 2011 / Accepted: 9 August 2012 / Published online: 25 August 2012
© Springer Science+Business Media B.V. 2012

D. de Oliveira (✉) · K. A. C. S. Ocaña · M. Mattoso
Federal University of Rio de Janeiro - COPPE/UFRJ, P.O. Box 68511, 21941-972 Rio de Janeiro, RJ, Brazil
e-mail: danielc@cos.ufrj.br
K. A. C. S. Ocaña e-mail: kary@cos.ufrj.br
M. Mattoso e-mail: marta@cos.ufrj.br

F. Baião
Federal University of the State of Rio de Janeiro – UNIRIO, Rio de Janeiro, RJ, Brazil
e-mail: fernanda.baiao@uniriotec.br

Abstract In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring the use of parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm where resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many important open challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since we must take many different criteria into account and exploit the elasticity of the cloud to optimize workflow execution.
In this paper, we introduce an adaptive scheduling heuristic for parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.

Keywords Cloud computing · Scientific workflow · Scientific experiment · Provenance