Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach

Thaylon Guedes 1 · Leonardo A. Jesus 1 · Kary A. C. S. Ocaña 2 · Lucia M. A. Drummond 1 · Daniel de Oliveira 1

Received: 31 March 2018 / Revised: 20 December 2018 / Accepted: 23 February 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Scientific workflows are abstractions composed of activities, data and dependencies that model a computer simulation and are managed by complex engines named Scientific Workflow Management Systems (SWfMS). Many workflows demand considerable computational resources, since their executions may involve a number of different programs processing massive volumes of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, combined with parallelization techniques, provides the necessary support for the execution of such workflows. Clouds are environments that already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, existing SWfMS must be fault-tolerant. Several fault tolerance techniques are used in SWfMS, such as Checkpoint/Restart, Re-Execution and Over-provisioning, but choosing a fault tolerance technique that will not jeopardize the parallel execution of a workflow is far from trivial. The major problem is that the suitable fault tolerance technique may differ for each workflow, activity or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, thus improving resiliency.

Keywords Cloud computing · Scientific workflow · Fault-tolerance · Recommendation

1 Introduction

Over the last decades, the concept of scientific workflow has emerged as the most successful paradigm for modeling and executing experiments in many domains of science, such as biology, Oil & Gas, chemistry and astronomy [15, 19–21, 54, 55]. Scientific workflows can be defined as abstractions used to define sequences of activities and the data dependencies among them [20, 50]. Each activity in a workflow is associated with a program, script or Web service [50]. The execution of a specific activity with a set of parameter values is called an activation [56]. Workflows are usually modeled, executed and monitored by complex engines named Scientific Workflow Management Systems (SWfMS). Several existing workflows produce and consume large quantities of data and are usually executed repeatedly until an associated hypothesis can be confirmed or refuted (i.e., large-scale workflows) [35]. With large-scale datasets, parallel processing is essential to the performance of the workflow [35]. Thus, High Performance Computing (HPC) and/or Data-Intensive Scalable Computing (DISC) environments, such as the well-known clusters and supercomputers [37], need to be used. In the last decade, clouds [70] also started to be used as HPC environments, since they are able to offer HPC resources within their (commonly huge) pools of resources.
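To make the recommendation idea concrete, the following is a minimal sketch, not the paper's actual implementation, of how a classifier could map provenance-derived features of an activation to one of the fault tolerance techniques mentioned above. The feature set, the training data and the choice of a scikit-learn decision tree are illustrative assumptions.

# Minimal sketch (hypothetical features and labels): recommending a
# fault tolerance technique per activation from provenance data.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical provenance features of past activations:
# [mean runtime (s), observed failure rate, input size (GB)]
X_train = [
    [3600.0, 0.20, 50.0],   # long-running, failure-prone, large input
    [30.0,   0.05, 0.5],    # short and cheap to redo
    [7200.0, 0.40, 120.0],  # very long and critical
    [60.0,   0.10, 1.0],
]
# Technique that performed best in each case (hypothetical labels)
y_train = ["checkpoint_restart", "re_execution",
           "over_provisioning", "re_execution"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Recommend a technique for a new activation's provenance profile
new_activation = [[5400.0, 0.25, 80.0]]
print(model.predict(new_activation)[0])

In this sketch, the learned rules encode the intuition that expensive, failure-prone activations favor Checkpoint/Restart or Over-provisioning, while short, cheap activations can simply be re-executed.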
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank CNPq and FAPERJ for partially sponsoring this research.

Corresponding author: Daniel de Oliveira, danielcmo@ic.uff.br
Lucia M. A. Drummond, lucia@ic.uff.br

1 Instituto de Computação - Universidade Federal Fluminense, Niterói, Brazil
2 Laboratório Nacional de Computação Científica, Petrópolis, Brazil

Cluster Computing, https://doi.org/10.1007/s10586-019-02920-6