Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach

Thaylon Guedes 1 · Leonardo A. Jesus 1 · Kary A. C. S. Ocaña 2 · Lucia M. A. Drummond 1 · Daniel de Oliveira 1

Received: 31 March 2018 / Revised: 20 December 2018 / Accepted: 23 February 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Scientific workflows are abstractions composed of activities, data and dependencies that model a computer simulation and are managed by complex engines named Scientific Workflow Management Systems (SWfMS). Many workflows demand considerable computational resources, since their executions may involve a number of different programs processing massive volumes of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, combined with parallelization techniques, provides the necessary support for the execution of such workflows. Clouds are environments that already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, existing SWfMS must be fault-tolerant. Several fault tolerance techniques are used in SWfMS, such as Checkpoint/Restart, Re-Execution and Over-provisioning, but choosing a fault tolerance technique that will not jeopardize the parallel execution of a workflow is far from trivial. The major problem is that the suitable fault tolerance technique may differ for each workflow, activity or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, thus improving resiliency.

Keywords Cloud computing · Scientific workflow · Fault-tolerance · Recommendation

1 Introduction

Over the last decades, the concept of scientific workflow has emerged as the most successful paradigm for modeling and executing experiments in many domains of science, such as biology, Oil & Gas, chemistry and astronomy [15, 19–21, 54, 55]. Scientific workflows can be defined as abstractions used to define sequences of activities and the data dependencies among them [20, 50]. Each activity in a workflow is associated with a program, script or Web service [50]. The execution of a specific activity with a set of parameter values is called an activation [56]. Workflows are usually modeled, executed and monitored by complex engines named Scientific Workflow Management Systems (SWfMS). Several existing workflows produce and consume large quantities of data and are usually executed repeatedly until an associated hypothesis can be confirmed or refuted (i.e., large-scale workflows) [35]. With large-scale datasets, parallel processing is essential to the performance of the workflow [35]. Thus, High Performance Computing (HPC) and/or Data-Intensive Scalable Computing (DISC) environments, such as the well-known clusters and supercomputers [37], need to be used. In the last decade, clouds [70] also started to be used as HPC environments, since they are able to offer HPC resources within their (commonly huge) pools of resources.
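To make the recommendation idea concrete, the following is a minimal sketch, not the paper's actual implementation, of how a classifier could map provenance-derived features of an activation to one of the fault tolerance techniques mentioned above. The feature set, the training data and the choice of a scikit-learn decision tree are illustrative assumptions.

# Minimal sketch (hypothetical features and labels): recommending a
# fault tolerance technique per activation from provenance data.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical provenance features of past activations:
# [mean runtime (s), observed failure rate, input size (GB)]
X_train = [
    [3600.0, 0.20, 50.0],   # long-running, failure-prone, large input
    [30.0,   0.05, 0.5],    # short and cheap to redo
    [7200.0, 0.40, 120.0],  # very long and critical
    [60.0,   0.10, 1.0],
]
# Technique that performed best in each case (hypothetical labels)
y_train = ["checkpoint_restart", "re_execution",
           "over_provisioning", "re_execution"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Recommend a technique for a new activation's provenance profile
new_activation = [[5400.0, 0.25, 80.0]]
print(model.predict(new_activation)[0])

In this sketch, the learned rules encode the intuition that expensive, failure-prone activations favor Checkpoint/Restart or Over-provisioning, while short, cheap activations can simply be re-executed.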
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank CNPq and FAPERJ for partially sponsoring this research.

Corresponding author: Daniel de Oliveira, danielcmo@ic.uff.br
Lucia M. A. Drummond, lucia@ic.uff.br

1 Instituto de Computação - Universidade Federal Fluminense, Niterói, Brazil
2 Laboratório Nacional de Computação Científica, Petrópolis, Brazil

Cluster Computing, https://doi.org/10.1007/s10586-019-02920-6