Analysing Quality of Resilience in Fish4Knowledge Video Analysis Workflows

Gayathri Nadarajan, Cheng-Lin Yang, Yun-Heh Chen-Burger
School of Informatics, University of Edinburgh, UK

Rafael Tolosana-Calasanz
Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Spain

Omer F. Rana
School of Computer Science & Informatics, Cardiff University, UK

Abstract—The Fish4Knowledge (F4K) project involves analysing video generated from multiple camera feeds to support environmental and ecological assessment. The project utilises a workflow engine that handles on-demand user queries and batch queries, selects a suitable computing platform on which to enact a workflow, and selects suitable software modules to support the analysis. A workflow monitor handles the seamless execution and error monitoring of jobs on a heterogeneous computing platform. End users of such workflows generally include marine biologists, who are often primarily interested in the accuracy, performance and resilience of the workflows they execute. We describe how such users can be provided with possible workflow alternatives that trade off these three characteristics, based on previously recorded (historical) data. We describe a Quality of Resilience (QoR) metric that can be associated with multiple workflow alternatives, enabling such users to make more informed decisions about which alternative to choose.

Keywords-Scientific Workflows, Fault Tolerance

I. INTRODUCTION & MOTIVATION

Representing a scientific application as a workflow has a number of benefits – it enables re-use of services across a number of different applications and enables developers to make their services available across a number of possible computational infrastructures. Composing applications from services in this way also benefits application end users, who are better able to understand the various components that make up their application and subsequently update these as the infrastructure they use changes, or as their own understanding of the problem changes. Resources over which such workflows are enacted can range in granularity from a single machine to multiple clusters, file systems and, more recently, Cloud-based deployments. Cloud technologies are rapidly being incorporated into traditional High Performance Computing (HPC) infrastructures to satisfy the flexibility and adaptability requirements of a number of scientific applications, owing to the on-demand access to storage and computation that Cloud systems provide. The availability of such distributed resources provides unique opportunities for building large-scale, complex environments for enacting workflows across these hybrid infrastructures. However, the use of such infrastructures also makes the enactment process more error prone, with significant heterogeneity in the types and distribution of failures – measured in terms of fault frequency, severity or behaviour. Resource reliability can also vary within and across different infrastructures, which may appear as failures to the workflow enactor. For instance, a tightly coupled HPC environment is often considered less error prone than a distributed environment over which a user (or administrator) may have less control.
Samak et al. [7] identify that the time to solution of scientific applications (and workflows) depends on the efficiency of algorithms, the efficiency of the resources executing the algorithms, and the time required for data movement. Although their main focus has been on improving the efficiency of resources by investigating failure rates and the reliability of job execution, we emphasise that such efficiency requirements can exist at different levels of the application execution stack, from application design (i.e. an abstract workflow) to its enactment and execution on computing resources. Our focus is therefore a wider view encompassing the notion of workflow "resilience", which is investigated in the context of the F4K application in this paper. In particular, we describe how Quality of Resilience (QoR) can help users to make more informed decisions about which workflow configuration to choose.

A. Quality of Resilience

In [14], we define the notion of "Quality of Resilience" (QoR) – a metric that identifies how resilient a given workflow is likely to be prior to its enactment. While workflow Quality of Service (QoS) tries to characterise service levels such as performance or cost from a user's perspective, and to monitor and maintain the agreed service levels, QoR aims at specifying workflow resilience from three different perspectives: the user (QoR_U), the workflow enactor (QoR_E) and the resource manager (QoR_R). We assume that a user running a workflow is primarily interested in a submit-and-forget mode of operation, i.e. one where a workflow is submitted to an enactment engine, often subject to a number of constraints identified by the user (such as execution time, financial cost, etc.). Whereas significant work in workflow enactment has focused on performance (often measured as the workflow makespan) and the associated Quality of Service (QoS) metrics, limited attention has been given to resilience.
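To make the three QoR perspectives concrete, the following Python sketch estimates each component as the fraction of successful historical runs at the corresponding level of the execution stack, and ranks workflow alternatives by a weighted aggregate. The log format, the 0-to-1 success-fraction scale, the weighted-sum aggregation and the alternative names are all illustrative assumptions on our part; the definition in [14] is not restricted to this form.

from dataclasses import dataclass

@dataclass
class QoR:
    user: float      # QoR_U: resilience as perceived by the submitting user
    enactor: float   # QoR_E: resilience at the workflow-enactment level
    resource: float  # QoR_R: resilience at the resource-manager level

def estimate_qor(history):
    # `history` is a hypothetical log format: one dict per past run,
    # with a boolean outcome flag at each level of the execution stack.
    n = max(len(history), 1)
    def frac(key):
        return sum(1 for run in history if run[key]) / n
    return QoR(user=frac("user_ok"),
               enactor=frac("enactment_ok"),
               resource=frac("resource_ok"))

def rank_alternatives(histories, weights=(1.0, 1.0, 1.0)):
    # Rank workflow alternatives by a weighted sum of the three QoR
    # components; the weights let a user trade resilience at one level
    # against another (the aggregation itself is an assumption).
    wu, we, wr = weights
    def score(name):
        q = estimate_qor(histories[name])
        return wu * q.user + we * q.enactor + wr * q.resource
    return sorted(histories, key=score, reverse=True)

# Hypothetical historical data for two workflow alternatives.
runs = {
    "detector_a": [{"user_ok": True, "enactment_ok": True,  "resource_ok": True},
                   {"user_ok": True, "enactment_ok": False, "resource_ok": True}],
    "detector_b": [{"user_ok": True, "enactment_ok": True,  "resource_ok": False}],
}
print(rank_alternatives(runs))  # ['detector_a', 'detector_b']

In this toy example, detector_a scores 1.0 + 0.5 + 1.0 = 2.5 against 2.0 for detector_b, so it would be offered first; with weights that emphasise enactment-level resilience, the ordering could reverse, which is precisely the kind of trade-off the QoR metric is intended to expose to users before enactment.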