Analysing Quality of Resilience in Fish4Knowledge Video Analysis Workflows

Gayathri Nadarajan, Cheng-Lin Yang, Yun-Heh Chen-Burger
School of Informatics, University of Edinburgh, UK

Rafael Tolosana-Calasanz
Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Spain

Omer F. Rana
School of Computer Science & Informatics, Cardiff University, UK

Abstract—The Fish4Knowledge (F4K) project involves analysing video generated from multiple camera feeds to support environmental and ecological assessment. The project utilises a workflow engine that handles on-demand user queries and batch queries, selects a suitable computing platform on which to enact a workflow, and selects suitable software modules to support the analysis. A workflow monitor handles the seamless execution and error monitoring of jobs on a heterogeneous computing platform. End users of such workflows generally include marine biologists, who are often primarily interested in the accuracy, performance and resilience of the workflows they execute. We describe how such users can be provided with possible workflow alternatives that trade off these three characteristics, based on previously recorded (historical) data. We describe a Quality of Resilience (QoR) metric that can be associated with multiple workflow alternatives, enabling such users to make more informed decisions about which alternative to choose.

Keywords-Scientific Workflows, Fault Tolerance

I. INTRODUCTION & MOTIVATION

Representing a scientific application as a workflow has a number of benefits – it enables re-use of services across a number of different applications and enables developers to make their services available across a number of possible computational infrastructures. Composing applications from services in this way also benefits application end users, who are better able to understand the various components that make up their application and subsequently update these as the infrastructure they use changes, or as their own understanding of the problem changes. Resources over which such workflows are enacted can range in granularity from a single machine to multiple clusters, file systems and, more recently, Cloud-based deployments. Cloud technologies are rapidly being incorporated into traditional High Performance Computing (HPC) infrastructures to satisfy the flexibility and adaptability requirements of a number of scientific applications, owing to the on-demand access to storage and computation that Cloud systems provide. The availability of such distributed resources provides unique opportunities for building large-scale, complex environments for enacting workflows across these hybrid infrastructures. However, the use of such infrastructures also makes the enactment process more error prone, with significant heterogeneity in the types and distribution of failures – measured in terms of fault frequency, severity or behaviour. Resource reliability can also vary within and across different infrastructures, which may appear as failures to the workflow enactor. For instance, a tightly coupled HPC environment is often considered less error prone than a distributed environment over which a user (or administrator) may have less control.
Samak et al. [7] identify that the time to solution of scientific applications (and workflows) depends on the efficiency of algorithms, the efficiency of the resources executing the algorithms, and the time required for data movement. Although their main focus has been on improving the efficiency of resources by investigating failure rates and the reliability of job execution, we emphasise that such efficiency requirements can exist at different levels of the application execution stack, from application design (i.e. an abstract workflow) to its enactment and execution on computing resources. Our focus is therefore a wider view encompassing the notion of workflow "resilience", which is investigated in the context of the F4K application in this paper. In particular, we describe how Quality of Resilience (QoR) can help users to make more informed decisions about which workflow configuration to choose.

A. Quality of Resilience

In [14], we define the notion of "Quality of Resilience" (QoR) – a metric that identifies how resilient a given workflow is likely to be prior to its enactment. While workflow Quality of Service (QoS) tries to characterise service levels such as performance or cost from a user's perspective, and to monitor and maintain the agreed service levels, QoR aims at specifying workflow resilience from three different perspectives: the user (QoR_U), the workflow enactor (QoR_E) and the resource manager (QoR_R). We assume that a user running a workflow is primarily interested in a submit-and-forget mode of operation, i.e. one where a workflow is submitted to an enactment engine, often subject to a number of constraints identified by the user (such as execution time, financial cost, etc.). Whereas significant work in workflow enactment has focused on performance (often measured as the workflow makespan) and the associated Quality of Service (QoS) metrics, limited attention has been given to resilience.
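To make the three QoR perspectives concrete, the following Python sketch estimates each component as the fraction of successful historical runs at the corresponding level of the execution stack, and ranks workflow alternatives by a weighted aggregate. The log format, the 0-to-1 success-fraction scale, the weighted-sum aggregation and the alternative names are all illustrative assumptions on our part; the definition in [14] is not restricted to this form.

from dataclasses import dataclass

@dataclass
class QoR:
    user: float      # QoR_U: resilience as perceived by the submitting user
    enactor: float   # QoR_E: resilience at the workflow-enactment level
    resource: float  # QoR_R: resilience at the resource-manager level

def estimate_qor(history):
    # `history` is a hypothetical log format: one dict per past run,
    # with a boolean outcome flag at each level of the execution stack.
    n = max(len(history), 1)
    def frac(key):
        return sum(1 for run in history if run[key]) / n
    return QoR(user=frac("user_ok"),
               enactor=frac("enactment_ok"),
               resource=frac("resource_ok"))

def rank_alternatives(histories, weights=(1.0, 1.0, 1.0)):
    # Rank workflow alternatives by a weighted sum of the three QoR
    # components; the weights let a user trade resilience at one level
    # against another (the aggregation itself is an assumption).
    wu, we, wr = weights
    def score(name):
        q = estimate_qor(histories[name])
        return wu * q.user + we * q.enactor + wr * q.resource
    return sorted(histories, key=score, reverse=True)

# Hypothetical historical data for two workflow alternatives.
runs = {
    "detector_a": [{"user_ok": True, "enactment_ok": True,  "resource_ok": True},
                   {"user_ok": True, "enactment_ok": False, "resource_ok": True}],
    "detector_b": [{"user_ok": True, "enactment_ok": True,  "resource_ok": False}],
}
print(rank_alternatives(runs))  # ['detector_a', 'detector_b']

In this toy example, detector_a scores 1.0 + 0.5 + 1.0 = 2.5 against 2.0 for detector_b, so it would be offered first; with weights that emphasise enactment-level resilience, the ordering could reverse, which is precisely the kind of trade-off the QoR metric is intended to expose to users before enactment.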