Analysing Quality of Resilience in Fish4Knowledge Video Analysis Workflows
Gayathri Nadarajan,
Cheng-Lin Yang, Yun-Heh Chen-Burger
School of Informatics,
University of Edinburgh, UK
Rafael Tolosana-Calasanz
Departamento de Informática
e Ingeniería de Sistemas
Universidad de Zaragoza, Spain
Omer F. Rana
School of Computer Science
& Informatics
Cardiff University, UK
Abstract—The Fish4Knowledge (F4K) project involves analysing video generated from multiple camera feeds to support environmental and ecological assessment. The project uses a workflow engine that deals with on-demand user queries and batch queries, the selection of a suitable computing platform on which to enact the workflow, and the selection of suitable software modules to support the analysis. A workflow monitor is also used to handle the seamless execution and error monitoring of jobs on a heterogeneous computing platform. End users of such workflows are generally marine biologists, who are often primarily interested in the accuracy, performance and resilience of the workflows they execute. We describe how such users can be provided with possible workflow alternatives that trade off these three characteristics, based on previously recorded (historical) data. We describe a Quality of Resilience (QoR) metric that can be associated with multiple workflow alternatives and that enables such users to make more informed decisions about which alternative to choose.
Keywords: Scientific Workflows, Fault Tolerance
I. INTRODUCTION & MOTIVATION
Representing a scientific application as a workflow has a number of benefits: it enables re-use of services across a number of different applications and enables developers to make their services available across a range of possible computational infrastructures. Combining services to compose applications in this way also benefits application end users, who are able to better understand the various components that make up their application and subsequently update these as the infrastructure they use changes, or as their own understanding of the problem changes. Resources over which such workflows are enacted can range in granularity from a single machine to multiple clusters, file systems and, more recently, Cloud-based deployments. Cloud technologies are rapidly being incorporated into traditional High Performance Computing (HPC) infrastructures as a way to satisfy the flexibility and adaptability requirements of a number of scientific applications, owing to the on-demand access to storage and computational facilities that Cloud systems provide. The availability of such distributed resources provides unique opportunities for building large-scale, complex environments for enacting workflows across these hybrid infrastructures. However, the use of such infrastructures also makes the enactment process more error prone, with significant heterogeneity in the types and distribution of failures, measured in terms of fault frequency, severity or behaviour. Resource reliability can also vary within and across different infrastructures, which may appear as failures to the workflow enactor. For instance, a tightly coupled HPC environment is often considered to be less error prone than a distributed environment over which a user (or administrator) may have less control.
Samak et al. [7] identify that the time to solution of scientific applications (and workflows) depends on the efficiency of the algorithms, the efficiency of the resources executing those algorithms, and the time required for data movement. Although their main focus has been on improving the efficiency of resources by investigating failure rates and the reliability of job execution, we emphasise that such efficiency requirements can exist at different levels of the application execution stack, from application design (i.e. an abstract workflow) to its enactment and execution on computing resources. Our focus is therefore to consider a wider view encompassing the notion of workflow "resilience", which is investigated in the context of the F4K application in this paper. In particular, we describe how Quality of Resilience (QoR) can help users make more informed decisions about which workflow configuration to choose.
A. Quality of Resilience
In [14], we define the notion of "Quality of Resilience" (QoR), a metric that identifies how resilient a given workflow is likely to be prior to its enactment. While workflow Quality of Service (QoS) tries to characterise service levels such as performance or cost from a user's perspective, and to monitor and maintain the agreed service levels, QoR aims at specifying workflow resilience from three different perspectives: the user (QoR_U), the workflow enactor (QoR_E) and the resource manager (QoR_R). We assume that a user running a workflow is primarily interested in a submit-and-forget mode of operation, i.e. one in which a workflow is submitted to an enactment engine, often subject to a number of constraints also identified by the user (such as execution time, financial cost, etc.). Whereas significant work in workflow enactment has focused on performance (often measured as the workflow makespan) and the associated Quality of Service (QoS) metrics, limited attention has been given to resilience.
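To make the user-facing trade-off concrete, the sketch below shows one way candidate workflow configurations could be ranked from historical execution records, combining resilience (success rate), accuracy and performance into a single score. This is a minimal illustration only: the names (HistoricalRecord, qor_score, rank_alternatives), the weights and the weighted-mean aggregation are assumptions for this sketch, not the QoR formulation defined in [14].

# Illustrative sketch: ranking workflow alternatives by a simple
# QoR-style score computed from historical execution records.
# The weighted-mean aggregation and all names here are assumptions
# for illustration; they are not the formulation given in [14].
from dataclasses import dataclass

@dataclass
class HistoricalRecord:
    config: str        # workflow configuration identifier
    succeeded: bool    # did the enactment complete without failure?
    accuracy: float    # domain accuracy reported for the run (0..1)
    makespan: float    # wall-clock execution time in seconds

def qor_score(records, w_resilience=0.5, w_accuracy=0.3, w_perf=0.2):
    """Aggregate historical runs of one configuration into a score in [0, 1]."""
    n = len(records)
    success_rate = sum(r.succeeded for r in records) / n
    mean_accuracy = sum(r.accuracy for r in records) / n
    # Normalise performance so that shorter makespans score higher
    # (guard against a zero worst-case makespan).
    worst = max(r.makespan for r in records) or 1.0
    mean_perf = sum(1.0 - r.makespan / worst for r in records) / n
    return (w_resilience * success_rate
            + w_accuracy * mean_accuracy
            + w_perf * mean_perf)

def rank_alternatives(history):
    """Group records by configuration and rank configurations by score."""
    by_config = {}
    for r in history:
        by_config.setdefault(r.config, []).append(r)
    scores = {c: qor_score(rs) for c, rs in by_config.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

In practice the weights would reflect a given user's priorities among accuracy, performance and resilience, and separate scores could be maintained for the user, enactor and resource-manager perspectives (QoR_U, QoR_E, QoR_R) rather than a single aggregate.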