Augmented Petri Net Cost Model for Optimisation of Large Bioinformatics
Workflows Using Cloud
Zheng Xie, Liangxiu Han
School of Computing, Mathematics and Digital
Technology
Manchester Metropolitan University
Manchester M1 5GD, UK
Z.Xie@mmu.ac.uk, L.Han@mmu.ac.uk
Richard Baldock
MRC Human Genetics Unit MRC IGMM,
University of Edinburgh Western General Hospital
Edinburgh EH4 2XU, UK
richard.baldock@igmm.ed.ac.uk
Abstract—This paper concerns the trade-off that may be made
between the cost of storing intermediate data and the
computing costs incurred in regenerating this data when large
bioinformatics or other workflows are implemented using
cloud resources. The implementation may be required to
delete some data to keep storage costs within a budget, and
deciding how best to do this with minimal increase in
computing costs poses complex problems. To address
these problems, a modified form of Petri net is introduced for
modeling the workflow and allowing an optimization algorithm
to be applied for addressing several types of problem that may
arise. The proposed ‘augmented Petri-net’ simulates work-
flows with cost models included, thus providing a platform for
an optimization procedure. Illustrations are presented to show
that such optimization can achieve overall cost reductions in a
number of different scenarios.
Keywords-Petri Net; bioinformatics workflow; Cloud; cost
model.
I. INTRODUCTION
A trade-off may be made between the cost of storing
intermediate data and the computing costs incurred in
regenerating this data when workflows are implemented
using limited resources. At certain times during the
workflow it may be advantageous to delete some data to
keep storage costs within a budget, though this may incur
subsequent additional computation cost in regenerating the
data. Deciding how best to do this to achieve a reduction in
overall costs poses complex problems. These problems
become especially interesting and important when cloud
computing and storage resources are being used to
implement extremely large workflows as are now being
encountered in bioinformatics. To address these problems,
a modified form of Petri net is proposed for modeling the
workflow, and an optimization approach is proposed for
solving several types of problem that may arise.
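The core trade-off can be made concrete with a small sketch. All names and prices below are illustrative assumptions, not values from this paper: an intermediate dataset is worth keeping only while its storage cost over the period until it is next needed stays below the cost of regenerating it.

```python
def keep_intermediate(size_gb, storage_price_gb_month, idle_months,
                      regen_cpu_hours, cpu_price_hour):
    """Decide whether to keep an intermediate dataset or delete it and
    regenerate it on next use.  Returns True if storing is cheaper.

    All parameters are hypothetical: size_gb is the dataset size,
    idle_months the time until the data is next needed, and
    regen_cpu_hours the compute time needed to rebuild it.
    """
    storage_cost = size_gb * storage_price_gb_month * idle_months
    regen_cost = regen_cpu_hours * cpu_price_hour
    return storage_cost < regen_cost

# A 500 GB intermediate result, idle for 2 months, costing 40 CPU-hours
# to regenerate, at $0.10/GB-month storage and $0.50/CPU-hour compute:
# storage = 500 * 0.10 * 2 = $100, regeneration = 40 * 0.50 = $20,
# so deletion is the cheaper choice here.
print(keep_intermediate(500, 0.10, 2, 40, 0.50))  # False
```

In a real workflow this comparison must be made per dataset and per point in time, and the regeneration of one deleted dataset may require others upstream, which is what makes the optimization problem non-trivial and motivates the net-based model below.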
Petri nets [2] [3] are widely used for modeling or
simulating concurrent systems. Different forms of Petri
net have been proposed to take timing and computing
costs into account [8] [9]. Thus, the accumulated costs of
performing transitions and storing the resulting tokens can
be calculated and used as the basis of optimization
procedures for minimizing costs. An example of such a
generalization of classic Petri nets is the Priced Timed Petri
Net (PTPN) [1] which incorporates event timing, real-time
constraints, and computation costing. In a PTPN, each
token has a clock which indicates its age. Transitions occur
according to the normal mechanisms of Petri nets, but links
between places and transitions are labeled with time-
intervals to define constraints on the token clocks. When
transitions fire, tokens are removed from or added to places
only when their ages fall within the time-intervals of the
appropriate links. A cost function is applied to each
transition and the storage of tokens. The sum of the costs of
all fired transitions and the costs for storing the required
tokens in the required places for the required durations of
time becomes the cost of implementing the system.
A modified form of PTPN is now proposed for
implementing workflows as exemplified by the training of a
classifier for identifying anatomical components within
images [11]. Such a classifier is required for the ontological
annotation of gene expressions within mouse embryo
images. The annotation is required for identifying gene
interactions and networks that are associated with
developmental and physiological functions in the embryos.
It requires the labeling of embryo images produced from
RNA in situ Hybridization (ISH) with appropriate terms
from an ontology. The application is based on the
EURExpress-II project [10], which delivered partially
annotated images encompassing over 20 Terabytes of data.
To automatically annotate these images, three stages are
required. Firstly, at the training stage, a classification model
has to be built based on training image datasets with
annotations. Secondly, at the testing stage, the performance
of the classification model has to be tested and evaluated.
Finally, at the deployment stage, the model has to be
deployed to perform the classification of all non-annotated
images. This paper is based on the training stage. The
processing required includes the integration of images and
annotations, digital signal processing to eliminate noise and
artifacts, feature generation, feature selection and extraction,
and classifier design. This task is an instance of a generic
2013 European Modelling Symposium
978-1-4799-2578-0/13 $31.00 © 2013 IEEE
DOI 10.1109/EMS.2013.35