Augmented Petri Net Cost Model for Optimisation of Large Bioinformatics Workflows Using Cloud

Zheng Xie, Liangxiu Han
School of Computing, Mathematics and Digital Technology
Manchester Metropolitan University
Manchester M1 5GD, UK
Z.Xie@mmu.ac.uk, L.Han@mmu.ac.uk

Richard Baldock
MRC Human Genetics Unit
MRC IGMM, University of Edinburgh
Western General Hospital
Edinburgh EH4 2XU, UK
richard.baldock@igmm.ed.ac.uk

Abstract—This paper concerns the trade-off that may be made between the cost of storing intermediate data and the computing costs incurred in regenerating that data when large bioinformatics or other workflows are implemented using cloud resources. The implementation may be required to delete some data to keep storage costs within a budget, and deciding how best to do this with minimal increase in computing costs can pose complex problems. To address these problems, a modified form of Petri net is introduced for modeling the workflow, allowing an optimization algorithm to be applied to several types of problem that may arise. The proposed 'augmented Petri net' simulates workflows with cost models included, thus providing a platform for an optimization procedure. Illustrations are presented to show that such optimization can achieve overall cost reductions in a number of different scenarios.

Keywords—Petri net; bioinformatics workflow; Cloud; cost model.

I. INTRODUCTION

A trade-off may be made between the cost of storing intermediate data and the computing costs incurred in regenerating that data when workflows are implemented using limited resources. At certain times during the workflow it may be advantageous to delete some data to keep storage costs within a budget, though this may incur subsequent additional computation cost in regenerating the data. Deciding how best to do this to achieve a reduction in overall costs can pose complex problems.
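The store-versus-regenerate trade-off above can be sketched with a minimal cost comparison. All function names, prices, and sizes below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical cost comparison for one intermediate dataset: either keep
# it in cloud storage until it is next needed, or delete it and pay the
# compute cost of re-running the steps that produced it.

def storage_cost(size_gb: float, idle_hours: float, price_per_gb_hour: float) -> float:
    """Cost of keeping the data in storage until it is next used."""
    return size_gb * idle_hours * price_per_gb_hour

def regeneration_cost(compute_hours: float, price_per_cpu_hour: float) -> float:
    """Cost of re-running the upstream computation that produced the data."""
    return compute_hours * price_per_cpu_hour

def should_delete(size_gb: float, idle_hours: float, p_store: float,
                  compute_hours: float, p_cpu: float) -> bool:
    """Delete the intermediate data iff regenerating it later is cheaper."""
    return regeneration_cost(compute_hours, p_cpu) < storage_cost(size_gb, idle_hours, p_store)

# Example: 500 GB idle for 720 hours at $0.0001/GB-hour, versus
# 10 CPU-hours of regeneration at $0.10/CPU-hour.
print(should_delete(500, 720, 0.0001, 10, 0.10))  # regeneration ($1.00) < storage ($36.00)
```

In a real workflow the decision is not independent per dataset, since deleting one intermediate result may force upstream results to be retained or regenerated as well; this coupling is what motivates the net-level optimization developed in the paper.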
These problems become especially interesting and important when cloud computing and storage resources are used to implement the extremely large workflows now encountered in bioinformatics. To address these problems, a modified form of Petri net is proposed for modeling the workflow, and an optimization approach is proposed for solving several types of problem that may arise.

Petri nets [2] [3] are widely used for modeling or simulating concurrent systems. Different forms of Petri net have been proposed that take account of timing and computing costs [8] [9]. The accumulated costs of performing transitions and storing the resulting tokens can thus be calculated and used as the basis of optimization procedures for minimizing costs. An example of such a generalization of classic Petri nets is the Priced Timed Petri Net (PTPN) [1], which incorporates event timing, real-time constraints, and computation costing. In a PTPN, each token has a clock which indicates its age. Transitions occur according to the normal mechanisms of Petri nets, but links between places and transitions are labeled with time-intervals to define constraints on the token clocks. When transitions fire, tokens are removed from or added to places only when their ages fall within the time-intervals of the appropriate links. A cost function is applied to each transition and to the storage of tokens. The sum of the costs of all fired transitions and the costs of storing the required tokens in the required places for the required durations of time becomes the cost of implementing the system.

A modified form of PTPN is now proposed for implementing workflows, as exemplified by the training of a classifier for identifying anatomical components within images [11]. Such a classifier is required for the ontological annotation of gene expressions within mouse embryo images.
The annotation is required for identifying gene interactions and networks that are associated with developmental and physiological functions in the embryos. It requires the labeling of embryo images produced by RNA in situ Hybridization (ISH) with appropriate terms from an ontology. The application is based on the EuExpress-II project [10], which delivered partially annotated images encompassing over 20 terabytes of data.

To automatically annotate these images, three stages are required. Firstly, at the training stage, a classification model has to be built based on training image datasets with annotations. Secondly, at the testing stage, the performance of the classification model has to be tested and evaluated. Finally, at the deployment stage, the model has to be deployed to perform the classification of all non-annotated images. This paper is based on the training stage. The processing required includes the integration of images and annotations, digital signal processing to eliminate noise and artifacts, feature generation, feature selection and extraction, and classifier design. This task is an instance of a generic

2013 European Modelling Symposium 978-1-4799-2578-0/13 $31.00 © 2013 IEEE DOI 10.1109/EMS.2013.35