Enabling Scalable Scientific Workflow Management in the Cloud

Yong Zhao a, Youfu Li a, Ioan Raicu b, Shiyong Lu c, Wenhong Tian a, Heng Liu a

a School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
b Department of Computer Science, Illinois Institute of Technology, Chicago IL, USA
c Department of Computer Science, Wayne State University, Detroit MI, USA

Corresponding author. E-mail address: youfuli.fly@gmail.com (Youfu Li).

Abstract

Cloud computing is gaining tremendous momentum in both academia and industry. In this context, we define the term “Cloud Workflow” as the specification, execution, and provenance tracking of large-scale scientific workflows, as well as the management of data and computing resources to support the execution of large-scale scientific workflows in the Cloud. In this paper, we first analyze the gap between Cloud computing and scientific workflows, two complementary technologies, and what it means to bring them together. We then present the key challenges in supporting Cloud workflows and our reference framework for scientific workflow management in the Cloud. Finally, we present our experience in integrating the Swift scientific workflow management system into the Cloud. We discuss the performance of cluster provisioning within the OpenNebula Cloud platform, the Eucalyptus Cloud platform, and Amazon EC2, and we demonstrate the capability and efficiency of the integration using a NASA MODIS image processing workflow and the Montage image mosaic workflow.

Note to Practitioners

Scientific workflow management plays an important role in scientific computing and application coordination, while Cloud computing offers scalability and on-demand resources.
We devise autonomous methods to integrate scientific workflow management systems with Cloud platforms and to provision resources for large-scale workflows, which helps scientists manage their workflows in the Cloud easily and take advantage of large-scale Cloud resources. There are a few integration options and many challenges in the process, and the experience we have gained will help researchers migrate their workflow management systems and workflow applications into the Cloud.

Keywords

Cloud workflow; Virtual Resource Provisioning; Cloud Resource Management; Scientific Computing Platform

1 Introduction

Governments, research institutes, and industry leaders are strategically adopting Cloud computing [6] to solve the ever-increasing computing and storage problems arising in the Internet age. Cloud platforms and applications have burgeoned in both academia and industry: not long after Amazon opened its Elastic Compute Cloud (EC2) to the public, Google, IBM, and Microsoft released their Cloud platforms one after another. Meanwhile, several open source Cloud platforms, such as Hadoop [30], OpenNebula [1], Eucalyptus [31], Nimbus [20], and OpenStack [2], have become available, each with a fast-growing community. Several major benefits are driving the widespread adoption of the Cloud computing paradigm:
1) Easy access to resources: resources are offered as services and can be accessed over the Internet;
2) Scalability on demand: once an application is deployed onto the Cloud, it can be made scalable automatically, since the Cloud provisions resources on demand and provides the capability of scaling out and in and of load balancing;
3) Better resource utilization: Cloud platforms can coordinate resource utilization according to the resource demand of the applications hosted in the Cloud; and
4) Cost saving: Cloud users are charged based on their resource usage in the Cloud, so they pay only for what they use, and any optimization of their applications is immediately reflected in a lower cost.

Meanwhile, the scientific workflow has become an increasingly popular paradigm for scientists to formalize and structure complex scientific processes, enabling and accelerating many significant scientific discoveries. A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, failure handling, and monitoring of scientific workflows, using workflow logic to control the order in which workflow tasks are executed. The importance of scientific workflows has been recognized and re-emphasized in a Science article titled “Beyond the Data Deluge” [7], which concluded, “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” Cloud workflow would cover both the workflow and Cloud aspects.

The world has entered a “big data” era. The amount of data created in the world is growing at an exponential rate. Advances in scientific instrumentation and network technologies are posing new challenges to our workflow systems in both data scale and application complexity. Scientific workflow systems have previously been applied across a number of execution environments, such as workstations, clusters/Grids, and supercomputers.
In contrast to the Cloud environment, running workflows in these environments faces a series of obstacles when dealing with big data problems: in addition to data scale and computation complexity, there are also