Early Experience with the Distributed Nebula Cloud*

Pradeep Sundarrajan†, Abhishek Gupta, Matthew Ryden†, Rohit Nair, Abhishek Chandra and Jon Weissman
Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
{abgupta, rnair, chandra, jon}@cs.umn.edu, †{sunda055, ryde0028}@umn.edu

ABSTRACT
Current cloud infrastructures are valued for their ease of use and performance. However, they suffer from several shortcomings. The main problem is inefficient data mobility due to the centralization of cloud resources. We believe such clouds are highly unsuited for dispersed-data-intensive applications, where the data may be spread across multiple geographical locations (e.g., distributed user blogs). Instead, we propose a new cloud model called Nebula: a dispersed, context-aware, and cost-effective cloud. We provide experimental evidence for the need for Nebulas using a distributed blog analysis application, followed by the system architecture and components of our system.

1. INTRODUCTION
The emergence of cloud computing has revolutionized computing and software usage through infrastructure and service outsourcing and its pay-per-use model. Several cloud services, such as Amazon EC2, Google AppEngine, and Microsoft Azure, are being used to provide a multitude of services: long-term state and data storage [27], "one-shot" bursts of computation [18], and interactive end-user-oriented services [17]. The cloud paradigm has the potential to free scientists and businesses from management and deployment issues for their applications and services, leading to higher productivity and more innovation in scientific, commercial, and social arenas. In the cloud computing domain, a cloud signifies a service provided to a user that hides the details of the actual location of the infrastructure resources from the user.
Most currently deployed clouds [12, 16, 19, 4] are built on a well-provisioned and well-managed infrastructure, such as a data center, that provides resources and services to users. The underlying infrastructure is typically owned and managed by the cloud provider (e.g., Amazon, IBM, Google, Microsoft), while the user pays a certain price for their resource use.

* The authors would like to acknowledge grant NSF/IIS-0916425 that supported this research.

DIDC'11, June 8, 2011, San Jose, California, USA.
Copyright 2011 ACM 978-1-4503-0704-8/11/06 ...$10.00.

While current cloud infrastructures are valued for their ease of use and performance, they suffer from several shortcomings. The first problem is inefficient data mobility: many current clouds are largely centralized within a few datacenters and are highly unsuited for dispersed-data-intensive applications, where the data may be spread across multiple geographical locations (e.g., distributed user blogs). In such cases, moving the data to a centralized cloud location before it can be processed can be prohibitively expensive, in both time and cost [3], particularly if the data is changing dynamically. Thus, existing clouds have largely been limited to services with limited data movement requirements, or to those that utilize data already inside the cloud (e.g., S3 public datasets). Secondly, current clouds suffer from a lack of user context: current cloud models, due to their centralized nature, are largely separated from user context and location.
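The data-mobility argument above can be illustrated with a back-of-envelope calculation. The sketch below compares shipping full datasets from dispersed edge sites to a central datacenter against processing in place and shipping only small summaries. All numbers (site count, data sizes, uplink bandwidth) are hypothetical assumptions chosen for illustration, not measurements from this paper.

```python
# Hypothetical illustration of the data-mobility argument: with data
# dispersed across many edge sites, a centralized cloud must pull each
# site's full dataset over its WAN uplink, while in-place processing
# moves only a small intermediate result for aggregation.

def transfer_seconds(num_bytes: float, uplink_bps: float) -> float:
    """Time to move `num_bytes` over a link of `uplink_bps` bits/second."""
    return num_bytes * 8 / uplink_bps

# Assumed setup: each edge site holds 10 GB of blog data behind a
# 10 Mbit/s uplink (sites upload in parallel, so per-site time dominates).
data_per_site = 10 * 10**9   # bytes
uplink = 10 * 10**6          # bits per second

# Centralized cloud: every site ships its full dataset.
centralized = transfer_seconds(data_per_site, uplink)

# Dispersed (Nebula-style): each site processes locally and ships only
# a 10 MB summary (e.g., per-blog word counts) for aggregation.
summary_bytes = 10 * 10**6   # bytes
dispersed = transfer_seconds(summary_bytes, uplink)

print(f"centralized upload per site: {centralized / 3600:.1f} h")
print(f"dispersed upload per site:   {dispersed:.0f} s")
```

Under these assumed numbers, each site spends over two hours uploading its raw data to a central datacenter, versus a few seconds for the summary, and the gap grows if the data changes dynamically and must be re-shipped.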
Finally, current cloud models are also expensive for many developers and application classes. Even though many of these infrastructures charge on a pay-per-use basis, the costs can be substantial for any long-term service deployment (such examples are provided in [26, 29]). Such costs may be barriers to entry for certain classes of users, particularly those that can tolerate reduced service levels, including researchers requiring long-running, large-scale computations and small entrepreneurs wishing to test potential applications at scale.

To overcome the shortcomings of current cloud models, we propose a new cloud model [5] called Nebula: a dispersed, context-aware, and cost-effective cloud. We envision a Nebula supporting many of the applications and services supported by today's clouds; however, it would differ in important ways in its design and implementation. First, a Nebula would have a highly dispersed set of computational and storage resources, so that it can more easily exploit locality to data sources and knowledge about user context. In particular, the Nebula will include edge machines, devices, and data sources as part of its resources. Secondly, most resources within a Nebula would be autonomous and loosely managed, reducing the need for costly administration and maintenance, thus making it highly cost-effective. Note that we do not propose Nebulas as replacements for commercial clouds; rather, we believe Nebulas to be fully complementary to commercial clouds, and in some cases they may represent a transition pathway.

There are several classes of applications and services that we believe would benefit greatly from deployment on such Nebulas [5]. In this paper, we focus on one such class: dispersed data-intensive services. Such services rely on large amounts of dispersed data, where moving data to a centralized cloud can be prohibitively expensive and inefficient in