Adapting Data-Intensive Workloads to Generic Allocation Policies in Cloud Infrastructures

Ioannis Kitsos*, Antonis Papaioannou*, Nikos Tsikoudis*, and Kostas Magoutis
Institute of Computer Science (ICS)
Foundation for Research and Technology Hellas (FORTH)
Heraklion GR-70013, Greece
{kitsos,papaioan,tsikudis,magoutis}@ics.forth.gr

Abstract—Resource allocation policies in public Clouds are today largely agnostic to the requirements that distributed applications have of their underlying infrastructure. As a result, assumptions about data-center topology built into distributed data-intensive applications are often violated, impacting performance and availability goals. In this paper we describe a management system that discovers a limited amount of information about Cloud allocation decisions (in particular, which VMs of the same user are collocated on a physical machine) so that data-intensive applications can adapt to those decisions and achieve their goals. Our distributed discovery process is based either on application-level techniques (measurements) or on a novel lightweight and privacy-preserving Cloud management API proposed in this paper. Using the Hadoop distributed file system as a case study, we show that VM collocation occurs in commercial Cloud platforms and that our methodologies can handle its impact in an effective, practical, and scalable manner.

Keywords: Cloud management; distributed data-intensive applications.

I. INTRODUCTION

Cloud computing [3] has become a dominant model of infrastructural services, replacing or complementing traditional data centers and providing distributed applications with the resources they need to serve Internet-scale workloads. The Cloud computing model is centered on the use of virtualization technologies (virtual machines (VMs), networks, and disks) to reap the benefits of statistical multiplexing on a shared infrastructure.
However, these technologies also obscure the details of the infrastructure from applications. While this is not an issue for some applications, others require knowledge of infrastructure details for reasons of performance and availability. Several distributed data-intensive applications [6][7][10] replicate data across different system nodes for availability. To truly achieve availability, though, nodes holding replicas of the same data block must have independent failure modes. To decide which nodes are suitable to hold a given set of replicas, an application must have knowledge of data-center topology, which is often hidden from applications or provided only in coarse form.

* The authors are listed in alphabetical order.
978-1-4673-0269-2/12/$31.00 ©2012 IEEE

Data center applications often try to reverse-engineer data-center structure by assuming a simple model of the infrastructure (e.g., racks of servers interconnected via network switches) and attempting to derive deployment details (i.e., whether two VMs are placed on the same physical machine, on the same rack, or across racks) from network addressing information. Most current Cloud-infrastructure services, however, completely virtualize network resources (i.e., there is no discernible mapping between virtual network assignments and physical network structure), effectively hiding their resource-allocation decisions from applications.

Cloud management systems today use generic allocation algorithms, such as round-robin across servers or round-robin across racks, with the intent of spreading a user's VM allocation as widely as possible across the infrastructure. Two VMs can nevertheless end up collocated, either because the allocation policy may in some cases favor this (e.g., when using large core-count systems such as Intel's Single-Chip Cloud (SCC) [9]) or because of the limited choices available to the allocation algorithm at different times.
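To make the measurement-based alternative to address-based inference concrete, the following is a minimal sketch of how an application might classify the placement of a VM pair from round-trip-time probes between them. The threshold constants and the classification labels are illustrative assumptions, not values from the paper; in practice they would have to be calibrated per virtualization platform, since intra-host (virtual switch) round trips are typically an order of magnitude faster than trips crossing a physical network.

```python
import statistics

# Hypothetical thresholds (milliseconds); real values must be
# calibrated for the specific Cloud platform under study.
SAME_HOST_RTT_MS = 0.15   # intra-host traffic via the virtual switch
SAME_RACK_RTT_MS = 0.60   # one top-of-rack switch hop

def classify_pair(rtt_samples_ms):
    """Classify the relative placement of two VMs from RTT samples.

    Uses the median to resist outliers caused by transient load.
    Returns one of: 'same-host', 'same-rack', 'cross-rack'.
    """
    rtt = statistics.median(rtt_samples_ms)
    if rtt < SAME_HOST_RTT_MS:
        return "same-host"
    if rtt < SAME_RACK_RTT_MS:
        return "same-rack"
    return "cross-rack"

# Synthetic samples standing in for real ping-style measurements.
print(classify_pair([0.05, 0.07, 0.06, 0.40]))  # -> same-host
print(classify_pair([0.90, 1.10, 0.95]))        # -> cross-rack
```

A replica-placement layer could run such probes pairwise among a user's VMs and avoid placing replicas of the same block on pairs classified as same-host.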
While collocation may be desirable for the high data throughput available between the VMs, it is in general undesirable when an application wants to decouple VMs with respect to physical-machine or network failures for the purpose of achieving high availability.

One option for providing applications with awareness of Cloud resource-allocation policies is to place them inside the Cloud and offer them as what is popularly called a platform-as-a-service (PaaS). This solution requires a close relationship with the Cloud infrastructure provider and is thus not an option for the average application developer. Another option is to extend Cloud resource-allocation policies with application awareness [12]. This solution ensures that application requirements will be taken into account, to some extent, when allocating resources; however, it requires significant changes to current Cloud management systems and is thus hard to deploy. Finally, the approach proposed in this paper is to infer, or be explicitly provided with, a limited amount of information from the underlying Cloud. Adaptation mechanisms often built into distributed data-intensive applications (such as data migration) can leverage this information to adapt to a Cloud's generic allocation policies. We believe that this approach is simpler, and thus easier to deploy, than the alternatives.

The two approaches we explore in this paper are: (a) the black-box approach, namely to discover VM collocation via the