Toward a Cloud Operating System

Fabio Pianese, Peter Bosch, Alessandro Duminuco, Nico Janssens, Thanos Stathopoulos†, Moritz Steiner

Alcatel-Lucent Bell Labs
Service Infrastructure Research Dept.
†Computer Systems and Security Dept.
{firstname.lastname}@alcatel-lucent.com

Abstract: Cloud computing is characterized today by a hotchpotch of elements and solutions, namely operating systems running on a single virtualized computing environment, middleware layers that attempt to combine physical and virtualized resources from multiple operating systems, and specialized application engines that leverage a key asset of the cloud service provider (e.g., Google’s BigTable). Yet, there does not exist a virtual distributed operating system that ties together these cloud resources into a unified processing environment that is easy to program, flexible, scalable, self-managed, and dependable.

In this position paper, we advocate the importance of a virtual distributed operating system, a Cloud OS, as a catalyst in unlocking the real potential of the Cloud: a computing platform with seemingly infinite CPU, memory, storage and network resources. Following established Operating Systems and Distributed Systems principles laid out by UNIX and subsequent research efforts, the Cloud OS aims to provide simple programming abstractions to available cloud resources, strong isolation techniques between Cloud processes, and strong integration with network resources. At the same time, our Cloud OS design is tailored to the challenging environment of the Cloud by emphasizing elasticity, autonomous decentralized management, and fault tolerance.

I. INTRODUCTION

The computing industry is radically changing the scale of its operations.
While a few years ago typical deployed systems consisted of individual racks filled with a few tens of computers, today’s massive computing infrastructures are composed of multiple server farms, each built inside carefully engineered data centers that may host several tens of thousands of CPU cores in extremely dense and space-efficient layouts [1]. There are several reasons for this development:

• Significant economies of scale in manufacturing and purchasing huge amounts of off-the-shelf hardware parts.
• Remarkable savings in power and cooling costs from the massive pooling of computers in dedicated facilities.
• Hardware advances that have made the use of system virtualization techniques viable and attractive.
• Commercial interest in offloading a growing set of applications and services “into the Cloud”.

The commoditization of computing is thus transforming processing, storage, and bandwidth into utilities akin to electrical power, water, or telephone access. This process is already well under way, as businesses of all sizes now tend to outsource their computing infrastructures, often turning to external providers to fulfill their entire operational IT needs. The migration of services and applications into the network is also modifying how computing is perceived in the mainstream, turning what was once felt as the result of concrete equipment and processes into an abstract entity, devoid of any physical connotation: this is what the expression “Cloud computing” currently describes.

Previous research has successfully investigated the viability of several approaches to managing large-scale pools of hardware, users, processes, and applications. The main concerns of these efforts were twofold: on the one hand, exploring technical issues such as the scalability limits of management techniques; on the other hand, understanding real-world and “systemic” concerns such as ease of deployment and the expressiveness of the user and programming interfaces.

Our main motivation lies in the fact that state-of-the-art management systems available today do not provide access to the Cloud in a uniform and coherent way. They either attempt to expose all the low-level details of the underlying pieces of hardware [2] or reduce the Cloud to a mere set of API calls: to instantiate and remotely control resources [3][4][5][6], to provide facilities such as data storage, CDN/streaming, and event queues [7][8][9], or to make available distributed computing library packages [10][11][12]. Yet, a major gap still has to be bridged in order to bond the Cloud resources into one unified processing environment that is easy to program, flexible, scalable, self-managing, and dependable.

In this position paper, we argue for a holistic approach to Cloud computing that transcends the limits of individual machines. We aim to provide a uniform abstraction, the Cloud Operating System, that adheres to well-established operating systems conventions, namely: (a) providing a simple yet expressive set of Cloud metrics that can be understood by the applications and exploited according to individual policies and requirements, and (b) exposing a coherent and unified programming interface that can leverage the available network, CPU, and storage as the pooled resources of a large-scale distributed Cloud computer.

In Section II we elaborate our vision of a Cloud operating system, discuss our working assumptions, and state the requirements we aim to meet. In Section III we present a set of elements and features that we see as necessary in a Cloud OS: distributed resource measurement and management techniques, resource abstraction models, and interfaces, both to the underlying hardware and to the users/programmers. We then briefly review the related work in Section IV, and conclude in Section V with our plans for the future.