Whose Fault is It? Correctly Attributing Outages in Cloud Services

Matteo Adriani
Dpt. of Civil Engineering and Computer Science
University of Rome Tor Vergata
Rome, Italy
matteo.adriani93@alice.it

Maurizio Naldi
Dpt. of Civil Engineering and Computer Science
University of Rome Tor Vergata
Rome, Italy
Dpt. of Law, Economics, Politics and Modern Languages
LUMSA University
maurizio.naldi@uniroma2.it
m.naldi@lumsa.it

Abstract—Cloud availability is a major performance parameter in cloud Service Level Agreements (SLAs). Its correct evaluation is essential to SLA enforcement and to any litigation that may arise. Current methods fail to correctly identify the fault location, since they include the network contribution. We propose a procedure to identify the failures actually due to the cloud itself and to provide a correct cloud availability measure. The procedure employs freely available tools, i.e. traceroute and whois, and arrives at the availability measure by first identifying the boundaries of the cloud. We evaluate our procedure by testing it on three major cloud providers: Google Cloud, Amazon AWS, and Rackspace. The results show that the procedure arrives at a correct identification in 95% of cases. The cloud availability obtained in the test after correct identification lies between 3 and 4 nines for the three platforms under test.

I. INTRODUCTION

Availability is a major Quality of Service descriptor in cloud services, and an essential component of Service Level Agreements [1]–[3].

Many efforts have been devoted to understanding and improving the availability of cloud systems. The relevance of the issue has been restated very recently by Varghese and Buyya, who list it among the top research directions and mention the 49-minute outage suffered by Amazon, which cost the company more than $4 million in lost sales, as an indicator of the economic importance of achieving high availability [4].
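To give a sense of scale for these figures, the following minimal sketch converts between downtime, availability, and the number of nines. It assumes that availability is the fraction of uptime over an observation window and that the window is one 365-day year; neither the window nor the function names come from the works cited here, and the 49-minute outage is treated as the only downtime in the window purely for illustration.

```python
import math

# Assumption: a 365-day observation window, expressed in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_from_downtime(downtime_min, window_min=MINUTES_PER_YEAR):
    """Fraction of the window during which the service was up."""
    return 1.0 - downtime_min / window_min


def nines(availability):
    """Number of nines, defined as -log10(1 - availability)."""
    if availability >= 1.0:
        return math.inf  # no downtime observed in the window
    return -math.log10(1.0 - availability)


def downtime_budget_min(n_nines, window_min=MINUTES_PER_YEAR):
    """Downtime (in minutes) allowed by a target of n_nines over the window."""
    return window_min * 10.0 ** (-n_nines)


if __name__ == "__main__":
    a = availability_from_downtime(49)  # hypothetical: one 49-minute outage in a year
    print(f"availability = {a:.6f}, nines = {nines(a):.2f}")
    for n in (3, 4):
        print(f"{n} nines allow {downtime_budget_min(n):.2f} minutes of downtime per year")
```

Under these assumptions, three nines allow about 525.6 minutes of downtime per year and four nines about 52.56 minutes, so a single 49-minute outage nearly exhausts a four-nines yearly budget on its own.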
The same concept had been voiced in [5], where the authors even propose Reliability as a Service, in which reliability is a parameter that users can specify and a service in itself, rather than the random state of a cloud-based service. Concerns about the legal implications that may arise from less-than-adequate cloud reliability have recently been expressed in [6]. An analysis of the main causes of cloud failures has been carried out in [7], where growth trends are also identified, and in [8], where mechanisms to minimize the impact of outages are subsequently discussed. Some papers have focussed on the analysis of the cloud architecture to achieve high availability by design [9]–[11]. A different approach has been taken in [12] and [13], where machine learning techniques have been employed to predict cloud outages (and react accordingly).

If we switch from the perspective of a cloud designer to that of a cloud user, the main interest lies in understanding whether the cloud is performing up to expectations. Setting up, or employing the services of, a cloud monitoring platform is essential in this respect. Several architectures have been proposed for that purpose, e.g. in [14]–[16], and a recent review is contained in [17].

Unfortunately, very few attempts have been made to actually measure cloud availability from a third-party vantage point. An early attempt based on users' reports was described in [18]. The shortcoming of that approach is that the starting time of the outage may not be reported correctly, since a time lag is always present between the time an outage occurs and the time a user first reports it. The ending time of the outage may also be reported wrongly, since most users do not take it upon themselves to report it, and we have to rely on the cloud provider announcing that the problem has been solved and the cloud is back to its fully operational state.
Statistics of working periods and outages have been modelled in [19] with data coming from a small private cloud. Active measurement systems based on ICMP probing packets have been investigated in [20]–[22].

A major issue with all measurement campaigns conducted so far is that they do measure the quality of service experienced by the user, but in doing so they include the losses introduced by the network located between the cloud user and the cloud server. The availability measured in the end is therefore an underestimate of the actual cloud availability.

In this paper, we propose a measurement method that distinguishes between the losses due to the network and those due to the cloud, returning the true cloud availability. After describing the intrusive network problem in Section II and recalling the definition of availability in Section III, our study provides the following original contributions:
• we propose a measurement procedure to measure true cloud availability (Section IV);
• we assess its success rate (Section V), showing that it outperforms previous methods usable for that purpose;
• we apply our procedure to three major cloud providers and contrast the results with concerns raised in early measurement campaigns (Section V), showing that the

Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 433–440. DOI: 10.15439/2019F59. ISSN 2300-5963. ACSIS, Vol. 18. IEEE Catalog Number: CFP1985N-ART. © 2019, PTI