Tritium: A Cross-layer Analytics System for Enhancing Microservice Rollouts in the Cloud

Sadie Allen, Boston University
Mert Toslali, Boston University
Srinivasan Parthasarathy, IBM Thomas J. Watson Research Center
Fabio Oliveira, IBM Thomas J. Watson Research Center
Ayse K. Coskun, Boston University

ABSTRACT

Microservice architectures are widely used in cloud-native applications, as their modularity allows for independent development and deployment of components. With the many complex interactions occurring between components, it is difficult to determine the effects of a particular microservice rollout. Site Reliability Engineers must be able to determine with confidence whether a new rollout is at fault for a concurrent or subsequent performance problem in the system so they can quickly mitigate the issue. We present Tritium, a cross-layer analytics system that synthesizes several types of data to suggest possible causes for Service Level Objective (SLO) violations in microservice applications. It uses event data to identify new version rollouts, tracing data to build a topology graph for the cluster and determine services potentially affected by the rollout, and causal impact analysis applied to metric time series to determine whether the rollout is at fault. Tritium works on the principle that if a rollout is not responsible for a change in an upstream or neighboring SLO metric, then the rollout's telemetry data will do a poor job of predicting the behavior of that SLO metric. In this paper, we experimentally demonstrate that Tritium can accurately attribute SLO violations to downstream rollouts, and we outline the steps necessary to fully realize Tritium.

CCS CONCEPTS

• Software and its engineering → Software testing and debugging; • Computer systems organization → Cloud computing.

KEYWORDS

Fault diagnosis, container systems, microservices, version rollouts

ACM Reference Format:
Sadie Allen, Mert Toslali, Srinivasan Parthasarathy, Fabio Oliveira, and Ayse K. Coskun.
2021. Tritium: A Cross-layer Analytics System for Enhancing Microservice Rollouts in the Cloud. In Proceedings of WoC '21: Workshop on Container Technologies and Container Clouds (WoC '21). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3493649.3493656

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

WoC '21, December 6, 2021, Virtual Event, Canada
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-9171-9/21/12...$15.00
https://doi.org/10.1145/3493649.3493656

1 INTRODUCTION

Microservice applications have complex and dynamic interactions and runtime environments, and this complexity makes it hard to reproduce or diagnose failures in a testing environment. Faults or performance anomalies could be the result of improper cluster configuration, asynchronous service interactions, differences between multiple instances of the same service, the actual source code of a service, or countless other issues [11]. One concern for Site Reliability Engineers (SREs) is managing new service rollouts, which happen constantly due to the practice of continuous integration and deployment [4]. These rollouts do not occur in isolation; varying request volume, resource and load fluctuations, and countless other events can happen at or near the same time, making it difficult to determine whether a significant change in the system was caused by the rollout or by one of these sources of noise.
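The attribution principle stated in the abstract (a rollout is implicated only if its telemetry predicts the affected SLO metric well) can be illustrated with a deliberately simplified sketch. This is not Tritium's actual causal impact analysis; it is a stand-in that fits an ordinary least-squares model and thresholds its goodness of fit, and all names, the synthetic data, and the threshold value are illustrative assumptions:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def rollout_implicated(telemetry, slo_metric, threshold=0.5):
    """Fit a least-squares model predicting the SLO metric from the
    rollout's telemetry; a good fit suggests the rollout is implicated."""
    # Design matrix: telemetry features plus an intercept column.
    X = np.column_stack([telemetry, np.ones(len(telemetry))])
    coef, *_ = np.linalg.lstsq(X, slo_metric, rcond=None)
    return r_squared(slo_metric, X @ coef) >= threshold

# Toy data: one SLO latency series tracks the rollout's CPU telemetry
# plus noise; the other is unrelated to the rollout entirely.
rng = np.random.default_rng(0)
cpu = rng.uniform(0.2, 0.9, size=200)
latency_linked = 100 + 80 * cpu + rng.normal(0, 2, size=200)
latency_unrelated = rng.normal(120, 10, size=200)

print(rollout_implicated(cpu[:, None], latency_linked))     # True
print(rollout_implicated(cpu[:, None], latency_unrelated))  # False
```

A production system would compare pre- and post-rollout windows with a proper causal inference method rather than a single in-sample fit, but the predictive-power intuition is the same.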
1.1 Related Work

Fault diagnosis in microservice systems has been gaining attention; there are numerous recent works in this space [3, 5, 7, 16, 19, 22–25]. Many past efforts solve a piece of the fault diagnosis problem but do not provide a comprehensive picture of the activities in a microservice application. In addition, no prior work targets rollout-specific fault diagnosis. In this section, we briefly discuss some of the most relevant works and their drawbacks.

In their resource management framework FIRM [18], Qiu et al. implement a localization algorithm to identify the microservice at fault for an end-to-end SLO violation. The algorithm first identifies critical paths (paths of maximal duration starting with client requests) and then uses a binary incremental SVM classifier to decide whether each service on the critical path is a candidate for being at fault for the SLO violation. This localization algorithm requires training on artificially injected performance anomalies before it can be deployed on a system, and its reliance on critical paths means it is only applicable to SLOs related to request latency.

Guo et al. developed a system called Graph-based Microservice Trace Analysis (GMTA). This system abstracts traces into "paths" representing business flows and uses these trace aggregates to aid in visualizing service dependencies and diagnosing problems in the system by indicating anomalous traces [7]. While GMTA does provide efficient and flexible storage of and access to trace data at several granularities, it is primarily a data storage and visualization tool. It can aid human understanding of system architecture and problem diagnosis, but it lacks any automated detection or pinpointing of issues.

Some prior approaches aim to make use of more than one type of data from the application. Luo et al. proposed a fault diagnosis approach that leverages event and time-series data [15]. Their goal
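The critical-path notion used by FIRM's localization step above (paths of maximal duration rooted at the client request) can be sketched as a longest-path search over a trace's span tree. The span schema and names below are illustrative assumptions rather than FIRM's implementation, and the sketch treats child invocations as strictly sequential, which real traces with concurrent child spans would complicate:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Span:
    """One service invocation in a trace (illustrative schema)."""
    service: str
    self_time_ms: float  # time spent in this service itself
    children: List["Span"] = field(default_factory=list)

def critical_path(root: Span) -> Tuple[List[str], float]:
    """Return the maximal-duration root-to-leaf path and its total duration."""
    if not root.children:
        return [root.service], root.self_time_ms
    # Recurse into each child and keep the costliest sub-path.
    best_path, best_cost = max(
        (critical_path(child) for child in root.children),
        key=lambda pair: pair[1],
    )
    return [root.service] + best_path, root.self_time_ms + best_cost

# Toy trace: the cart -> db branch dominates end-to-end latency.
trace = Span("frontend", 5, [
    Span("cart", 20, [Span("db", 15)]),
    Span("recommender", 8),
])
path, total = critical_path(trace)
print(path, total)  # ['frontend', 'cart', 'db'] 40
```

Restricting fault candidates to services on this path is what ties FIRM's localization to latency SLOs: services off the critical path never appear as candidates, however anomalous their other metrics.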