Tritium: A Cross-layer Analytics System for Enhancing
Microservice Rollouts in the Cloud
Sadie Allen
Boston University
Mert Toslali
Boston University
Srinivasan Parthasarathy
IBM Thomas J Watson Research Center
Fabio Oliveira
IBM Thomas J Watson Research Center
Ayse K. Coskun
Boston University
ABSTRACT
Microservice architectures are widely used in cloud-native applica-
tions as their modularity allows for independent development and
deployment of components. With the many complex interactions occurring
between components, it is difficult to determine the effects
of a particular microservice rollout. Site Reliability Engineers must be
able to determine with confidence whether a new rollout is at fault for
a concurrent or subsequent performance problem in the system so they
can quickly mitigate the issue. We present Tritium, a cross-layer ana-
lytics system that synthesizes several types of data to suggest possible
causes for Service Level Objective (SLO) violations in microservice
applications. It uses event data to identify new version rollouts, tracing
data to build a topology graph for the cluster and determine services
potentially affected by the rollout, and causal impact analysis applied
to metric time-series to determine if the rollout is at fault. Tritium
operates on the principle that if a rollout is not responsible for a
change in an upstream or neighboring SLO metric, then the rollout’s
telemetry data will do a poor job predicting the behavior of that SLO
metric. In this paper, we experimentally demonstrate that Tritium
can accurately attribute SLO violations to downstream rollouts and
outline the steps necessary to fully realize Tritium.
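The counterfactual-prediction principle stated above can be illustrated with a minimal sketch. This is our own simplification for exposition, not Tritium's implementation: we regress the SLO metric on the rollout's telemetry over an analysis window and treat the goodness of fit as a suspicion score. The function name, the plain linear model, and the score threshold are all illustrative assumptions.

```python
def rollout_suspicion(rollout_signal, slo_metric):
    """Score how well a rollout's telemetry predicts an SLO metric.

    Fits a one-variable least-squares line and returns R^2. Following
    the stated principle, a low score (poor prediction) suggests the
    rollout is not responsible for the SLO change; a high score marks
    it as a suspect for further investigation.
    """
    n = len(rollout_signal)
    mean_x = sum(rollout_signal) / n
    mean_y = sum(slo_metric) / n
    # Least-squares slope and intercept from covariance and variance.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(rollout_signal, slo_metric))
    var = sum((x - mean_x) ** 2 for x in rollout_signal)
    slope = cov / var if var else 0.0
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(rollout_signal, slo_metric))
    ss_tot = sum((y - mean_y) ** 2 for y in slo_metric)
    return 1.0 - ss_res / ss_tot
```

A production system would use counterfactual time-series models (e.g., Bayesian structural time series) rather than a static regression, but the decision rule is the same: poor predictive power exonerates the rollout.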
CCS CONCEPTS
• Software and its engineering → Software testing and debugging;
• Computer systems organization → Cloud computing.
KEYWORDS
Fault diagnosis, container systems, microservices, version rollouts
ACM Reference Format:
Sadie Allen, Mert Toslali, Srinivasan Parthasarathy, Fabio Oliveira, and Ayse
K. Coskun. 2021. Tritium: A Cross-layer Analytics System for Enhancing
Microservice Rollouts in the Cloud. In Proceedings of WoC ’21: Workshop on
Container Technologies and Container Clouds (WoC ’21). ACM, New York,
NY, USA, 6 pages. https://doi.org/10.1145/3493649.3493656
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
WoC ’21, December 6, 2021, Virtual Event, Canada
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-9171-9/21/12. . . $15.00
https://doi.org/10.1145/3493649.3493656
1 INTRODUCTION
Microservice applications have complex and dynamic interactions
and runtime environments, and this complexity makes it hard to
reproduce or diagnose failures in a testing environment. Faults or
performance anomalies could be the result of improper cluster configuration,
asynchronous service interactions, differences between
multiple instances of the same service, bugs in the source code of
the service itself, or countless other issues [11]. One concern for Site Reliability
Engineers (SREs) is managing new service rollouts, which constantly
happen due to the practice of continuous integration and deployment
[4]. These rollouts do not occur in isolation; varying request
volume, resource and load fluctuations, and many other events
can happen at or near the same time, making it difficult to determine
whether the rollout caused a significant change in the system or
whether the change stems from one of these sources of noise.
1.1 Related Work
Fault diagnosis in microservice systems has been gaining
attention; there are numerous recent works in this space [3, 5, 7, 16,
19, 22–25]. Many past efforts solve a piece of the problem of fault
diagnosis, but do not provide a comprehensive picture of the activities
in a microservice application. In addition, no prior work targets
rollout-specific fault diagnosis. In this section, we briefly discuss
some of the most relevant works and their drawbacks.
Qiu et al.'s resource management framework FIRM [18] implements
a localization algorithm to identify the microservice at
fault for an end-to-end SLO violation. Their algorithm first identifies
critical paths (paths of maximal duration starting with client requests)
and then uses a binary incremental SVM classifier to decide whether
each service in the critical path may be a candidate for being at fault
for the SLO violation. This localization algorithm requires training
on artificially injected performance anomalies before deployment
on a system, and its reliance on critical paths means it is
applicable only to SLOs related to request latency.
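As a rough sketch of critical-path extraction (our own simplification for illustration, not FIRM's algorithm), one can walk a trace from the root span down the child subtree of maximal duration. The span fields ("id", "parent", "service", "duration") are hypothetical names:

```python
def critical_path(spans):
    """Return service names along the maximal-duration root-to-leaf path.

    `spans` is a list of dicts with hypothetical fields "id", "parent"
    (None for the root span), "service", and "duration". For simplicity,
    each span's duration counts as its own cost; real traces would use
    exclusive (self) time derived from span timestamps.
    """
    children = {}
    root = None
    for s in spans:
        if s["parent"] is None:
            root = s
        else:
            children.setdefault(s["parent"], []).append(s)

    def longest(span):
        # Recursively pick the child subtree with the largest total duration.
        kids = children.get(span["id"], [])
        if not kids:
            return span["duration"], [span["service"]]
        dur, path = max((longest(k) for k in kids), key=lambda t: t[0])
        return span["duration"] + dur, [span["service"]] + path

    return longest(root)[1]
```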
Guo et al. developed a system called Graph-based Microservice
Trace Analysis (GMTA). This system abstracts traces into “paths”
representing business flows and uses these trace aggregates to aid
in visualizing service dependencies and diagnosing problems in the
system by flagging anomalous traces [7]. While GMTA does provide
efficient and flexible storage and access to trace data at several
granularities, it is primarily a data storage and visualization tool. It
can aid in human understanding of system architecture and problem
diagnosis, but lacks any automated detection or pinpointing of issues.
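The path abstraction can be illustrated with a small sketch (our own simplification of the idea, not GMTA's implementation): collapse traces that share the same service sequence into one aggregate and flag rare paths as candidates for inspection. The function name and `rare_fraction` threshold are assumptions:

```python
from collections import Counter

def aggregate_paths(traces, rare_fraction=0.1):
    """Group traces by their service-call sequence ("path").

    Traces with identical sequences collapse into one aggregate with a
    count; paths seen in fewer than `rare_fraction` of all traces are
    flagged as potentially anomalous. GMTA's real abstraction also
    captures business-flow semantics and span metadata.
    """
    counts = Counter(tuple(t) for t in traces)
    total = sum(counts.values())
    rare = {p for p, c in counts.items() if c / total < rare_fraction}
    return counts, rare
```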
Some prior approaches aim to make use of more than one type
of data from the application. Luo et al. proposed a fault diagnosis
approach that leverages event and time-series data [15]. Their goal