The Network Stack Trace: Performance diagnosis for networked systems

Justin McCann, jmccann@cs.umd.edu, University of Maryland
Michael Hicks, mwh@cs.umd.edu, University of Maryland

Abstract

Transient network stalls that degrade application performance are frustrating to users and developers alike. Software bugs, network congestion, and intermittent connectivity all present the same symptoms: low throughput, high latency, and user-level timeouts. In this paper, we show how an end host can identify the sources of network stalls using only simple counters from its local network stack. By viewing the network stack as a producer-consumer dependency graph and monitoring its activity as a whole, our rule-based expert system correctly identifies which modules are hampering performance over 99% of the time, with false positive rates under 3%. The result is a network stack trace: a lightweight snapshot of the end host's networking stack that describes the behavior of each application, socket, connection, and interface.

1 Introduction

Diagnosing performance degradation in distributed systems is a complex and difficult task. Software that performs well in one environment may be unusably slow in another, and determining the root cause is time-consuming and error-prone, even in enterprise environments where all the data may be available. End users have an even more difficult time diagnosing system performance. When a user's video stream has problems, it could be for any number of reasons: the browser plugin may be buggy, the neighbors' wireless networks may be creating interference, their computer or the server may be overloaded, or there may be congestion along the Internet path. To the user, the symptoms are all the same: a stalled or stuttering application.
In this paper, we present Network Stack Trace (NeST), a system with which an end host can identify the source of short network stalls that lead to low throughput, high latency, and connection timeouts. Our goal is roughly equivalent to determining where messages are being blocked or dropped, e.g., in the application, in TCP's buffers, in the IP network, or at the physical layer. NeST treats a host's network stack as a dependency graph of modules (applications, sockets, connections, tunnels, interfaces) that provide service to each other. Higher-layer modules must produce messages for lower-layer modules to send, and consume messages as they are received. As described in Section 2, rather than trace specific messages through the stack to see where they are getting dropped or hung up, we monitor a few basic counters exported by each module, and accumulate evidence from them to make a diagnosis. In particular, diagnosis proceeds in three steps: (1) snapshot the packet counters, queue lengths, and error counters for each module; (2) make observations from those counters about the module's (in)activity; and (3) perform a dependency analysis, relating one module's state to that of its dependents and neighbors, to determine the likelihood that the module is misbehaving.

For the last step we employ a heuristic-based expert system, described in Section 3. Comparing a module's behavior with that of related modules allows us to resolve questions that cannot be answered by examining each module in isolation. For example, if a module M's counters are not increasing, the module could be stuck, or it might simply have nothing to do.
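The first two diagnosis steps can be sketched as follows. This is a minimal illustration, not NeST's implementation; the counter fields and the observation names are hypothetical stand-ins for whatever a real module (a socket, a TCP connection, an interface) exports.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Step 1: a point-in-time copy of one module's counters (hypothetical fields)."""
    packets_in: int
    packets_out: int
    queue_len: int
    errors: int

def observe(prev: Snapshot, curr: Snapshot) -> dict:
    """Step 2: difference two snapshots into (in)activity observations."""
    return {
        "consuming": curr.packets_in > prev.packets_in,
        "producing": curr.packets_out > prev.packets_out,
        "queue_growing": curr.queue_len > prev.queue_len,
        "erroring": curr.errors > prev.errors,
    }

# A module that received 80 messages between snapshots but sent none,
# while its queue grew: it is consuming but not producing.
obs = observe(Snapshot(100, 50, 4, 0), Snapshot(180, 50, 37, 0))
```

Note that `producing` being false is ambiguous on its own, which is exactly why step 3 compares the module against its dependents and neighbors.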
To provide evidence in favor of one explanation over the other, we can examine the counters at M's predecessor P: if we find that P's counters have not increased either, then M is unlikely to be at fault; on the other hand, if P's counters have increased, then we can infer that messages are getting stuck at M and place more blame there. We can look at M's neighbors to make further inferences. In particular, a host generally has more than one open network connection, so if many flows are experiencing problems, then the culprit is likely a dependency held in common. All of these inferences are inexact, so our algorithm employs probabilistic certainty factors [5] to accumulate the weight of the evidence for or against a certain module, ul-
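Certainty factors combine independent pieces of evidence without requiring full probability distributions. The combination rule below is the standard one from the certainty-factor literature cited as [5]; the evidence values in the example are illustrative, not NeST's actual weights.

```python
def combine_cf(a: float, b: float) -> float:
    """Combine two certainty factors in [-1, 1].
    Positive values support a hypothesis; negative values refute it."""
    if a >= 0 and b >= 0:
        return a + b * (1 - a)        # reinforcing positive evidence
    if a < 0 and b < 0:
        return a + b * (1 + a)        # reinforcing negative evidence
    return (a + b) / (1 - min(abs(a), abs(b)))  # conflicting evidence

# Hypothetical evidence that module M is stuck:
# its predecessor P's counters advanced while M's did not (+0.6),
# and M's neighbor flows are healthy, so a shared dependency
# is unlikely to be at fault (+0.4).
cf = combine_cf(0.6, 0.4)   # 0.6 + 0.4 * (1 - 0.6) = 0.76
```

The rule is order-independent and saturates toward 1 (or -1) as consistent evidence accumulates, so no single observation has to be decisive.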