13

Efficient and fault-tolerant distributed host monitoring using system-level diagnosis*

M. Bearden and R. Bianchini, Jr.
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 USA
Telephone: 412-268-7105, Fax: 412-268-3890
mbearden@ece.cmu.edu, rpb@ece.cmu.edu

Abstract
This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes the information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault-tolerance is achieved by applying the results of previous system-level diagnosis research. An on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment. DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper include the extension of a distributed system-level diagnosis algorithm for reliable broadcast of current global state, and the specification of the DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced for guaranteeing delivery of the most recently broadcast update, without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed.
Extensions to this work include other fault and system models and applicability to other distributed applications requiring consistent distributed global state.

Keywords
Fault-tolerance, system-level diagnosis, distributed monitoring, reliable broadcast

* This research was supported in part by the National Science Foundation (NSF) under Grant CCR-9257973 and by an NSF Graduate Research Fellowship.

A. Schill et al. (eds.), Distributed Platforms © Springer Science+Business Media Dordrecht 1996