13
Efficient and fault-tolerant distributed host
monitoring using system-level diagnosis*
M. Bearden and R. Bianchini, Jr.
Department of Electrical and Computer Engineering,
Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 USA
Telephone: 412-268-7105, Fax: 412-268-3890
mbearden@ece.cmu.edu, rpb@ece.cmu.edu
Abstract
This paper presents an efficient and fault-tolerant distributed approach to monitoring the status
of processors in a network. The Distributed System Monitor (DSMon) is a distributed,
decentralized program that gathers processor information, such as CPU load, user information, and
network and disk statistics, in parallel at each processor and reliably distributes the information
on-line to all fault-free processors. Information is filtered at each processor and distributed at
different priorities to conserve communication resources. Fault-tolerance is achieved by
applying the results of previous system-level diagnosis research. An on-line distributed system-level
diagnosis algorithm that assumes the PMC fault model and a fully connected network is
extended to consistently maintain user-defined state information in an unreliable environment.
DSMon has been implemented and currently operates on approximately 200 networked
workstations in the Electrical and Computer Engineering Department at Carnegie Mellon
University. The key results of this paper include the extension of a distributed system-level
diagnosis algorithm for reliable broadcast of current global state, and the specification of the DSMon.
A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced for
guaranteeing delivery of the most recently broadcast update, without guaranteeing a complete
history of all broadcast updates. The DSMon implementation is described, and its operation in
an actual distributed network environment is analyzed. Extensions to this work include other
fault and system models and applicability to other distributed applications requiring consistent
distributed global state.
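
The idea behind condensed reliable broadcast, delivering the most recent update rather than a full history, can be illustrated with a minimal sketch. This is not the DSMon implementation: the names Update and CondensedState are hypothetical, and the use of per-source sequence numbers to decide which update is most recent is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Update:
    """One broadcast update (hypothetical structure, not from the paper)."""
    source: str   # identifier of the originating processor
    seq: int      # assumed per-source sequence number
    payload: Any  # monitored data, e.g. CPU load


@dataclass
class CondensedState:
    """Keeps only the most recent update seen from each source."""
    latest: Dict[str, Update] = field(default_factory=dict)

    def deliver(self, update: Update) -> bool:
        """Store the update if it is newer than what is held; report whether it was."""
        current = self.latest.get(update.source)
        if current is not None and current.seq >= update.seq:
            return False  # stale or duplicate: no history is kept, so drop it
        self.latest[update.source] = update
        return True

    def merge(self, other: "CondensedState") -> None:
        """Adopt any newer per-source updates held by another node."""
        for update in other.latest.values():
            self.deliver(update)
```

In this sketch a node that missed intermediate updates still converges on the newest value for each source after a merge, which is the relaxed guarantee the abstract describes.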
Keywords
Fault-tolerance, system-level diagnosis, distributed monitoring, reliable broadcast
* This research was supported in part by the National Science Foundation (NSF) under Grant CCR-9257973 and
by an NSF Graduate Research Fellowship.
A. Schill et al. (eds.), Distributed Platforms
© Springer Science+Business Media Dordrecht 1996