A Scalable and Robust Coordination Architecture for Distributed Management

Srinath Perera, Dennis Gannon
Computer Science Department, Indiana University, Bloomington IN 47405
{hperera, gannon}@cs.indiana.edu

Abstract

While opening avenues for unlimited possibilities, distributed systems have introduced management complexity as an unfavorable trait. Therefore, as distributed systems become commonplace, the automation of system management has become a primary challenge in information technology. The state of the art in system management assigns each managed resource to an external entity (a manager), which monitors, analyzes, and controls the resource; a collection of such managers manages a system. In such settings, each manager has to act with partial knowledge about the system, and to maintain the system as a whole in an acceptable state, those managers must be controlled and coordinated. This paper presents a scalable and robust coordination architecture for distributed management. The proposed architecture consists of a cloud of managers placed on a P2P network and a coordinator, which is re-elected on failure. Each resource in the system is assigned to a manager, and managers monitor the system and maintain a distributed data model that reflects the system state (a meta-model). Using the meta-model, each manager enforces a set of user-defined management rules to implement resource-level management, and global coordination is achieved using user-defined global management rules enforced by the coordinator. The main contributions of the paper are a coordination architecture for distributed management that supports election-based recovery, a meta-model that reflects the system state, and the application of rules on top of the meta-model to achieve manager coordination.

1. Introduction

Traditionally, distributed computing has focused on systems that share data (e.g., the Internet, database systems, and distributed file systems).
However, with the advent of computing paradigms like Service Oriented Architectures (SOA) and Grid computing, systems that closely coordinate hundreds to thousands of computers towards common goals are emerging. Distributed workflow systems, computation clouds, and stream processing are a few examples of such systems. Furthermore, another notable trend is to replace expensive supercomputers with groups of small commodity machines, each costing only a minute fraction of the former. For example, Google has claimed to have built its colossal architecture using thousands of cheap commodity machines.

Unlike their ancestors, these new systems must be closely coordinated to achieve common goals. As a result, the failure of a single component has more serious consequences than the failure of a single machine on the Internet. On the other hand, within a system that has thousands of independent components, failures are the norm rather than the exception. For example, if we build a system from units that each have a mean life of 5 years, and a thousand of them are put together, a unit will fail about once every two days on average. In such a gigantic deployment, a component may fail for thousands of reasons, ranging from network and hard disk failures to software errors. To avoid having systems brought to a grinding halt every couple of days, they need to be monitored and managed.

In such settings, the fate of promising concepts like Grid and Service Oriented Architecture will be decided by the availability of distributed monitoring and management solutions to sustain them. Therefore, building reliable and usable monitoring and management systems is a critical prerequisite for future distributed systems.

1.1. Distributed Management

A managed system is composed of manageable resources. A manageable resource is a part of the managed system that can be remotely managed and reasonably separated as an independent entity.
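
As a quick check of the failure estimate given in the Introduction (units with a 5-year mean life, a thousand of them deployed together), the arithmetic can be sketched as follows. This is a minimal sketch rather than part of the proposed architecture: it assumes that units fail independently at a constant rate, so that failure rates simply add up, and the function name `mean_days_between_failures` is hypothetical.

```python
# Assumption: each unit fails independently at a constant rate, so the
# failure rate of the whole system is the sum of the per-unit rates.
DAYS_PER_YEAR = 365


def mean_days_between_failures(unit_mean_life_years: float, unit_count: int) -> float:
    """Expected time, in days, between failures anywhere in the system."""
    unit_mean_life_days = unit_mean_life_years * DAYS_PER_YEAR
    # With unit_count identical units, failures arrive unit_count times as often.
    return unit_mean_life_days / unit_count


if __name__ == "__main__":
    days = mean_days_between_failures(unit_mean_life_years=5, unit_count=1000)
    print(f"about {days:.1f} days between failures")
```

Under these assumptions the result is a little under two days, consistent with the "once every two days" figure quoted in the Introduction.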
A manageable resource could be defined at different granularities; however, a service represents a typical manageable resource. A manageable resource includes sensors, which monitor the system state, and actuators, which allow a remote authority to con-