Unified Heterogeneous HPCC Hardware Management Framework

Yung-Chin Fang, Jeffrey Mayerson, Rizwan Ali, Monica Kashyap, Jenwei Hsieh, Tau Leng, Victor Mashayekhi
Dell Inc.
{Yung-Chin_fang; Jeffrey_mayerson; Rizwan_ali; Monica_kashyap; Jenwei_hsieh; Tau_leng; Victor_Mashayekhi}@dell.com

Abstract

Remote, hardware-level management of heterogeneous clusters (such as remotely power cycling a hung node) is a necessary task for a computer center. The task requires knowledge that spans multiple specifications, fabrics (hardware, firmware, software, and management), and implementations. In a heterogeneous cluster environment, hardware-level management interface implementations have little in common, so a heterogeneous HPCC, grid, or cyber-infrastructure environment needs a common hardware management interface that spans distinct architecture, platform, firmware, software, and management-fabric implementations. This paper presents the framework of a unified interface across heterogeneous clusters that overcomes these differences, and discusses findings from the prototyping process.

1. Management specifications

The management specifications described in this section are designed for the operational and deployment phases of the cluster life cycle model [1]. Their purpose is to enhance uptime and reduce total cost of ownership. These specifications enable the hardware, firmware, and software implementations on a platform to monitor and manage node-level hardware health, locally or remotely, without interfering with the node's computing performance.

The most frequently used HPC cluster management features are based on well-defined management specifications. For example, to deploy the operating system and cluster computing software stack to a new cluster, system administrators often use a hardware-level implementation of the remote power management specification to power up the nodes serially and trigger a Preboot Execution Environment (PXE) [2] or Extensible Firmware Interface (EFI) [3] network boot, through which the OS and software stack are deployed to every node in the cluster. PXE is usually implemented in an IA-32 platform's basic input/output system (BIOS) or in the network interface card's ROM; EFI-level network boot is implemented in the network interface's EFI-level driver. Remote power management is defined in the Advanced Configuration and Power Interface (ACPI) [4] specification, which is part of the Wired for Management (WfM) [5] specification, and is also addressed in the Intelligent Platform Management Interface (IPMI) [6] specification.

IPMI, WfM, and LM sensor management [7] are the three most commonly implemented specifications. LM sensor management is a de facto standard from the 1990s that uses an embedded management processor, such as the LM81 [8], together with an array of sensors (CPU temperature, voltage, fan RPM, and so on) to monitor and manage node-level hardware health. Dedicated management buses exist that are independent of the host data/address/control buses. An administrator can use an operating-system-level agent to query sensor readings and control the sensors via the LM processor.
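To make the remote deployment example above concrete, the following sketch shows how an administrator might script serial power-up and PXE network boot over IPMI. It is a minimal illustration, not part of the framework presented in this paper: it assumes the open-source ipmitool utility is installed and that each node's baseboard management controller (BMC) is reachable over the LAN; the node names and credentials are hypothetical placeholders.

    import subprocess
    import time

    # Hypothetical BMC addresses and credentials; placeholders only.
    NODES = ["node01-bmc", "node02-bmc", "node03-bmc"]
    USER, PASSWORD = "admin", "password"

    def ipmi(host, *args):
        """Run one ipmitool command against a node's BMC over the IPMI LAN interface."""
        cmd = ["ipmitool", "-I", "lanplus", "-H", host,
               "-U", USER, "-P", PASSWORD, *args]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    for node in NODES:
        ipmi(node, "chassis", "bootdev", "pxe")        # network-boot on next start
        status = ipmi(node, "chassis", "power", "status")
        if "off" in status:
            ipmi(node, "chassis", "power", "on")       # power up a node that is off
        else:
            ipmi(node, "chassis", "power", "cycle")    # power cycle a hung or running node
        time.sleep(5)  # stagger power-up serially between nodes

Powering the nodes up one at a time, as described above, also keeps simultaneous boot requests from flooding the PXE/DHCP deployment server.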
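The operating-system-level sensor query that closes this paragraph can likewise be sketched. On a modern Linux node, lm-sensors-style drivers expose the management processor's sensor data through the hwmon sysfs interface; the sketch below is a minimal, assumption-laden agent (attribute names and scaling vary by platform) that collects temperature, voltage, and fan-speed readings.

    from pathlib import Path

    def read_sensors():
        """Collect hardware-health readings from the Linux hwmon sysfs tree."""
        readings = {}
        for chip in Path("/sys/class/hwmon").glob("hwmon*"):
            name_file = chip / "name"
            chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
            for attr in chip.glob("*_input"):
                try:
                    raw = int(attr.read_text())
                except (OSError, ValueError):
                    continue  # skip unreadable or non-numeric attributes
                if attr.name.startswith("temp"):
                    value = raw / 1000.0   # millidegrees Celsius -> degrees Celsius
                elif attr.name.startswith("in"):
                    value = raw / 1000.0   # millivolts -> volts
                else:
                    value = raw            # fan*_input is RPM; other units left raw
                readings[f"{chip_name}/{attr.name}"] = value
        return readings

    if __name__ == "__main__":
        for sensor, value in sorted(read_sensors().items()):
            print(f"{sensor}: {value}")

An in-band agent would forward such readings to the management console rather than printing them, as described next.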
The operating-system-level agent can pass sensor readings to a centralized management console over a management fabric; this is usually referred to as in-band management. There are two common views on the definition of in-band management. The first