An Approach to Support the Design and the Dependability Analysis of High Performance I/O Intensive Distributed Systems

Lucas Bressan, Laércio Pioli, Mario A. R. Dantas, Fernanda Campos, and André L. de Oliveira

Programa de Pós-Graduação em Ciência da Computação, UFJF, Juiz de Fora, Brazil
lucasbressan@ice.ufjf.br

Abstract. Frequent service downtime and poor system performance can affect aspects such as availability and quality of experience, and can generate millions of dollars in lost revenue. High Performance Computing (HPC) environments are often required to comply with performance and dependability requirements. The CHESS methodology provides support for the design and the evaluation of dependability and performance system attributes. In this paper we extend the CHESS methodology to support the design and the dependability analysis of HPC environments. The proposed approach was applied to Grid'5000, a highly distributed and I/O intensive HPC environment. Its application provided key information for demonstrating dependability, deriving project decisions, and agreeing on new design choices and resource allocation strategies.

1 Introduction

Dependability is the ability of a system to operate as intended and to deliver its services when required and in a trusted manner [17]. It is broken down into availability, reliability, safety, security and resilience [21]. Fault tolerance relates to the capability of a system to continue operating as intended after encountering a failure [13]. Availability is directly related to fault tolerance and refers to the ability of a system to operate continuously by either protecting itself against failures or quickly recovering from them [19].

Distributed architectures such as High Performance Computing (HPC) environments are often required to meet performance and dependability requirements.
In certain domains (e.g., industrial, military, banking and e-health), long service response times, failures and momentary service downtimes can degrade the provided Quality of Experience (QoE) and lead to undesirable consequences, or even contribute to catastrophic ones. Thus, HPC environments must ensure their dependability and performance, and are sometimes required to implement redundancy, error detection and fault recovery capabilities [6] and to provide low I/O times and low data exchange latency [22].

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
L. Barolli et al. (Eds.): 3PGCIC 2020, LNNS 158, pp. 29–40, 2021.
https://doi.org/10.1007/978-3-030-61105-7_4