Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo

Dipartimento di Automatica e Informatica, Politecnico di Torino,
Corso Duca degli Abruzzi 24, 10129 Torino (Italy)
{name}.{surname}@polito.it

Abstract. In the last decade, we witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems are usually designed for CPU- and memory-intensive applications. However, the use of already available HPC infrastructures for data-intensive applications is an interesting topic, in particular in academia, where the budget is usually limited and the same cluster is used by many researchers with different requirements. In this paper, we investigate the integration of Hadoop, and its performance, in an already existing low-budget general purpose HPC cluster characterized by heterogeneous nodes and a low amount of secondary memory per node.

Keywords: HPC, Hadoop, MapReduce, MPI applications

1 Introduction

The amount of available data increases every day. This huge amount of data is a resource that, if properly exploited, provides useful knowledge. However, to extract useful knowledge from it, efficient and powerful systems are needed. One possible solution to this problem consists in adopting the Hadoop framework [6], which exploits the MapReduce [1] paradigm for the efficient implementation of data-intensive distributed applications.

Recent years have also witnessed the increasing availability of general purpose HPC systems [3], such as clusters, commonly installed in many computing centers.
They are usually used to provide different services to communities of users (e.g., academic researchers) with different requirements. These systems are usually designed for CPU- and memory-intensive applications. However, we have witnessed some attempts to integrate Hadoop into general purpose HPC systems as well, in particular in academia. Given the limited budgets, the integration of Hadoop into already available HPC systems is an interesting and fascinating problem. It would allow academic researchers to continue to use their current MPI-based applications and, at the same time, to exploit Hadoop to address new (data-intensive) problems without further costs.
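To make the MapReduce paradigm referred to above concrete, the following is a minimal sketch of its two phases (using the classic word-count example) in plain Java. It only mimics, on a single machine, the map and reduce steps that Hadoop distributes across cluster nodes; the class and method names are illustrative and not part of the Hadoop API.

```java
import java.util.*;
import java.util.stream.*;

// Single-machine sketch of the MapReduce paradigm (word count).
// In Hadoop, the map and reduce phases below would run in parallel
// on different nodes, with a shuffle phase grouping keys in between.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    // Run map over every input line, then reduce the emitted pairs.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(List.of("hadoop on hpc", "hadoop mapreduce"));
        System.out.println(counts.get("hadoop")); // prints 2
    }
}
```

Because the map function is stateless per input record and the reduce function only needs all values for one key, the framework can partition both phases freely across nodes, which is what makes the paradigm attractive for data-intensive workloads.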