Towards a Better Replica Management for Hadoop Distributed File System

Hilmi Egemen Ciritoglu*, Takfarinas Saber†, Teodora Sandra Buda‡, John Murphy*, Christina Thorpe*

*Performance Engineering Laboratory, School of Computer Science, University College Dublin, Dublin, Ireland
hilmi.egemen.ciritoglu@ucdconnect.ie, {christina.thorpe, j.murphy}@ucd.ie
†Natural Computing Research and Applications Group, School of Business, University College Dublin, Dublin, Ireland
takfarinas.saber@ucd.ie
‡Cognitive Computing Group, Innovation Exchange, IBM Ireland, Dublin, Ireland
tbuda@ie.ibm.com

Abstract—The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.

Keywords—Hadoop Distributed File System, Replication Factor, Software Performance.

I. INTRODUCTION

In recent years, the exponential growth of data, and the consequent need to process tremendous volumes of it, have resulted in a demand for efficient data analysis systems, i.e., systems that can support large-scale, data-intensive analytics. This has led many companies to store and analyse their data on distributed systems. One popular example is Hadoop [1], a software framework developed by the Apache Software Foundation to store and process big data in an efficient, reliable, and distributed manner. One of the main components of Hadoop is the Hadoop Distributed File System (HDFS) [2]. HDFS is responsible for storing large data sets on distributed machines. Different processing engines (e.g., MapReduce [3], Spark [4]) and applications (e.g., data warehouse systems such as Hive [5] and Pig [6]) run on top of HDFS. Therefore, optimising HDFS is critical for the performance of the Hadoop ecosystem, as any improvement to HDFS affects the overall system.

Replication is a well-known technique for improving the performance of HDFS [7], [8], as increasing the replication factor is directly linked to increasing data availability. Some proposals in the literature aim to increase the data availability of big data systems using adaptive replication factor frameworks. These frameworks assign a popularity ratio to each file in the system in either a proactive [9] or dynamic [10], [11], [12] way, and use this popularity ratio to define the replication factor. Hadoop systems are long-running systems; thus, the demand for files can change over time. Although varying the replication factor can yield significant performance gains, the placement of replicas is a crucial problem in clusters [7], [13], [14]. As the replication factor changes, so does the block density of each node, leading to performance degradation.
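The unbalancing effect of balance-unaware replica deletion can be illustrated with a toy model. This sketch is not HDFS's actual deletion policy: the node and block counts and both deletion heuristics below are illustrative assumptions. When each block loses a replica chosen by a fixed preference that ignores per-node load, per-node block counts drift apart; choosing the replica on the currently most-loaded node instead keeps them even.

```python
import random
from collections import Counter

def place(num_nodes, num_blocks, rep, seed=42):
    """Place `rep` replicas of each block on distinct, randomly chosen nodes."""
    rng = random.Random(seed)
    return {b: rng.sample(range(num_nodes), rep) for b in range(num_blocks)}

def loads(placement, num_nodes):
    """Number of block replicas currently stored on each node."""
    c = Counter(n for nodes in placement.values() for n in nodes)
    return [c[n] for n in range(num_nodes)]

def delete_unaware(placement):
    """Balance-unaware policy: always drop the replica on the lowest-id node
    (a stand-in for any fixed preference that ignores per-node load)."""
    return {b: sorted(nodes)[1:] for b, nodes in placement.items()}

def delete_balanced(placement):
    """Balanced policy: drop the replica on the currently most-loaded node."""
    load = Counter(n for nodes in placement.values() for n in nodes)
    out = {}
    for b, nodes in placement.items():
        victim = max(nodes, key=lambda n: load[n])
        load[victim] -= 1
        out[b] = [n for n in nodes if n != victim]
    return out

def spread(placement, num_nodes):
    """Imbalance metric: heaviest node's load minus lightest node's load."""
    l = loads(placement, num_nodes)
    return max(l) - min(l)

if __name__ == "__main__":
    p = place(num_nodes=10, num_blocks=2000, rep=3)  # replication factor 3 -> 2
    print("spread after unaware deletion: ", spread(delete_unaware(p), 10))
    print("spread after balanced deletion:", spread(delete_balanced(p), 10))
```

Running the sketch shows the fixed-preference policy draining low-id nodes while leaving high-id nodes untouched, whereas the load-aware choice keeps per-node counts nearly equal — the intuition behind making replica deletion balance-aware.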
Therefore, in homogeneous clusters, better data placement algorithms split data into equal chunks and distribute them evenly across nodes.

While there has been some work in the literature analysing the impact of increasing the replication factor [7], [8], to the best of our knowledge, there is no work analysing the effects of decreasing it. Our paper is the first to identify a major data unbalancing problem in Hadoop's replica deletion algorithm, which has the potential to significantly degrade the performance of the system. As a solution, we propose a novel Workload-aware Balanced Replica Deletion algorithm (WBRD) to prevent this unbalancing problem on Hadoop clusters. We investigate the performance enhancement of WBRD through a thorough performance evaluation.

The contributions of this paper can be summarised as follows: (i) we identify a data unbalancing problem that results in a major performance degradation when the replication factor is decreased, (ii) we formally define the replica deletion problem, and (iii) we propose a new deletion algorithm (WBRD) to address this problem, which improves performance by up to 48% with only a small overhead.

The remainder of this paper is organised as follows: Section II provides background information and related work about HDFS. Section III identifies and models the replica deletion problem in HDFS. Section IV details our novel WBRD algorithm. Section V describes the experimental environment. Section VI presents the results of our evaluation. Finally, Section VII concludes the paper.