IJSTE - International Journal of Science Technology & Engineering | Volume 1 | Issue 12 | June 2015 | ISSN (online): 2349-784X

Evaluation Parameters of Infrastructure Resources Required for Integrating Parallel Computing Algorithm and Distributed File System

Naveenkumar Jayakumar
Research Scholar, Department of Computer Engineering, Bharati Vidyapeeth Deemed University, College of Engineering, Pune - 43

Sneha Singh
UG Student, Department of Computer Engineering, Bharati Vidyapeeth Deemed University, College of Engineering, Pune - 43

Suhas H. Patil
Professor, Department of Computer Engineering, Bharati Vidyapeeth Deemed University, College of Engineering, Pune - 43

Shashank D. Joshi
Professor, Department of Computer Engineering, Bharati Vidyapeeth Deemed University, College of Engineering, Pune - 43

Abstract
Technology and the growing population of the digital world have driven a drastic explosion of data scattered across various digital components and network nodes. In parallel, various technologies are being enhanced and innovated in order to keep up with processing this proliferating raw data and converting it into useful information in various fields. Since both the data and the applications that process it are increasing quantitatively, the infrastructure also needs to be changed or upgraded in order to meet the current requirements. The question that arises is how the upgraded resources will be beneficial and in what way they impact the performance of the application. This paper focuses on understanding how infrastructure resources impact the end-to-end performance of a distributed computing platform, and which parameters should be considered with high priority when addressing performance issues in a distributed environment.
Keywords: Performance Evaluation, Active Storage, Storage Array, Resource Utilization, Parallel & Distributed Systems
________________________________________________________________________________________________________

I. INTRODUCTION

Nowadays, the buzz in distributed computing is MapReduce, and extensive research is under way on enhancing the performance of such distributed environments. Research has been carried out [1] on integrating various parallel computing models with distributed file systems, for example MapReduce integrated with Lustre. Lustre offers different benefits compared to HDFS. However, these integrations are typically deployed on the same infrastructure without reconsidering the infrastructure requirements themselves. MapReduce is a distributed computational algorithm widely used for large-scale jobs. Currently, MapReduce is commonly implemented with the help of the open-source Hadoop framework. By default, Hadoop uses HDFS (the Hadoop Distributed File System) as the storage layer for MapReduce. Instead of HDFS, Hadoop may be run on a distributed file system such as Lustre. Lustre is a parallel distributed file system used for large-scale cluster computing; its name is derived from two words, Linux and cluster. Linux is the platform on which MapReduce is implemented, and a cluster is a collection of computers, distant from each other but connected via the network. MapReduce breaks the input data into a number of limited-size chunks (the size is specified beforehand in the algorithm) and processes them in parallel. The algorithm first converts the data in the chunks into a group of intermediate key-value pairs in a set of Map tasks. Next is the shuffle phase, where the values belonging to each key are grouped together, and the Reduce tasks then process the combined key-value groups into the output data [2]. The MapReduce framework is written in Java, making it platform independent (as Java runs on Linux as well).
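The three phases described above can be illustrated with a minimal single-process sketch in Java, using word count as the example job. This is a sketch of the MapReduce model only, not the Hadoop API; the class and method names are our own illustrative choices.

```java
import java.util.*;

// Minimal single-process sketch of the MapReduce phases (map, shuffle, reduce),
// using word count as the example job. Illustrative only; not the Hadoop API.
public class MapReduceSketch {

    // Map phase: each input chunk is turned into intermediate (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group together all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine each key's grouped values into one output value.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two fixed-size "chunks" of input; in Hadoop each chunk is mapped in parallel.
        String[] chunks = { "the quick brown fox", "the lazy dog the fox" };
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String chunk : chunks) {
            intermediate.addAll(map(chunk));
        }
        System.out.println(reduce(shuffle(intermediate)));
    }
}
```

In a real Hadoop deployment the intermediate pairs are partitioned across the network to reducer nodes during the shuffle; here the grouping simply happens in one process to keep the phase boundaries visible.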
Here, Lustre can be argued to have an advantage over HDFS: once data is written to HDFS it cannot be modified, whereas Lustre supports in-place modification, storing file data on Object Storage Servers (OSSs) and metadata on Metadata Servers (MDSs). As stated earlier, Lustre is designed for large-scale, I/O-intensive, and performance-sensitive applications. Using Lustre as the backend for a Hadoop job allows flexibility in assigning mapper tasks, meaning all available nodes can be used for the same job without the network-placement constraints of HDFS, where the number and location of the mapper tasks for a specific job are fixed by the distribution of the input data. HDFS thus leaves much of the cluster idle. With Lustre, the data can be moved to any of the available resources, allowing full utilization of the nodes [3][4].
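One common way to run Hadoop over a POSIX-mounted file system such as Lustre is to point Hadoop at the shared mount through the local file-system scheme instead of HDFS. The following core-site.xml fragment is a hedged sketch of that setup; the mount point /mnt/lustre is a hypothetical example, and production deployments may instead use a dedicated Lustre adapter for Hadoop.

```xml
<!-- Hypothetical core-site.xml fragment: run Hadoop over a shared Lustre
     mount instead of HDFS. /mnt/lustre is an example mount point. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/lustre/hadoop/tmp</value>
  </property>
</configuration>
```

Because every node sees the same Lustre namespace, a mapper started on any node can read any input split, which is the flexibility in task placement described above.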