Abstract—Hadoop is an open-source software framework that enables the processing of large amounts of data in a distributed computing environment. Instead of depending on expensive hardware and separate systems for storing and processing data, Hadoop enables parallel processing of Big Data on clusters of commodity machines. The exponential growth of data first posed challenges to companies such as Google, Yahoo, Amazon, and Facebook, which had to sift through terabytes and petabytes of data to answer the particular queries and demands of users. Existing tools were becoming inadequate for processing such large data sets; Hadoop solved the problem of handling this huge amount of data. Hadoop has two core components: 1. The Hadoop Distributed File System (HDFS), which stores data as blocks distributed across the nodes of a cluster. 2. MapReduce, a software framework for distributed computing and processing of large amounts of data in the Hadoop cluster. In this paper we describe the framework of Hadoop and how Hadoop actually works on Big Data sets. The efficiency and scalability of the cluster depend heavily on the performance of the single NameNode.

Index Terms—Hadoop, HDFS, NameNode, DataNode, MapReduce

I. INTRODUCTION

The main challenge facing the IT world is to store and analyze huge quantities of data. Every single day, a huge amount of data is added from fields such as science, economics, engineering, and commerce. The need to analyze such huge amounts of data for a better understanding of users' needs has led to the development of data-intensive applications and storage clusters. The basic requirement of these applications is a highly available, highly scalable, and reliable cluster-based storage system. To cope with these problems, Google developed the MapReduce programming model. Hadoop is a popular open-source implementation of MapReduce, providing a software framework for reliable and scalable distributed computing.
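To make the MapReduce programming model concrete, the following is a minimal single-process sketch in Python (function names are ours, not a Hadoop API) of the three phases of a word-count job: map each input record to intermediate key-value pairs, shuffle (group values by key), and reduce each group to a result. In a real Hadoop job these phases run in parallel across many cluster nodes.

```python
from collections import defaultdict

# Map phase: each mapper turns an input record (a line of text)
# into intermediate (key, value) pairs -- here, (word, 1).
def map_phase(line):
    for word in line.lower().split():
        yield (word, 1)

# Shuffle phase: the framework groups all intermediate values by key,
# so that every value for a given word reaches the same reducer.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each reducer aggregates the value list for one key.
def reduce_phase(key, values):
    return (key, sum(values))

def word_count(lines):
    pairs = [p for line in lines for p in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data needs hadoop", "hadoop processes big data"])
print(counts)  # each word mapped to its total count across all lines
```

The key property Hadoop exploits is that the map and reduce steps are independent per record and per key, so the framework can distribute them freely over the cluster.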
Hadoop's architecture consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which together perform computation on a cluster of commodity computers. Many universities, enterprises, and industries work on distributed computing, but for newcomers the use of Big Data and its analysis is a difficult task due to lack of experience and insufficient money to invest in servers. For such users it is essential to use limited resources effectively and efficiently.

II. ARCHITECTURAL DESIGN OF HADOOP

A. NameNode

Hadoop has a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks [5]. The NameNode keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system. The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine. This means that the NameNode server doesn't double as a DataNode or a TaskTracker.

B. DataNode

Each slave machine in your cluster hosts a DataNode daemon to perform the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system [5]. The DataNodes can communicate with each other to rebalance data, move and copy data around, and keep the replication high. A client can communicate directly with the DataNode daemons to process the local files corresponding to the blocks.

Processing Big Data using Hadoop Framework
Prashant D. Londhe, Satish S. Kumbhar, Ramakant S. Sul, Amit J. Khadse
College of Engineering Pune
prashantlondhe10@gmail.com, ssk.comp@coep.ac.in, sul.ramakant@gmail.com, amitamit.khadse515@gmail.com
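The NameNode/DataNode split described in Section II can be sketched as follows. This is a simplified single-process illustration in Python (class and method names are ours, not Hadoop APIs): the NameNode holds only metadata mapping each file's blocks to DataNodes, while the DataNodes hold the actual block bytes, replicated three times as in HDFS's default configuration. Block size and replica placement here are toy stand-ins; HDFS uses 64/128 MB blocks and rack-aware placement.

```python
import itertools

BLOCK_SIZE = 8    # toy block size in bytes; HDFS defaults to 64/128 MB
REPLICATION = 3   # HDFS default replication factor

class DataNode:
    """Slave daemon: stores the actual block bytes on its local disk."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Master: stores only metadata (file -> blocks -> DataNodes)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}  # filename -> [(block_id, [DataNode, ...])]
        self._ids = itertools.count()

    def write_file(self, filename, data):
        placements = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = next(self._ids)
            # Choose REPLICATION distinct DataNodes for this block
            # (simple round-robin here, rack-aware placement in HDFS).
            targets = [self.datanodes[(block_id + j) % len(self.datanodes)]
                       for j in range(REPLICATION)]
            for dn in targets:
                dn.write_block(block_id, data[i:i + BLOCK_SIZE])
            placements.append((block_id, targets))
        self.block_map[filename] = placements

    def read_file(self, filename):
        # A client asks the NameNode where the blocks live, then reads
        # each block directly from one of its DataNodes.
        return b"".join(nodes[0].blocks[block_id]
                        for block_id, nodes in self.block_map[filename])

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes)
nn.write_file("log.txt", b"hadoop stores files as replicated blocks")
assert nn.read_file("log.txt") == b"hadoop stores files as replicated blocks"
```

Note that the NameNode never touches user data, only the block map, which is why (as the paper observes) its memory and I/O capacity bound the efficiency and scalability of the whole cluster.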