2022 3rd International Conference on Intelligent Engineering and Management (ICIEM)
978-1-6654-6756-8/22/$31.00 ©2022 IEEE

HADOOP: An Open-Source Framework for Big Data

Manish Kumar Gupta, Assistant Professor, Department of Computer Science and Engineering, Buddha Institute of Technology, GIDA, Gorakhpur, India, manish.testing09@gmail.com
Shrawan Kumar Pandey, Assistant Professor, Department of Computer Science and Engineering, Buddha Institute of Technology, GIDA, Gorakhpur, India, gupta.anish01@gmail.com
Anish Gupta, Professor, Department of Computer Science and Engineering, Apex Institute of Technology, Chandigarh University, Mohali, Punjab, India, gupta.anish1979@gmail.com

Abstract: In this paper we discuss HADOOP (High Availability Distributed Object Oriented Platform), an open-source framework for storing and processing huge amounts of data. HADOOP is written in Java. It follows a write-once-read-many access model (streaming access pattern): a file may be read as many times as you want, but its contents are not modified after they are written. A HADOOP cluster consists of heterogeneous computing devices built from commodity hardware, and comprises two core parts: HDFS (Hadoop Distributed File System), used for data storage, and MapReduce, used for data processing. HDFS is suitable for storing data sets from terabytes to petabytes across a cluster, and it runs on commodity hardware.

Keywords: HADOOP, HDFS, MapReduce, Heterogeneous, WORA

I. INTRODUCTION

Nowadays we live in a world of digital data, where data is growing exponentially and in an unstructured manner. There are three basic types of data: structured, semi-structured and unstructured. According to surveys, more than 80% of data is in unstructured format. Storing and processing this huge amount of unstructured data has therefore become a tedious task, for which traditional approaches do not work.
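The write-once-read-many (streaming access) pattern described in the abstract can be modelled with a short sketch. This is purely illustrative plain Java, not the HDFS API: the `WriteOnceFile` class and its methods are hypothetical names introduced here to show the access semantics, where a single write seals the file and only reads are allowed afterwards.

```java
// Illustrative sketch (NOT the HDFS API): models the write-once-read-many
// access pattern that HDFS enforces on files.
public class WriteOnceFile {
    private byte[] contents;       // set by the single allowed write
    private boolean sealed = false;

    // Writing is permitted exactly once; afterwards the file is immutable.
    public void write(byte[] data) {
        if (sealed) {
            throw new IllegalStateException("write-once: file already written");
        }
        contents = data.clone();
        sealed = true;
    }

    // Reading may happen any number of times without changing the contents.
    public byte[] read() {
        if (!sealed) {
            throw new IllegalStateException("nothing written yet");
        }
        return contents.clone();
    }

    public static void main(String[] args) {
        WriteOnceFile f = new WriteOnceFile();
        f.write("hello".getBytes());
        System.out.println(new String(f.read())); // prints "hello"
    }
}
```

A second call to `write` throws, mirroring the paper's point that file contents are never changed once written.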
To solve this problem, Google published papers on GFS (the Google File System), MapReduce and BigTable. Building on the GFS and MapReduce designs, Doug Cutting and Mike Cafarella introduced Hadoop in 2006. HADOOP is an open-source framework for storing and processing data in a distributed environment [1], [2]. It has become a powerful open-source framework for processing large data sets [3], in which each node performs both storage and processing functions.

Big Data is characterised by four V's [4]:
1. Volume – the scale of data
2. Velocity – the streaming of data
3. Variety – the different types of data
4. Veracity – the uncertainty of data

Figure 1. HADOOP Architecture

There are four components of HADOOP [3]:
1. HDFS
2. YARN
3. MapReduce
4. HADOOP Common

Figure 2. HADOOP Components

Later we will discuss each HADOOP component one by one. Five daemons run in the HADOOP ecosystem:
1. Name Node.

DOI: 10.1109/ICIEM54221.2022.9853179
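The MapReduce component listed above can be illustrated with the classic word-count computation. The sketch below is plain Java that simulates the map, shuffle-by-key and reduce phases in memory; it is a model of the programming paradigm, not code using the actual Hadoop `Mapper`/`Reducer` runtime, and the class and method names are our own.

```java
import java.util.*;

// Minimal in-memory model of the MapReduce word-count example:
// map emits (word, 1) pairs, reduce groups by key and sums the values.
public class WordCountModel {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle + reduce phase: group pairs by key and sum each group's values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"big data", "big hadoop"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // prints {big=2, data=1, hadoop=1}
    }
}
```

In a real Hadoop job the map tasks run in parallel on the cluster nodes holding the HDFS blocks, and the framework performs the shuffle between map and reduce; here all three phases simply run in one process to show the data flow.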