2022 3rd International Conference on Intelligent Engineering and Management (ICIEM)
978-1-6654-6756-8/22/$31.00 ©2022 IEEE
HADOOP - An Open Source Framework for Big Data
Manish Kumar Gupta
Assistant Professor
Department of Computer Science and
Engineering
Buddha Institute of Technology
GIDA, Gorakhpur, India
manish.testing09@gmail.com
Shrawan Kumar Pandey
Assistant Professor
Department of Computer Science and
Engineering
Buddha Institute of Technology
GIDA, Gorakhpur, India
gupta.anish01@gmail.com
Anish Gupta
Professor
Department of Computer Science and
Engineering
Apex Institute of Technology
Chandigarh University
Mohali, Punjab, India
gupta.anish1979@gmail.com
Abstract: In this paper we discuss HADOOP (High Availability
Distributed Object Oriented Platform), an open-source framework
for storing and processing huge volumes of data. HADOOP is
written in Java. It follows a write-once, read-many access model
(streaming access pattern): a file may be read any number of
times, but its contents are not modified after it is written. A
HADOOP cluster is built from heterogeneous commodity hardware
and has two core parts: HDFS (Hadoop Distributed File System)
and MapReduce. HDFS is used for data storage and MapReduce for
data processing. HDFS is suited to storing datasets ranging from
terabytes to petabytes on a cluster, and it runs on commodity
hardware.
Keywords: HADOOP, HDFS, MapReduce, Heterogeneous,
WORA
I. INTRODUCTION
Nowadays, we are living in a world of digital data, where data
grows exponentially and largely in unstructured form. There are
basically three types of data: structured, semi-structured, and
unstructured. According to surveys, more than 80% of data is in
unstructured format, so storing and processing this huge amount
of unstructured data has become a tedious task for which
traditional approaches do not work. To solve this problem, Google
published papers on the concepts of GFS (Google File System),
MapReduce, and BigTable. Doug Cutting and Mike Cafarella then
introduced Hadoop in 2006, implementing the GFS and MapReduce
concepts. HADOOP is an open-source framework for storing and
processing data in a distributed environment [1], [2]. It has
become a powerful open-source framework for processing large
datasets [3], in which each node performs both storage and
processing functions.
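The MapReduce programming model introduced above can be illustrated with the classic word-count example. The following is a toy, single-process sketch of the map, shuffle, and reduce phases over an in-memory list of lines; it only demonstrates the model, whereas a real Hadoop job would use the org.apache.hadoop API and distribute these phases across the cluster's nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, in-memory sketch of the MapReduce flow: map -> shuffle -> reduce.
public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle + reduce: group pairs by key and sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big cluster", "data node");
        System.out.println(reduce(map(input)));
        // {big=2, cluster=1, data=2, node=1}
    }
}
```

In actual Hadoop, the framework performs the shuffle step between nodes, so the programmer supplies only the map and reduce functions.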
Big Data is often characterized by four V's [4]:
1. Volume – Scale of Data
2. Velocity – Streaming of Data
3. Variety – Different types of Data
4. Veracity – Uncertainty of Data
Figure 1. HADOOP Architecture
There are four components of HADOOP [3]:
1. HDFS
2. YARN
3. Map Reduce
4. HADOOP Common
Figure 2. HADOOP Components
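To give a sense of the scale at which the HDFS storage component operates, the following back-of-the-envelope sketch estimates how a large file is laid out. It assumes the common defaults of a 128 MB block size and a replication factor of 3; both are configurable in a real cluster (via dfs.blocksize and dfs.replication), so these numbers are illustrative only.

```java
// Rough estimate of HDFS block layout for a file, assuming the
// common defaults: 128 MB blocks, replication factor 3.
public class HdfsBlockEstimate {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB per block
    static final int REPLICATION = 3;                  // copies of each block

    // Number of HDFS blocks a file of the given size occupies.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Raw cluster storage consumed once every block is replicated.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneTerabyte = 1024L * 1024 * 1024 * 1024;
        System.out.println(blockCount(oneTerabyte));      // 8192 blocks
        System.out.println(rawStorageBytes(oneTerabyte)); // 3 TB of raw storage
    }
}
```

Splitting files into fixed-size blocks replicated across commodity nodes is what lets HDFS scale from terabytes to petabytes while tolerating individual disk failures.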
Later we will discuss each HADOOP component one by one.
Five daemons run in the HADOOP ecosystem:
1. Name Node.
DOI: 10.1109/ICIEM54221.2022.9853179