International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 04 | Apr -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 3229
Efficient Resolution for the NameNode Memory Issue for the Access of
Small Files in HDFS
Deeksha S P¹, R Kanagavalli², Dr. Kavitha K S³, Dr. Kavitha C⁴
¹PG Student, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
²Associate Professor, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
³Professor, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
⁴Professor & HOD, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The Hadoop Distributed File System (HDFS) was initially designed for storing, processing, and accessing huge files, so large numbers of small files are not processed and accessed efficiently. This paper introduces an access-improvement approach for small files in HDFS, based on the MapFile component, called TLB-MapFile. TLB-MapFile merges many small files into large files through MapFile to lessen NameNode memory utilization, and adds a TLB table in the DataNode to enhance the access efficiency of small files. The method first merges small files into a large file and stores it in HDFS. The retrieval frequency of each file is then obtained from the system logs and stored in the TLB table, together with the location of the block where each small file is stored. The table is updated regularly. Thus the TLB-MapFile approach efficiently resolves the retrieval issues of small files by prefetching files based on the table.
Key Words: HDFS, small files, TLB-MapFile, retrieval, prefetching.
1. INTRODUCTION
Hadoop is an open-source software framework for storing, accessing, and processing huge datasets in a distributed environment. Hadoop is built on clusters of commodity hardware: each machine in the cluster stores part of the data and provides local computation, and the cluster can be extended to thousands of machines. Hadoop is derived from Google's file system and MapReduce [1]. It is also designed to detect and handle failures, and it can be used by applications that process large amounts of data with the help of a large number of independent computers in the cluster. In Hadoop's distributed architecture, both data and processing are distributed across multiple computers.
Hadoop consists of the Hadoop Distributed File System (HDFS), which is used for storing data, and the MapReduce programming model, which is used for processing it. HDFS is a Java-based file system that is scalable, distributed, and portable. HDFS is fault tolerant and highly scalable. It has a master-slave architecture with a single NameNode and multiple DataNodes. The NameNode stores the file system metadata and connects clients to files. The DataNodes store the actual data and are responsible for serving clients' read and write requests.
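The division of labour between the NameNode and DataNodes can be illustrated with a minimal sketch. All class and method names below are hypothetical illustrations of the idea, not the real HDFS client API: the client asks the NameNode only for block locations, then reads the bytes directly from the DataNodes.

```python
# Minimal sketch of the HDFS master-slave read path (illustrative only).

class NameNode:
    """Holds only metadata: file name -> list of (block_id, datanode_id)."""
    def __init__(self):
        self.metadata = {}

    def add_file(self, name, blocks):
        self.metadata[name] = blocks

    def get_block_locations(self, name):
        return self.metadata[name]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, name):
    # Step 1: metadata lookup on the single NameNode.
    locations = namenode.get_block_locations(name)
    # Step 2: the data itself is read from DataNodes, not the NameNode.
    return b"".join(datanodes[dn].read_block(bid) for bid, dn in locations)
```

Note that the NameNode never touches file contents; this is why its memory holds only metadata objects, and why that memory becomes the bottleneck when files are numerous.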
The structure with only one NameNode simplifies the file system, but HDFS was originally designed for storing and processing huge files, so small files saved in HDFS consume a disproportionate amount of NameNode memory. Every metadata object occupies about one hundred fifty bytes of memory [1]; assuming the number of small files reaches a thousand million, the metadata alone occupies on the order of 150 GB of memory. Similarly, a mass of small files causes a huge number of seeks back and forth in the DataNodes, so access time can be very high.
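The memory pressure can be quantified with a back-of-the-envelope calculation, using the figure of roughly 150 bytes of NameNode heap per metadata object cited above (the per-object cost and the file count are the paper's assumptions, not measurements):

```python
# Back-of-the-envelope NameNode heap estimate (illustrative figures only).
BYTES_PER_OBJECT = 150            # approximate heap cost per metadata object
num_small_files = 1_000_000_000   # a thousand million small files

# One metadata object per file; real HDFS also keeps block objects,
# so this is a lower bound.
heap_gb = num_small_files * BYTES_PER_OBJECT / 10**9
print(f"{heap_gb:.0f} GB")  # prints "150 GB"
```

Merging small files into large ones attacks exactly this term: fewer files means fewer metadata objects on the single NameNode.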
Based on MapFile, this paper proposes a new small-file access optimization scheme: TLB-MapFile. In this approach, small files are merged into large files, and the set of frequently accessed small files is obtained from access audit logs. The mapping data between blocks and small files is then saved in the TLB table and updated frequently. When a file is accessed again, its mapping data is retrieved from the TLB table, and the mapping data of related files is obtained as well. This prefetching mechanism makes the retrieval of small files fast.
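A minimal sketch of the TLB idea follows. The data structures, capacity, and refresh policy here are assumptions for illustration, not the paper's implementation: access counts mined from the log decide which small-file-to-block mappings stay resident in the TLB table, so hot files skip the full index lookup.

```python
from collections import Counter

class TLBTable:
    """Caches block-location mappings for frequently accessed small files."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = {}   # small file name -> (merged file, offset)

    def refresh(self, access_log, full_index):
        # Keep only the `capacity` hottest files, as mined from the log.
        hottest = Counter(access_log).most_common(self.capacity)
        self.entries = {name: full_index[name] for name, _ in hottest}

    def lookup(self, name, full_index):
        # Fast path: hit in the TLB table; slow path: full index lookup.
        return self.entries.get(name) or full_index[name]
```

Calling refresh periodically corresponds to the regular table updates described above; a lookup that hits the table avoids the round trip through the merged file's full index.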
2. LITERATURE SURVEY
A small file is a file whose size is less than the HDFS default block size, which is 64 MB. To enhance the access efficiency of small files, several scholars have carried out related studies.
To locate small files quickly, a common strategy is to merge small files into big ones via merge and index mechanisms.
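The merge-and-index strategy can be sketched as follows. This is a simplified in-memory version, not the actual HAR or SequenceFile implementation; real schemes write the merged file and its index into HDFS itself.

```python
def merge_small_files(files):
    """Concatenate small files into one large blob plus an offset index."""
    index = {}            # name -> (offset, length)
    merged = bytearray()
    for name, data in files.items():
        index[name] = (len(merged), len(data))
        merged.extend(data)
    return bytes(merged), index

def read_small_file(merged, index, name):
    # One index lookup replaces a per-file metadata object on the NameNode.
    offset, length = index[name]
    return merged[offset:offset + length]
```

The NameNode then tracks one large file instead of thousands of small ones; the cost moves to the index lookup, which is exactly what the TLB-MapFile scheme later tries to accelerate.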
The HDFS distributed file storage system comes with built-in small-file handling mechanisms: Hadoop Archive (HAR) [2] and SequenceFile [3-4].
Hadoop Archive (HAR) is specifically used to archive files in HDFS to decrease the memory utilization of the NameNode.