International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 04 Issue: 04 | Apr 2017 | www.irjet.net | © 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 3229

Efficient Resolution for the NameNode Memory Issue for the Access of Small Files in HDFS

Deeksha S P (1), R Kanagavalli (2), Dr. Kavitha K S (3), Dr. Kavitha C (4)

1 PG Student, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
2 Associate Professor, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
3 Professor, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India
4 Professor & HOD, Department of CSE, Global Academy of Technology, Bengaluru, Karnataka, India

Abstract - The Hadoop Distributed File System (HDFS) was originally designed for storing, processing, and accessing huge files, so large numbers of small files are neither stored nor accessed efficiently. This paper introduces an access-improvement approach for small files in HDFS, based on MapFile, called TLB-MapFile. The approach merges many small files into large files through the MapFile component to reduce NameNode memory consumption, and adds a TLB table in the DataNode to improve the access efficiency of small files. The method first merges small files into a large file and stores it in HDFS. The retrieval frequency of each file is then obtained from the system access logs and stored in the TLB table, together with the location of the block in which each small file is stored. The table is updated regularly. The TLB-MapFile approach thus efficiently resolves the retrieval issues of small files by prefetching files based on the table.

Key Words: HDFS, small files, TLB-MapFile, retrieval, prefetching

1. INTRODUCTION

Hadoop is an open-source software framework for storing, accessing, and processing huge datasets in a distributed environment. Hadoop runs on clusters of commodity hardware: each machine in the cluster stores part of the data and provides local computation, and a cluster can be extended to thousands of machines. Hadoop is derived from Google's file system and MapReduce [1]. It is also designed to detect and handle failures, and it suits applications that process large amounts of data across many independent computers in a cluster. In Hadoop's distributed architecture, both data and processing are spread across multiple computers. Hadoop consists of the Hadoop Distributed File System (HDFS), used for storing data, and the MapReduce programming model, used for processing it.

HDFS is a Java-based file system that is scalable, distributed, and portable. It is fault tolerant and highly scalable, and it follows a master-slave architecture with a single NameNode and multiple DataNodes. The NameNode stores the file system metadata and connects clients to files. The DataNodes store the actual data and respond to clients' read and write requests. The structure with only one NameNode simplifies the file system, but HDFS was designed in the first place for storing and processing huge files, so small files saved in HDFS consume disproportionate memory in the NameNode. Every metadata object occupies about one hundred fifty bytes of memory [1]; assuming the quantity of small files reaches a thousand million, the file metadata alone occupies on the order of 150 GB (and more in practice, since each file also requires block metadata). Similarly, a mass of small files causes a huge number of disk seeks back and forth in the DataNodes, so access time can be very high. Based on MapFile, this paper provides a new small-file access optimization scheme: TLB-MapFile.
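The memory pressure described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes roughly 150 bytes per metadata object, as cited in the text, and the `objects_per_file` parameter is a simplifying assumption, not a measured value.

```python
# Rough estimate of NameNode heap consumed by file metadata.
# Assumes ~150 bytes per metadata object, as cited in the text.

BYTES_PER_OBJECT = 150  # approximate NameNode memory per metadata object

def namenode_memory_gb(num_files, objects_per_file=1):
    """Estimate NameNode heap used by file metadata, in decimal gigabytes."""
    total_bytes = num_files * objects_per_file * BYTES_PER_OBJECT
    return total_bytes / 1e9

# A thousand million (1e9) small files, counting only the file objects:
print(round(namenode_memory_gb(1_000_000_000), 1))  # -> 150.0 (GB)

# Counting a file object plus one block object per small file:
print(round(namenode_memory_gb(1_000_000_000, objects_per_file=2), 1))  # -> 300.0 (GB)
```

Since every small file occupies at least one block, the true per-file cost includes block metadata as well, which is why merging small files into large ones shrinks the NameNode footprint so dramatically.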
In this approach, small files are first merged into large files; then the set of small files that are accessed with high frequency is identified from the access audit logs. The mapping between each small file and the block that stores it is saved in the TLB table, which is updated regularly. When a file is accessed again, its mapping is retrieved from the TLB table, and the mappings of related files are obtained as well through a prefetching mechanism, which speeds up subsequent reads of small files.

2. LITERATURE SURVEY

A small file is a file whose size is less than the HDFS default block size of 64 MB. To improve the access efficiency of small files, several scholars have carried out related studies. A common strategy for locating small files quickly is to merge them into big files via merge-and-index mechanisms. HDFS comes with small-file handling mechanisms of its own: Hadoop Archive (HAR) [2] and SequenceFile [3-4]. Hadoop Archive (HAR) is specifically used to archive files in HDFS in order to decrease the memory utilization of the NameNode.
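The merge-and-index strategy and the TLB lookup described above can be sketched in miniature. This is an illustration under simplifying assumptions, not the paper's implementation: an ordinary byte buffer stands in for the merged HDFS file, a dictionary stands in for the TLB table held in the DataNode, and all class and method names (`MergedFile`, `TLBTable`, `update_from_log`) are hypothetical.

```python
from collections import Counter

class MergedFile:
    """Merge many small files into one large buffer plus an index
    (MapFile-style: index maps file name -> (offset, length))."""
    def __init__(self):
        self.data = bytearray()
        self.index = {}

    def append(self, name, content):
        self.index[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read(self, name):
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

class TLBTable:
    """TLB table: caches the index entries of the most frequently
    accessed small files, with counts derived from an access log."""
    def __init__(self, merged, capacity=2):
        self.merged = merged
        self.capacity = capacity
        self.counts = Counter()
        self.entries = {}  # name -> (offset, length)

    def update_from_log(self, access_log):
        # Refresh access frequencies, then keep only the hottest entries.
        self.counts.update(access_log)
        hottest = [n for n, _ in self.counts.most_common(self.capacity)]
        self.entries = {n: self.merged.index[n] for n in hottest}

    def read(self, name):
        # TLB hit: use the cached mapping directly. Miss: fall back to
        # the full index (a NameNode lookup in real HDFS).
        if name in self.entries:
            offset, length = self.entries[name]
            return bytes(self.merged.data[offset:offset + length])
        return self.merged.read(name)

merged = MergedFile()
for i in range(5):
    merged.append(f"small{i}.txt", f"payload-{i}".encode())

tlb = TLBTable(merged, capacity=2)
tlb.update_from_log(["small3.txt", "small3.txt", "small1.txt", "small0.txt"])
print(sorted(tlb.entries))       # the two hottest files are cached
print(tlb.read("small3.txt"))    # served from the TLB entry
```

In the real scheme the TLB table lives in the DataNode and prefetching pulls in the mappings of files related to the one just accessed; here the `capacity` cutoff merely illustrates that only high-frequency files occupy the table.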