International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 03 Issue: 04 | Apr-2016 | www.irjet.net
© 2016, IRJET | ISO 9001:2008 Certified Journal | Page 939

Data Security in Hadoop Distributed File System

Sharifnawaj Y. Inamdar 1, Ajit H. Jadhav 2, Rohit B. Desai 3, Pravin S. Shinde 4, Indrajeet M. Ghadage 5, Amit A. Gaikwad 6
1 Professor, Department of Computer Science & Engineering, DACOE Karad, Maharashtra, India
2,3,4,5,6 Student, Final Year B.E., Computer Science & Engineering, DACOE Karad, Maharashtra, India

Abstract - Hadoop is the most widely used distributed programming framework for processing large volumes of data with the Hadoop Distributed File System (HDFS), but processing personal or sensitive data in a distributed environment demands secure computing. Hadoop was originally designed without any security model. In this project, HDFS security is implemented by encrypting each file before it is stored in HDFS, using a real-time encryption algorithm. Only a user who holds the decryption key can decrypt the data and access it for data mining. User authentication is also performed by the system. We compared this method with a previously implemented approach, namely encryption and decryption using AES. Encrypting with AES grows the file to double its original size, which in turn increases file upload time; the technique used in this project removes this drawback. In our method, OAuth performs authentication and issues a unique authorization token to each user; this token is used in the encryption scheme to provide data privacy for all Hadoop users. The real-time encryption algorithm that secures data in HDFS uses a key generated from the authorization token.
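The abstract's token-keyed scheme can be sketched as follows. The paper's real-time encryption algorithm is not specified in this section, so this is only an illustrative sketch: it derives a per-user AES key by hashing the OAuth authorization token and encrypts with AES in CTR mode (a mode whose ciphertext has the same length as the plaintext, so the stored file does not grow). All class, method, and token names here are hypothetical, not taken from the paper, and a real deployment must use a unique IV per file rather than the fixed IV shown.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class TokenKeyedCipher {

    // Hash the OAuth authorization token down to a 128-bit AES key.
    static SecretKeySpec keyFromToken(String oauthToken) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(oauthToken.getBytes(StandardCharsets.UTF_8));
        return new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");
    }

    // Encrypt or decrypt (CTR is symmetric) with the token-derived key.
    static byte[] crypt(int mode, byte[] data, String token) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        // Zero IV for the sketch only; production code needs a unique IV per file.
        c.init(mode, keyFromToken(token), new IvParameterSpec(new byte[16]));
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        String token = "sample-oauth-token"; // assumption: issued by the OAuth server
        byte[] plain = "sensitive HDFS record".getBytes(StandardCharsets.UTF_8);

        byte[] cipherText = crypt(Cipher.ENCRYPT_MODE, plain, token);
        byte[] roundTrip  = crypt(Cipher.DECRYPT_MODE, cipherText, token);

        // CTR mode: ciphertext length equals plaintext length (no file growth).
        System.out.println(cipherText.length == plain.length);
        System.out.println(Arrays.equals(plain, roundTrip));
    }
}
```

Because only the holder of the OAuth token can re-derive the key, each user's files remain private even though all users share the same HDFS cluster.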
Key Words: Hadoop, Big Data, Security, HDFS, OAuth

1. INTRODUCTION

Hadoop was developed from the Google File System (GFS) [2, 3] and MapReduce papers published by Google in 2003 and 2004 respectively. Hadoop is a framework of tools, implemented in Java, that supports running applications on big data.

1.1 Project Idea

Hadoop was designed without considering the security of data. Data stored in HDFS is kept in plaintext and is therefore exposed to access by unauthorized users, so a method for securing this data is needed. Hence we are developing this highly secure system for the Hadoop Distributed File System.

1.2 Need of Project

Hadoop generally runs on large clusters, possibly in a public cloud service. Amazon, Yahoo, Google, and others operate such public clouds, where many users can run their jobs using Elastic MapReduce and the distributed storage provided by Hadoop. It is essential to secure user data in such systems. The web produces an enormous amount of data every day: structured data accounts for roughly 32% of it and unstructured data for about 63%. Moreover, the volume of digital content on the web grew to more than 2.7 ZB in 2012, 48% more than in 2011, and is soaring toward more than 8 ZB by 2015. Every industry and business organization holds critical data about its products, production, and market surveys; this big data is valuable for productivity growth.

Fig-1: System Architecture

The files in the Hadoop Distributed File System (HDFS) are divided into multiple blocks and replicated to other DataNodes (two additional nodes by default) to ensure high data availability and durability if the execution of a job (a parallel application in the Hadoop environment) fails. Originally, Hadoop clusters have two types of node operating as master-