Big Data Processing Using Hadoop MapReduce Programming Model

Anumol Johnson #1, Master of Technology, Computer Science and Engineering, Sahrdaya College of Engineering and Technology, Calicut University, Kerala
Havinash P.H #2, Assistant Professor, Computer Science and Engineering, Sahrdaya College of Engineering and Technology, Calicut University, Kerala
Vince Paul #3, Head of the Department, Computer Science and Engineering, Sahrdaya College of Engineering and Technology, Calicut University, Kerala
Sankaranarayanan P.N #4, Assistant Professor, Computer Science and Engineering, Sahrdaya College of Engineering and Technology, Calicut University, Kerala

Abstract— Processing data is a central challenge in today's age of information technology. Data volumes now reach terabytes and petabytes: the data is too big, moves too fast, or does not fit the structures of current database architectures. Big Data is typically a large volume of unstructured and structured data created by various organized and unorganized applications and activities, such as emails, web logs, Facebook, etc. The main difficulties with Big Data include capture, storage, search, sharing, analysis, and visualization; even very large data warehouses are unable to satisfy these storage needs. Hence many companies today use a framework called Hadoop in their applications. Hadoop is designed to store large data sets reliably. It is open-source software that supports parallel and distributed data processing. Along with reliability and scalability, Hadoop also provides a fault-tolerance mechanism by which the system continues to function correctly even when some components fail. Fault tolerance is achieved mainly through data duplication: copies of the same data sets are kept on two or more data nodes.
MapReduce is a programming model and an associated implementation for processing and generating large datasets, flexible enough to express a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.

Keywords— Big data, Hadoop, Distributed file system, MapReduce

I. INTRODUCTION

The emerging big-data paradigm, owing to its broad impact, has profoundly transformed our society and will continue to attract attention from both technological experts and the general public. It is obvious that we are living in a data-deluge era, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. Data volumes to be processed by cloud applications are growing much faster than computing power. For instance, an IDC report predicts that from 2005 to 2020 the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, a doubling roughly every two years. The term "big data" was coined to capture the profound meaning of this data-explosion trend; indeed, data has been touted as the new oil and is expected to transform our society. The huge potential associated with big data has given rise to an emerging research area that has quickly attracted tremendous interest from diverse sectors, for example industry, government, and the research community. Governments have also played a major role in creating new programs to accelerate progress on big-data challenges. This growth demands new strategies for processing and analyzing information, and Hadoop has emerged as a powerful computation model that addresses these problems.
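The map and reduce functions described above can be illustrated with the classic word-count task. The following is a minimal single-process Python sketch of the model's three phases (map, shuffle, reduce); the function names are illustrative, not Hadoop's actual API, and a real Hadoop job would distribute these phases across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key (here, sum the counts)."""
    return (key, sum(values))

def mapreduce(documents):
    # Apply the map function to every document, group by key, then reduce.
    intermediate = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = mapreduce(["big data", "big clusters process big data"])
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1, 'process': 1}
```

Because each map call touches only its own input split and each reduce call touches only one key's values, the runtime is free to run many of them in parallel on different machines, which is the property the Hadoop framework exploits.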
Hadoop HDFS has become the most popular of the Big Data tools because it is open source, scales flexibly, has a lower total cost of ownership, and allows data of any form to be stored without data types or schemas being defined in advance. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallelizing computation in large clusters of

Anumol Johnson et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (1), 2015, 127-132 www.ijcsit.com 127