Volume IV, Issue VIII, August 2015 IJLTEMAS ISSN 2278-2540 www.ijltemas.in Page 73

Map Reduce Programming for Electronic Medical Records Data Analysis on Cloud using Apache Hadoop, Hive and Sqoop

Sreekanth Rallapalli#, Gondkar RR*
# Research Scholar, R&D Center, Bharathiyar University, Coimbatore, Tamilnadu
* Professor, Department of IT, AIT, Bangalore

Abstract—Health care organizations have now made a strategic decision to turn the huge volumes of medical data coming from various sources into competitive advantage. This helps health care organizations monitor any abnormal measurements that require immediate reaction. Apache Hadoop has emerged as a software framework for distributed processing of large datasets across large clusters of computers. Hadoop is based on a simple programming model called MapReduce. Hive is a data warehousing framework built on top of Hadoop, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. As health care systems and Electronic Medical Records (EMR) generate huge amounts of data, it is necessary to store, extract and load such big data using a framework that supports distributed processing. The cloud computing model provides efficient resources to store and process the data. In this paper we propose MapReduce programming for Hadoop that can analyze EMR data on the cloud. Hive is used to analyze large volumes of healthcare and medical record data. Sqoop is used for easy import and export of data between Hadoop and structured data stores such as relational databases, enterprise data warehouses and NoSQL systems.

Keywords—Hadoop; MapReduce; EMR; Hive; Healthcare

I. APACHE HADOOP FOR BIG DATA

User interactions online generate huge volumes of data, and in order to extend their services to the scale of that data, companies such as Google, Facebook, Yahoo and many others have scaled up the capabilities of traditional Information Technology architectures.
In order to store, extract, transform and load this big data, these companies built their own core infrastructure components rapidly, and papers were published on many of these components. All of these components were open source. Apache Hadoop has become a standard for managing large volumes of unstructured data [1]. Hadoop is an open source distributed software platform for storing and processing data. It can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by adding inexpensive nodes to the cluster.

MapReduce programming helps programmers solve data-parallel problems in which the data set can be sub-divided into smaller parts and processed independently. The system splits the data into multiple chunks, each of which is assigned to a map task that processes its chunk in parallel. A map task reads its input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as output. A master node distributes the work to worker nodes; each worker node performs its assigned task and returns the result to the master (data processing). Once the master has collected the results from the worker nodes, the reduce step takes over and combines all the intermediate results into the final output (data collection and digesting).

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [4]. Most health care organizations today are moving to cloud architectures to store and process their huge volumes of medical data.

II. HADOOP ARCHITECTURE

The Hadoop architecture is shown in Fig. 1. The Hadoop Distributed File System (HDFS) is a distributed file system that provides fault tolerance and is designed to run on commodity hardware.
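The map, shuffle and reduce steps described in Section I can be sketched in a few lines of pure Python. This is only a single-process illustration of the programming model, not actual Hadoop code; the EMR record format, the patient identifiers and the abnormal-reading threshold are illustrative assumptions introduced here.

```python
from collections import defaultdict

# Toy EMR records: (patient_id, systolic_bp). The format and the
# threshold of 140 are illustrative assumptions, not from the paper.
records = [
    ("p01", 118), ("p02", 165), ("p01", 172),
    ("p02", 150), ("p03", 121), ("p01", 160),
]

def map_phase(record):
    """Map task: read input and emit intermediate (key, value)
    pairs — here (patient_id, 1) for every abnormal reading."""
    patient_id, systolic = record
    if systolic > 140:
        yield (patient_id, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a result."""
    return (key, sum(values))

intermediate = [pair for rec in records for pair in map_phase(rec)]
results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# results now maps each patient to their count of abnormal readings
```

In real Hadoop the map and reduce functions run as distributed tasks on worker nodes, and the shuffle is performed by the framework across the network; the logical flow, however, is exactly the one above.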
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. Hadoop uses HDFS to store data across thousands of servers and provides a means of running work (Map/Reduce jobs) across those machines, moving code to the data. HDFS has a master/slave architecture. Hadoop runs on large clusters of commodity machines or on cloud computing services, and it scales linearly to handle larger data sets by adding more nodes to the cluster.
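The fault-tolerance claim above rests on HDFS splitting each file into fixed-size blocks and replicating every block on several nodes, with the master deciding the placement. A toy simulation can make this concrete; the block size, node names and round-robin placement policy here are illustrative simplifications (real HDFS uses much larger blocks, typically 64–128 MB, a default replication factor of 3, and rack-aware placement).

```python
from itertools import cycle

def place_blocks(data: bytes, block_size: int, nodes: list, replication: int):
    """Split `data` into fixed-size blocks and assign each block to
    `replication` distinct nodes round-robin — a toy stand-in for the
    placement decisions the HDFS master (NameNode) makes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ring = cycle(range(len(nodes)))
    placement = {}
    for idx in range(len(blocks)):
        start = next(ring)
        placement[idx] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return blocks, placement

data = b"patient records ..." * 10   # stand-in for a large EMR file
blocks, placement = place_blocks(
    data, block_size=64, nodes=["n1", "n2", "n3", "n4"], replication=3
)
# Every block lives on 3 of the 4 nodes, so the loss of any single
# node still leaves at least two copies of each block readable.
```

Scaling out is then just adding entries to the node list: more nodes give the placement policy more targets, which is the sense in which the cluster grows linearly with added hardware.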