International Journal of Computer Applications (0975 – 8887)
Volume 90 – No 10, March 2014

MRDS Data Processing and Mining using Hadoop in Cloud

Ravindra P. Bachate
Department of Computer Engineering, JSPM’s JSCOE, Hadapsar, Pune, Maharashtra, India

H. A. Hingoliwala
Department of Computer Engineering, JSPM’s JSCOE, Hadapsar, Pune, Maharashtra, India

ABSTRACT
This project explores the use of the Hadoop framework for processing and mining MRDS (Mineral Resources Data System) data in the cloud. Cloud computing provides efficient computation and analysis for large data sets. To improve system performance on massive data, Hadoop provides the Map Reduce technique. Hadoop also has a distributed file system (HDFS) that stores data on the cluster nodes. This project focuses on providing real-time information about mineral resources stored in a cloud environment with minimum data processing time. Storing MRDS data in the cloud ensures its availability and reliability.

Keywords
Hadoop, cloud computing, data processing, data mining

1. INTRODUCTION
Due to rapid development in various sectors, the size of data increases day by day. A single computer can read about 30-35 MB of data per second. For example, if the data size is 100 TB, it will take approximately one month to process it [4]. Extracting important information from the available data therefore requires a great deal of data mining and data processing. Our world is data driven. For example, science has databases from astronomy, genomics, transportation, the environment, etc., and the same holds for medicine, entertainment, commerce, the humanities and social sciences [4]. Data that challenges current technologies to store, process and use it is called big data. Big data are high-volume and high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
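The single-machine estimate quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `scan_days`, the decimal units (1 TB = 10^12 bytes) and the 100-machine cluster with perfectly linear scaling are assumptions made for this example, not figures from the cited source.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte, assumed for this estimate
MB = 10**6   # decimal megabyte, assumed for this estimate


def scan_days(data_bytes: float, rate_bytes_per_s: float, machines: int = 1) -> float:
    """Days needed to read data_bytes once, split evenly across machines.

    Assumes ideal linear scaling, which a real cluster will not fully reach.
    """
    seconds = data_bytes / (rate_bytes_per_s * machines)
    return seconds / SECONDS_PER_DAY


# One machine at the two ends of the 30-35 MB/s range.
slow = scan_days(100 * TB, 30 * MB)
fast = scan_days(100 * TB, 35 * MB)

# Hypothetical cluster of 100 machines reading in parallel.
cluster = scan_days(100 * TB, 35 * MB, machines=100)

print(f"single machine: {fast:.1f} to {slow:.1f} days")
print(f"100 machines:   {cluster * 24:.1f} hours")
```

Under these assumptions a single machine needs roughly 33 to 39 days, which agrees with the one-month figure, while the idealized cluster finishes in a matter of hours. This gap is the motivation for distributing the work.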
Nowadays more than 80% of businesses rely on the cloud because of the services and features that a cloud environment provides. The cloud is a repository of such huge data, and the challenge for a cloud provider is to manage and process the data in it.

Mineral Resources Data System (MRDS) is a collection of data describing metallic and nonmetallic mineral resources throughout the world [8]. It includes resource name, location, commodity, geologic characteristics, resource description, production, reserves, and references. Because MRDS contains mineral resources data from around the world, it is large and complex. If the data size goes beyond a terabyte, it is difficult to manage and process the data using an RDBMS, because the performance of an RDBMS decreases as data size increases. To make MRDS data available all the time, we need to keep it in the cloud environment. The traditional approach to dealing with such massive data is ETL, i.e. extract, transform and load it into an RDBMS. But spatial data is available in an unstructured format, and it is not easy for an RDBMS to cope with it.

Fig. 1 Map Reduce Architecture

To deal with big data like MRDS in the cloud, we need a technology that can cope with it. There are two options available: parallel DBMS and Hadoop Map Reduce. Hadoop Map Reduce gives better data processing performance at lower cost and in less time than a parallel DBMS because it works with commodity hardware. Hadoop has the HDFS file system for storing big data. The Hadoop framework provides a solution to the problems of massive data processing because it runs applications on large clusters built of commodity hardware with failure tolerance [5]. Unstructured data can be processed with the Hadoop Map Reduce technique, which is not possible with an RDBMS. Map Reduce provides flexibility and fault tolerance, which a parallel DBMS does not.
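The Map Reduce model referred to above can be sketched in a single Python process. This is a minimal illustration of the map, shuffle and reduce phases and does not use the Hadoop API. The record layout (tab-separated name, country, commodity), the sample records and all function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical MRDS-like records: name, country, commodity (tab-separated).
SAMPLE_LINES = [
    "Site A\tIndia\tIron",
    "Site B\tIndia\tCopper",
    "Site C\tChile\tCopper",
    "malformed line without tabs",
    "Site D\tPeru\tCopper",
]


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (commodity, 1) for each well-formed record."""
    fields = line.split("\t")
    if len(fields) == 3:  # skip records that do not match the assumed layout
        yield fields[2].strip().lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(SAMPLE_LINES)
print(counts)
```

In a real Hadoop job the map and reduce functions would run on different cluster nodes, and the framework, not the user code, would handle the grouping step, scheduling and recovery from node failures.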
Map Reduce provides automatic parallelization, data partitioning and task scheduling, handles machine failures and manages inter-machine communication. Hadoop is completely transparent to the end user. Unstructured data is growing much faster than structured data. Unstructured data includes media files, heavy text files, etc.

2. LITERATURE SURVEY
Hongyong Yu, Deshuai Wang [1] proposed a system for data processing and mining log data of SaaS cloud using Hadoop. We focused on Hadoop’s Map Reduce technique and the algorithm used for data mining by Hongyong Yu, Deshuai