IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. IV (Jan – Feb. 2016), PP 23-29
www.iosrjournals.org
DOI: 10.9790/0661-18142329

A Review: Hadoop Storage and Clustering Algorithms

Latika Kakkar 1, Gaurav Mehta 2
1, 2 (Department of Computer Science and Engineering, Chitkara University, India)

Abstract: In the last few years there has been a voluminous increase in the storage and processing of data, which demands both high processing speed and large storage space. Big data is defined as large, diverse and complex data sets that raise problems of storage, analysis and visualization when processing results. Four characteristics of big data (Volume, Value, Variety and Velocity) make it difficult for traditional systems to process. Apache Hadoop is a promising software framework for developing applications that process huge amounts of data in parallel on large clusters of commodity hardware in a fault-tolerant and reliable manner. Performance metrics such as reliability, fault tolerance, accuracy, confidentiality and security improve with the use of Hadoop. Hadoop MapReduce is an effective computation model for processing large data on distributed data clusters such as clouds. We first introduce the general idea of big data and then review related technologies, such as cloud computing and Hadoop. Various clustering techniques are also analyzed based on parameters such as number of clusters, size of clusters, type of dataset and noise.

Keywords: Big data, Cloud Computing, Clusters, Hadoop, HDFS, MapReduce

I. Introduction
Data Mining is the process of analyzing data from different aspects and summarizing it into valuable information. It works by discovering correlations and patterns among large amounts of data in a database.
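The MapReduce computation model mentioned in the abstract can be illustrated with a minimal word-count sketch in plain Python. This only simulates the map, shuffle and reduce phases in a single process; real Hadoop jobs implement the `Mapper`/`Reducer` interfaces in Java and read input splits from HDFS, so all function names below are our own illustrative choices.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input record.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's list of values into a single count.
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data needs hadoop", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'stores': 1}
```

The appeal of the model is that the map and reduce steps are independent per record and per key, which is what lets Hadoop spread them across a cluster of commodity machines.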
Major data mining techniques include classification, regression and clustering. Clustering is the process of assigning data into groups, called clusters, such that objects in the same cluster are more similar to each other than to objects in other clusters. Clustering is a main task of data mining and a common technique for statistical data analysis used in many fields, including pattern recognition, image analysis and information retrieval.
Nowadays big data is growing rapidly, especially data related to internet companies. For example, Google and Facebook process petabytes of data within a month. Figure 1 shows the rapid increase in data volume. The drastic growth of datasets raises various challenges. Due to the advancement of information technology, huge amounts of data can be generated; on average, 72 hours of video are uploaded to YouTube every minute [1]. This creates the problem of collecting and integrating tremendous amounts of data from widespread sources. The accelerated advancement of cloud computing and the Internet of Things further contributes to the rapid increase of data, and this growth in volume and variety puts great strain on existing computing capacity. Storing and managing such large, complex datasets challenges both hardware and software infrastructure. The mining of such complex, heterogeneous and voluminous data at the levels of analysis, perception and anticipation must be done carefully to reveal its intrinsic properties and support good decision making [2].

II. Challenges Of Big Data
2.1 Confidentiality of Data
Various big data service providers use different tools to process and analyze data, which introduces security risks. For example, a transactional dataset generally includes a complete set of operating data used to execute important business processes. Such data consists of detailed, low-level records and sensitive information such as credit card numbers.
Therefore, preventive measures must be taken to protect such sensitive data and ensure its safety.

2.2 Representation of Data
Datasets are heterogeneous in nature. Data representation aims to make data meaningful for computer analysis and user comprehension; improper representation hampers data analysis. Efficient data representation reflects the data's structure, classes and technologies, enabling useful operations across different datasets.

2.3 Management of Data Life Cycle
The storage systems in normal use cannot support such huge quantities of data. Data freshness is a key component of the value hidden in big data. Therefore, a principle based on analytical value should be developed to decide which data should be stored and which discarded.
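As a concrete reference point for the clustering techniques this review compares (which are typically parameterized by the number of clusters, cluster size, dataset type and noise), the following is a minimal k-means sketch in plain Python on one-dimensional data. It is illustrative only and is not drawn from Hadoop or any specific clustering library; parameter names such as `k` and `n_iters` are our own.

```python
import random

def kmeans(points, k, n_iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct initial centroids
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if a cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 1) for c in centroids))  # [1.0, 9.0]
```

The number of clusters `k` must be fixed in advance and the result is sensitive to noise and initialization, which is exactly why surveys compare clustering algorithms along the parameters listed above.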