A Review of Big Data and Anonymization Algorithms

1 U. Selvi, 2 Dr. S. Pushpa
1 Research Scholar, Department of Computer Science and Engg, St.Peter's University, Avadi, Chennai
2 Professor, Department of Computer Science and Engg, St.Peter's University, Avadi, Chennai
1 slvunnikrishnan@gmail.com, 2 pushpasangar96@gmail.com

Abstract

Over the past twenty years, data has grown at a massive scale in various fields. In 2010, Apache Hadoop defined Big Data as "datasets that could not be captured, managed, and processed by general computers within a tolerable time". This paper begins with the definition, background, and challenges of Big Data. It then shows the relation of Big Data to other technologies, such as Cloud computing, the Internet of Things, data centers, and Hadoop. A Big Data system can be decomposed into four phases, namely the generation, acquisition, storage, and analysis of massive data, and this paper explains each phase. Finally, this paper examines the security issues in Big Data and compares various anonymization algorithms. These discussions aim to provide a comprehensive overview of Big Data and its security.

Keywords: Big Data . Anonymization . Cloud Computing . Hadoop . Privacy Preservation

I. DEFINITION AND FEATURES OF BIG DATA

Big Data can be defined as data sets that grow so large, from several TB to ZB, that they cannot be managed by traditional database management tools; such data are difficult to capture, store, search, share, analyze, and visualize. Big Data includes unstructured data, provides opportunities for discovering new values and understanding previously unknown values, and poses new challenges in effectively organizing and managing such datasets. Nowadays, Big Data related to the services of Internet companies grows rapidly. With the rapid growth of Cloud computing and the Internet of Things (IoT), Big Data also brings about many challenging problems that need solutions.
The increasing growth of data raises the problem of how to store and manage such huge, heterogeneous datasets with moderate requirements on hardware and software infrastructure. This leads to the challenge of collecting and integrating enormous amounts of data. Mining such datasets can reveal their intrinsic properties and help in decision making. Big Data shall mean datasets that cannot be acquired, stored, and managed by classic database software.

Big Data can be defined by the 4V model [6], i.e., Volume, Velocity, Variety, and Value. In the 4V model, Volume refers to the generation and collection of large amounts of data; Velocity refers to the timeliness of data collection and analysis; Variety refers to the different types of data, such as structured, semi-structured, and unstructured data; and Value refers to the hidden information within the data.

1.1. Challenges of Big Data

The increase in data size raises technical issues in the acquisition, storage, management, and analysis of Big Data. Traditional relational DBMSs cannot handle the huge volume and heterogeneity of Big Data. Hence, for the permanent storage and management of large-scale disordered datasets, distributed file systems and NoSQL [3] databases are better choices than traditional database systems. The following are some of the obstacles [4-5] in the development of Big Data applications.

- Data representation: make data more meaningful for analysis and user interpretation.
- Redundancy reduction and data compression: reduce the indirect cost of the entire system without affecting the potential value of the data.
- Data life cycle management: decide which data shall be stored and which data shall be discarded, since current storage systems cannot support such massive data.
- Data confidentiality: at present, Big Data service providers and owners cannot effectively maintain and analyze such huge datasets with their limited capacity; they must rely on tools or third parties to ensure safety.
- Energy management: as data volume and the demands of processing, storage, and transmission grow, so does electrical energy consumption; a system-level power consumption control and management mechanism is therefore needed for Big Data.
- Expandability and scalability: the analytical systems and algorithms of Big Data must be able to process increasingly complex and expanding datasets.
- Cooperation: a Big Data network architecture must be established to help scientists and engineers in various fields access different kinds of data and

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 17 (2015) © Research India Publications ::: http://www.ripublication.com 13125
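The data confidentiality challenge above is what anonymization algorithms address, and a common baseline among those compared later is k-anonymity: every combination of quasi-identifier values must be shared by at least k records. The following is a minimal sketch of that idea, assuming a toy table of records; the column names (`age`, `zip`, `disease`) and the simple age-banding generalization rule are illustrative, not taken from any specific anonymization algorithm or library.

```python
# Minimal sketch of a k-anonymity check with one generalization step.
# The records and the 10-year age-banding rule are illustrative assumptions.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs in >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(record, width=10):
    """Generalize an exact age into a range, e.g. 34 -> '30-39'."""
    lo = (record["age"] // width) * width
    out = dict(record)
    out["age"] = f"{lo}-{lo + width - 1}"
    return out

records = [
    {"age": 31, "zip": "60012", "disease": "flu"},
    {"age": 34, "zip": "60012", "disease": "cold"},
    {"age": 38, "zip": "60012", "disease": "flu"},
]

# The raw ages are all distinct, so the table is not 2-anonymous ...
print(is_k_anonymous(records, ["age", "zip"], 2))      # False
# ... but after generalizing age into 10-year bands, all three records
# share the quasi-identifier tuple ('30-39', '60012'), so it is.
generalized = [generalize_age(r) for r in records]
print(is_k_anonymous(generalized, ["age", "zip"], 2))  # True
```

Real anonymization algorithms differ mainly in how they search for the least-general transformation that still satisfies the k-anonymity constraint, trading information loss against privacy; this sketch shows only the constraint being checked before and after one generalization step.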