A Review of Big Data and Anonymization Algorithms
¹U. Selvi, ²Dr. S. Pushpa
¹Research Scholar, Department of Computer Science and Engg, St. Peter's University, Avadi, Chennai
²Professor, Department of Computer Science and Engg, St. Peter's University, Avadi, Chennai
¹slvunnikrishnan@gmail.com, ²pushpasangar96@gmail.com
Abstract
Over the past twenty years, data has grown on a massive
scale in various fields. In 2010, Apache Hadoop defined
Big Data as "datasets that could not be captured,
managed, and processed by general computers within a
tolerable time". This paper begins with the definition,
background knowledge, and challenges of Big Data. It
then describes the relation of Big Data to other related
technologies, such as Cloud computing, the Internet of
Things, data centers, and Hadoop. A Big Data system
can be decomposed into four phases, namely the
generation, acquisition, storage, and analysis of massive
data, and this paper explains each of these phases.
Finally, this paper examines the security issues in Big
Data and compares various anonymization algorithms.
These discussions aim to provide a comprehensive
summary of Big Data and its security.
Keywords: Big Data · Anonymization · Cloud
Computing · Hadoop · Privacy Preservation
I. DEFINITION AND FEATURES OF BIG
DATA
Big Data can be defined as data sets that grow so
large, from several terabytes (TB) to zettabytes (ZB),
that they cannot be managed by traditional database
management tools; such data is difficult to capture,
store, search, share, analyze, and visualize. Big Data
includes unstructured data and provides opportunities
for discovering new values and understanding
previously unknown values, while posing new
challenges in effectively organizing and managing such
datasets.
Nowadays, big data related to the services of
Internet companies grows rapidly. With the rapid
growth of Cloud computing and the Internet of Things
(IoT), big data also brings many challenging problems
that need solutions. The ever-growing data raise the
problem of how to store and manage such huge,
heterogeneous datasets with moderate requirements on
hardware and software infrastructure, which leads to
the challenge of collecting and integrating enormous
amounts of data. Mining such datasets can reveal their
intrinsic properties and thereby support decision making.
Big data refers to datasets that cannot be
acquired, stored, and managed by classic database
software. Big Data can be characterized by the 4V
model [6], i.e., Volume, Velocity, Variety, and
Value. In the 4V model, Volume denotes the
generation and collection of large amounts of data;
Velocity denotes the timeliness of data collection
and analysis; Variety denotes the different types of
data, such as structured, semi-structured, and
unstructured data; and Value denotes the hidden
information within the data.
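Three of the four Vs can be measured directly on a batch of incoming records. The following Python sketch (the record format, field names, and sample payloads are hypothetical, chosen only for illustration) shows one way to profile Volume, Velocity, and Variety; Value, being hidden information, cannot be read off mechanically in the same way.

```python
from datetime import datetime

# Hypothetical records: one structured (dict), one semi-structured (XML text),
# one unstructured (raw bytes). Field names are illustrative only.
records = [
    {"ts": datetime(2015, 1, 1, 12, 0, 0), "data": {"id": 1, "name": "a"}},
    {"ts": datetime(2015, 1, 1, 12, 0, 1), "data": "<log level='warn'/>"},
    {"ts": datetime(2015, 1, 1, 12, 0, 4), "data": b"\x89PNG..."},
]

def profile_4v(records):
    """Summarize a batch along the measurable dimensions of the 4V model."""
    volume = len(records)                                  # Volume: how much data
    span = (records[-1]["ts"] - records[0]["ts"]).total_seconds()
    velocity = volume / span if span else float("inf")     # Velocity: arrival rate
    variety = {type(r["data"]).__name__ for r in records}  # Variety: payload types
    return {"volume": volume, "velocity": velocity, "variety": variety}

print(profile_4v(records))
```

In this toy batch, 3 records arrive over 4 seconds, so the measured velocity is 0.75 records per second, and the variety set contains three distinct payload types.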
1.1. Challenges of Big Data
The increase in data size raises technical
issues in the acquisition, storage, management, and
analysis of Big Data. Relational DBMSs cannot
handle the huge volume and heterogeneity of big data.
Hence, for the persistent storage and management of
large-scale, unordered datasets, distributed file
systems and NoSQL [3] databases are better choices
than traditional database systems.
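The advantage of NoSQL systems for heterogeneous data can be illustrated with a minimal sketch. The toy in-memory `DocumentStore` below is not a real database client; it only demonstrates the schema-free storage style that document databases use, where records with entirely different fields can live in the same store, which a fixed relational schema cannot accommodate without constant restructuring.

```python
# Toy sketch of a schema-free (NoSQL-style) document store.
# Class and method names are illustrative, not any real library's API.

class DocumentStore:
    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def insert(self, doc: dict) -> int:
        """Store any dict; no schema is enforced up front."""
        self._next_id += 1
        self._docs[self._next_id] = doc
        return self._next_id

    def find(self, **criteria):
        """Return docs whose fields match all criteria; missing fields never match."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
store.insert({"type": "tweet", "user": "alice", "text": "hello"})
store.insert({"type": "sensor", "device": 42, "temp_c": 21.5})  # different fields, same store
print(store.find(type="sensor"))
```

A relational table would force both record shapes into one column set (or separate tables plus joins); the document model simply stores each record as it arrives.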
Some of the obstacles [4-5] in the development of
Big Data applications are:
- Data representation: makes data more meaningful
for analysis and user interpretation.
- Redundancy reduction and data compression:
reduces the indirect cost of the entire system
without affecting the potential value of the data.
- Data life cycle management: decides which data
shall be stored and which shall be discarded, since
current storage systems cannot keep pace with
such massive data.
- Data confidentiality: Big Data service providers or
owners at present cannot effectively maintain and
analyze such huge datasets with their limited
capacity. They must rely on tools or third parties
to ensure safety.
- Energy management: since electrical energy
consumption grows with the volume of data
processed, stored, and transmitted, a system-level
power consumption control and management
mechanism is needed for big data.
- Extendability and scalability: the analytical
systems and algorithms of big data must be able to
process increasingly complex and expanding datasets.
- Cooperation: a Big Data network architecture must
be established to help scientists and engineers in
various fields access different kinds of data and
cooperate, each applying their own expertise, to
accomplish the analysis objectives.
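The redundancy-reduction obstacle above can be seen in miniature with lossless compression: repeated patterns are removed from the stored form, shrinking system cost, while the original data (and hence its potential value) remains fully recoverable. A short sketch using Python's standard `gzip` module, on deliberately redundant log-style data:

```python
import gzip

# Highly redundant log-style data: the same 20-byte line repeated 1000 times.
raw = b"sensor=42 temp=21.5\n" * 1000

# Lossless compression removes the repetition from the stored form...
packed = gzip.compress(raw)
ratio = len(packed) / len(raw)
print(f"{len(raw)} -> {len(packed)} bytes (ratio {ratio:.3f})")

# ...while the original data, and hence its value, is fully recoverable.
assert gzip.decompress(packed) == raw
```

Real big-data systems apply the same principle at scale (e.g., columnar storage formats compress each column's repeated values), but the trade-off is identical: less storage and transmission cost, no loss of information.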
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 17 (2015)
© Research India Publications ::: http://www.ripublication.com