De-Identification of Textual Data using Immune System for Privacy
Preserving in Big Data
Amine Rahmani
GeCoDe Laboratory, Department of Informatics Sciences,
Dr. Tahar Moulay University of Saida
aminerahmani2091@gmail.com
Abdelmalek Amine
GeCoDe Laboratory, Department of Informatics Sciences,
Dr. Tahar Moulay University of Saida
amine_abd1@yahoo.fr
Mohamed Reda Hamou
GeCoDe Laboratory, Department of Informatics Sciences,
Dr. Tahar Moulay University of Saida
hamoureda@yahoo.fr
Abstract—With the growing success of big data, many challenges
have appeared. Timeliness, scalability, and privacy are the main
problems that researchers attempt to solve. Privacy preservation
is now a highly active research domain, and many works and
concepts have emerged within this theme. One of these concepts
is de-identification. De-identification is a specific area that
consists of finding sensitive information and removing it,
either by replacing it, encrypting it, or adding noise to it,
using techniques drawn from fields such as cryptography and data
mining. In this paper, we present a new model for the
de-identification of textual data using a specific immune system
algorithm known as CLONALG.
Keywords—de-identification, privacy preserving, big
data, immune systems, CLONALG
I. INTRODUCTION
One of the advantages of big data services is the ability to
share and publish data over the network. These data can be
sorted into two major categories: ordinary data, such as books
and other textual documents, and sensitive information, such as
names, medical records, and social information in general. The
latter requires a high level of protection because, once linked
together, it forms a total or partial profile of its owner,
which can lead to identifying that person even if the data
contain no explicit identifiers. The aggregation of this
information can constitute an identity as unique as a
fingerprint. In addition, once data are stored on the web, they
become accessible to and processable by a third party and,
therefore, by other people who share the same resources, which
makes privacy an essential goal to ensure. This gave birth to a
new domain known as Privacy Preserving Data Publishing (PPDP),
which offers a set of methods and techniques for protecting
users' privacy. Much work has been done in this area, and many
approaches have been published and used. These approaches can be
divided into three essential groups:
• Heuristic-based approaches, in which data mining
algorithms are used to adaptively modify selected
data. These rest on the fact that selective data
modification is an NP-hard problem, so this group of
methods is aimed at complex problems.
• Cryptography-based approaches, represented by secure
multiparty computation, in which privacy is guaranteed
by a probabilistic function so that, at the end of a
multiparty computation, no party knows anything except
its own input and the final result of the computation.
• Perturbation and reconstruction of data, in which the
proposed approaches protect data by randomly
reconstructing the distribution of the data at an
aggregate level.
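To make the secure multiparty computation idea in the second group concrete, the following is a minimal sketch of a secure sum based on additive secret sharing. It is only an illustration under stated assumptions, not a protocol taken from this paper: the names `make_shares`, `secure_sum`, and `MODULUS` are hypothetical, all parties are simulated in one process, and the parties are assumed to follow the protocol honestly.

```python
import secrets

# Hypothetical public modulus; assumed larger than any possible sum.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it mod `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_inputs, modulus=MODULUS):
    """Simulate a secure sum: each party only ever sees random-looking shares."""
    n = len(private_inputs)
    # Row i holds the shares that party i sends out; column j is what party j receives.
    distributed = [make_shares(v, n, modulus) for v in private_inputs]
    # Each party publishes only the sum of the shares it received.
    partial_sums = [sum(row[j] for row in distributed) % modulus for j in range(n)]
    # Combining the published partial sums reveals the total and nothing else.
    return sum(partial_sums) % modulus


if __name__ == "__main__":
    print(secure_sum([12, 30, 7]))  # prints 49
```

Each individual share is uniformly random, so a single party learns nothing about another party's input beyond what the final total already implies.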
One of the techniques of PPDP is de-identification, in which a
system detects and removes any information in a user's data that
could reveal that user's identity. In this work, we propose a
new approach based on the immune system that preserves privacy
by detecting and modifying the information leading to users'
identities. The rest of the paper starts with a presentation of
basic concepts such as PPDP and its techniques, focusing on
de-identification and modification techniques. We then present
our idea and its results. Finally, we discuss the results and
conclude.
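The detect-and-replace principle behind de-identification can be sketched in a few lines. The example below is a deliberately simple, rule-based illustration and not the immune-system (CLONALG) approach proposed in this paper: the regular expressions in `PATTERNS` and the function name `deidentify` are hypothetical and cover only e-mail addresses and one phone number format.

```python
import re

# Hypothetical detection rules, for illustration only; real identifiers
# (names, addresses, dates, ...) need far richer detection than this.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def deidentify(text):
    """Replace every detected identifier with a placeholder naming its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(deidentify("Contact john.doe@example.com or 555-123-4567."))
    # prints: Contact [EMAIL] or [PHONE].
```

Replacement with a typed placeholder is only one of the modification options mentioned above; encrypting the matched span or perturbing it with noise would slot into the same detect-then-modify loop.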
II. BASIC CONCEPTS
A. Privacy preserving data publishing
A data publisher is typically a data collector that gathers
data from different sources and then passes it to a data miner
or publishes it to the public, which can include an attacker.
Fig. 1 shows the view of a data publisher given by (Fung, Wang,
Chen & Yu, 10).
2015 IEEE International Conference on Computational Intelligence & Communication Technology
978-1-4799-6023-1/15 $31.00 © 2015 IEEE
DOI 10.1109/CICT.2015.146