International Journal of Communication Networks and Information Security (IJCNIS), Vol. 12, No. 1, April 2020

Articulation Point Based Quasi Identifier Detection for Privacy Preserving in Distributed Environment

Ila Chandrakar 1, Vishwanath R Hulipalled 2

1 Presidency University & Research Scholar, REVA University, Bangalore, India
2 School of C&IT, REVA University, Bangalore, India

Abstract: Storing today's very large data volumes on an organization's own premises requires high-end resources, so IT organizations depend on the cloud for additional resources. Because the cloud is a third party, the security of the stored information cannot be guaranteed, and the data might be misused. Privacy must therefore be applied to data before it is shared with the cloud. Many researchers have proposed methods that attempt to discover explicit identifiers and sensitive data before the data is distributed. However, quasi-identifiers are attributes that can reveal explicit identifiers when combined with background knowledge. Researchers have proposed strategies for finding quasi-identifiers so that these attributes can also be considered when implementing privacy, but these techniques suffer from drawbacks such as high time consumption and reduced data utility. The proposed work overcomes these drawbacks by extracting the minimum required set of quasi attributes with minimum time complexity.

Keywords: Articulation point, Privacy Preserving, Quasi Identifier

1. Introduction

The rapid development of information technology has given rise to many social media websites that use our personal data for their own purposes. These websites may not be trustworthy and may leak users' data. Similarly, digitization has caused a huge increase in data size in all organizations, which has become a major problem for data owners. They cannot afford the resources required and use the cloud for the extra resources they need. Since the cloud is a third party, it can be curious and try to leak data.
Also, organizations publish data for research purposes, so privacy must be applied to the data before it is published. In past years, many incidents of data leakage from organizations have occurred. In 2019, the personal contents of many Microsoft Office 365 email accounts were exposed. In 2018, another huge data leak occurred at Facebook, in which the personal details of 50 million user accounts were exposed and another 40 million accounts were exposed through them. In 2013, nearly all Yahoo users were affected when their usernames and passwords were leaked by Russian hackers. In 2014, data such as the name, contact details and passport number of 500 million customers of Marriott International were leaked, and there are many more cases in which huge amounts of personal data were leaked.

Personal data privacy is very important for any individual. It can be achieved by hiding personal sensitive information before the data is published. The attributes in a data set can be categorized into three types. The first type is explicit identifiers, such as name or social security number, which directly identify the subject of interest and his or her details. The second type is sensitive attributes, which do not by themselves identify a person but hold private information that may cause harm if leaked. For example, the salary of an employee or the type of disease of a person in a health data set is sensitive information. The third type is quasi identifier attributes, which are subsets or combinations of attributes that together can identify a person. For example, gender, zip code and date of birth together can identify most of the population of the USA.
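The gender, zip code and date of birth example can be illustrated with a minimal sketch. It assumes a hypothetical toy table: the records, the attribute names and the helper `unique_fraction` are all illustrative and are not taken from this work. The code measures what fraction of records becomes unique as attributes are combined. It is not the articulation point method proposed in this paper.

```python
from collections import Counter

# Hypothetical toy records; every value is made up for illustration.
RECORDS = [
    {"gender": "F", "zip": "560001", "dob": "1990-01-15", "disease": "flu"},
    {"gender": "F", "zip": "560001", "dob": "1992-03-09", "disease": "asthma"},
    {"gender": "M", "zip": "560001", "dob": "1985-07-02", "disease": "diabetes"},
    {"gender": "M", "zip": "560034", "dob": "1985-07-02", "disease": "flu"},
]


def unique_fraction(records, attributes):
    """Return the fraction of records whose values on `attributes` occur exactly once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


if __name__ == "__main__":
    for attrs in (["gender"], ["zip"], ["gender", "zip"], ["gender", "zip", "dob"]):
        print(attrs, unique_fraction(RECORDS, attrs))
```

On this toy table, gender alone isolates no record, while the three attributes together isolate every record, which is why such a combination behaves as a quasi identifier even though no single attribute is an explicit identifier.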
Earlier privacy preserving techniques were concerned with hiding explicit identifiers and sensitive attributes in the data set, but the data remains vulnerable because quasi identifiers can be used to extract personal information about an individual through linking attacks. These quasi identifiers can easily leak the information held by explicit identifiers. Many researchers have used k-anonymity to counter linking, that is, the link between explicit identifiers and quasi identifiers [36] [37]. In the k-anonymity method, generalization is applied to quasi identifiers to convert them into a more general form. This provides privacy for quasi attributes, but only to some extent. The problem is how to find the optimum number of quasi attributes in a data set. If too many quasi attributes are identified and privacy techniques such as k-anonymity are applied to this large set of attributes (which is not the exact set), the data utility of the data set decreases. If the number is too small, privacy leaks occur. In many research works, the quasi attributes are chosen by experts based on their personal experience, but finding quasi attributes in this way is not very accurate. The objective of this work is to find the minimum or optimum number of quasi attributes in the data set in optimal time and complexity. This improves the performance of implementing privacy, since only the optimal number of quasi attributes is processed.

2. Related Work

Owing to advancements in Web 2.0 technologies, users' data are openly accessible on social platforms [28]. The accessed data are misused by third parties for commercial purposes [1]. The k-anonymity and l-diversity schemes do optimize the published data, but only partially. Hence, this issue has to be addressed [2–4]. The first research on providing privacy in data mining was limited to implementing privacy on centralized data. Later on, it was implemented for distributed data.
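To make the k-anonymity and generalization ideas mentioned in this paper concrete, the following is a minimal sketch under stated assumptions. The toy records, the generalization rules (ten-year age bands and three-digit zip prefixes) and the helper names `generalize` and `is_k_anonymous` are hypothetical and are not taken from the cited works. A table is treated as k-anonymous when every combination of quasi identifier values appears in at least k records.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi identifier values (hypothetical generalization rules)."""
    decade = (record["age"] // 10) * 10
    return {
        "age": f"{decade}-{decade + 9}",       # e.g. 23 -> "20-29"
        "zip": record["zip"][:3] + "***",      # keep only a 3-digit prefix
        "disease": record["disease"],          # sensitive attribute left as-is
    }


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi identifier combination occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


# Hypothetical toy data; every value is made up for illustration.
RAW = [
    {"age": 23, "zip": "560001", "disease": "flu"},
    {"age": 27, "zip": "560034", "disease": "asthma"},
    {"age": 41, "zip": "110011", "disease": "diabetes"},
    {"age": 45, "zip": "110092", "disease": "flu"},
]
QI = ["age", "zip"]
GENERALIZED = [generalize(r) for r in RAW]

if __name__ == "__main__":
    print(is_k_anonymous(RAW, QI, 2))          # raw table
    print(is_k_anonymous(GENERALIZED, QI, 2))  # after generalization
```

In this toy case the raw table is not 2-anonymous, while the generalized table is. The cost is reduced precision in `age` and `zip`, and that cost grows with every additional attribute that is generalized, which is why the number of attributes treated as quasi identifiers matters for data utility.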
Many data distortion techniques, such as perturbation, adding noise to the data to change the original data, generalization, k-anonymity and l-diversity, were used [28] [29] [38]. Other researchers used cryptographic techniques to provide privacy like