International Journal of Computer Applications (0975 – 8887), Volume 100, No. 18, August 2014

Automatic Declassification of Textual Documents by Generalizing Sensitive Terms

Veena Vasudevan
PG Scholar, CSE
T.K.M College of Engineering
Kerala, India

Ansamma John
Associate Professor, CSE
T.K.M College of Engineering
Kerala, India

ABSTRACT
With the advent of the internet, large numbers of text documents are published and shared every day. Each of these documents contains a vast amount of information. Publicly sharing some of this information may violate privacy if it is confidential. So, before a document is published, sanitization operations are performed on it to preserve privacy while retaining the utility of the document. Various schemes have been developed to solve this problem, but most of them are domain specific and do not consider the presence of semantically correlated terms. This paper presents a general sanitization method that discovers sensitive information based on the concept of information content. The proposed method removes confidential information from a text document by first finding the independent sensitive terms. These terms are then used to discover correlated terms that pose a disclosure threat. Finally, with the help of a generalization algorithm, the sensitive and correlated terms with high disclosure risk are generalized.

General Terms
Text Mining, Privacy Preserving Data Publishing, Redaction, Sanitization

Keywords
Document Declassification, Generalization, Information Content, Privacy, Term Correlation, Unstructured Data, Utility

1. INTRODUCTION
The growth of information sharing applications has led to an increase in the number of documents being shared, but it also increases the risk of violating the privacy of individuals and organizations. Researchers are studying the problems associated with sharing private data and the remedies for them [1]. They have also examined the importance of anonymity and/or privacy in diverse application areas: e-voting [2], electronic health records [3], social networking [4], electronic mail [5], etc. Information that is confidential or that reveals the identity of a person or organization is considered sensitive. Before a document is shared, this sensitive information must be removed in such a way that both privacy and utility are retained. Document sanitization is the process of removing sensitive or confidential information from a document. Its main objective is to preserve privacy while at the same time retaining utility. Various schemes are used to identify and to protect sensitive data, so document sanitization is a two-step process. Earlier, sensitive information was identified and removed manually, but this proved to be time consuming, tedious and expensive, and it does not scale well as the volume of text data increases. Semi-automatic and automatic methods were therefore developed. The proposed system is an automatic document sanitization system that can sanitize all types of documents, irrespective of domain. The first stage is the identification of sensitive information. This is itself a two-step task: first, independent sensitive terms are detected [6], [7]; second, semantically correlated terms that pose a high disclosure risk are identified (a sketch of the first step is given below).
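As an illustration only, and not the authors' implementation, the following Python sketch shows one common way to realize the first of these two steps: score each term by its information content, IC(t) = -log2 p(t), where p(t) is estimated from term frequencies in a large reference corpus, and flag terms whose information content exceeds a threshold as independent sensitive terms. The reference corpus, the threshold value and all helper names are assumptions made for the example.

import math
import re
from collections import Counter

def term_frequencies(corpus_docs):
    """Count how often each term appears in a large reference corpus."""
    counts = Counter()
    for doc in corpus_docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

def information_content(term, counts, total):
    """IC(t) = -log2 p(t); rarer terms carry more information."""
    freq = counts.get(term.lower(), 0)
    if freq == 0:
        # Terms never seen in the reference corpus get the maximum score.
        return float("inf")
    return -math.log2(freq / total)

def independent_sensitive_terms(document, counts, threshold):
    """Flag terms whose information content exceeds the chosen threshold."""
    total = sum(counts.values())
    terms = set(re.findall(r"[a-z]+", document.lower()))
    return {t for t in terms
            if information_content(t, counts, total) > threshold}

# Toy usage: with such a tiny reference corpus most terms look rare, so a
# real system would estimate frequencies from Web-scale or large-corpus data.
corpus = ["the patient was admitted to the hospital",
          "the hospital treated the patient for fever"]
counts = term_frequencies(corpus)
doc = "The patient John was treated for tuberculosis at the hospital."
print(independent_sensitive_terms(doc, counts, threshold=3.0))

The design choice reflected here is that highly specific terms (proper names, rare diseases, exact locations) have low corpus probability and therefore high information content, which is what makes them candidates for sanitization; the subsequent step of finding semantically correlated terms would then examine co-occurrence between these flagged terms and the remaining terms of the document.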
The terms obtained in these two steps together form the set of sensitive terms, which is the input to the second stage. In the second stage of document sanitization, the detected sensitive terms are sanitized. In order to preserve the utility of the document, the sensitive terms are replaced by their generalized versions [8], [9], [10].

2. RELATED WORK
Early work on document sanitization focused on structured data such as relational databases. Later, the need to sanitize unstructured documents came into notice. This need is reflected in initiatives from DARPA [11] and the Consortium for Healthcare Informatics Research (CHIR) [12], which aim at building new methods and tools for the declassification of confidential documents. In structured documents the structure itself provides the key to identifying sensitive terms, but in unstructured text documents identifying sensitive information is a difficult task. Earlier it was done manually by trained experts, who removed sensitive terms from the document based on standard guidelines [13] and rules. This proved costly and time consuming and does not scale well as the volume of data increases, so semi-automatic and automatic methods were proposed. Initially, sanitization was performed on medical documents in order to hide sensitive information related to patients. Information is treated as sensitive based on the Health Insurance Portability and Accountability Act (HIPAA) of 1996. According to the Safe Harbor rules, 18 types of entities are considered sensitive, such as names, geographic locations, dates, e-mail addresses, telephone numbers, etc. Latanya Sweeney [14] proposed a system called Scrub, based on these rules, to identify the identifiable information in a patient's record. It uses detection algorithms to identify the sensitive details and replacement algorithms to replace them. Regular-expression-style templates and knowledge sources are used to detect sensitive terms. It detects almost 99-100% of personal information, but it fails to detect nicknames, additional phone numbers,