The Digital Aggregated Self: A Literature Review

Dr. Lynne Y. Williams
Professor, MSIT, School of Information Science and Technology
Kaplan University, Los Alamos, New Mexico, USA
lwilliams4@kaplan.edu

Dr. Diane M. Neal
Assistant Professor, Faculty of Information and Media Studies
The University of Western Ontario, London, Ontario, Canada
dneal2@uwo.ca

Abstract—As the Internet rapidly establishes itself as a major communications conduit, growing concern exists about personal privacy issues and the related ownership of personal data. Privacy and personal data may be vulnerable to exposure by unauthorized individuals, by commercial entities wishing to profit from the data, and even by the individual to whom the data pertains. Although fragments of data may not present a privacy issue on their own, data mining and other aggregation methods quickly assemble data to create a considerably more sensitive "whole." This article presents an examination of aggregated personal data ownership, or "the digital aggregated self," using a literature review and an ethical argument. We propose that while server owners may possess the disaggregated user data stored on their servers, individuals should hold the rights to their set of aggregated data that is stored throughout the entire network of online servers.

I. INTRODUCTION

A. The state of digital personal data

Fewer and fewer people today live a life without Internet contact. The ease of online shopping has made purchasing even the most uncommon items as simple as clicking a mouse button. Internet users can manage their bank accounts and credit cards online. They have access to an unprecedented variety of music, movies, television programs, and multimedia content of all kinds. In the course of shopping, managing accounts, or viewing movies, these millions of users leave a trail of personal data crumbs behind them, stored in bits and pieces on every server with which their transactions have exchanged information.
Given that the majority of these commercial servers are generally well secured, does the user have any reason to be worried? The pieces of information stored on a single server may not be a cause for concern, but significant cause for concern may arise when all those pieces are combined using modern data mining techniques.

In 2007, a pair of researchers at The University of Texas at Austin revealed the vulnerability of private data, even data collected in a supposedly anonymous fashion [1]. Greengard stated that this research "proved that it was possible to identify individuals among a half-million participants by using public reviews published in the Internet Movie Database (IMDb) to identify movie ratings within Netflix's data. In fact, eight ratings along with dates were enough to provide 99% accuracy, according to the researchers" (p. 17).

The aggregation technique used by the University of Texas researchers is known as a linkage attack. A linkage attack combines individually non-sensitive data with data from other records until a highly accurate aggregated profile of the user is assembled. Linkage attacks are not a new phenomenon. Barbaro and Zeller [2] reported the consequences of releasing half a million AOL subscribers' search records. Collating the searches made it possible to accurately re-assemble identities, including names, addresses, and Social Security Numbers.

Supposedly legitimate data mining presents no less of a threat to an individual's privacy. However, some attempts have been made to protect an individual's aggregated digital self from data mining exposure. In the United States, this type of safeguard usually falls within the scope of various governmental security standards such as HIPAA [3], which protects the privacy of Americans' personally identifying health information. Unfortunately, these types of standards do not exist in many contexts. Privacy research also offers an array of possible safeguards.
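The mechanics of a linkage attack can be illustrated with a minimal sketch. The records and names below are invented for illustration and are not drawn from the Netflix or IMDb datasets; the point is only that joining an "anonymized" release to a public source on shared quasi-identifiers (here, a movie title and a rating date) is enough to re-identify pseudonyms.

```python
# Toy linkage attack: join an "anonymized" release to an auxiliary
# public source on shared quasi-identifiers. All data is invented.

anonymized = [  # pseudonymous release: no names, but quasi-identifiers remain
    {"pseudonym": "user_831", "movie": "Brazil", "date": "2006-03-14"},
    {"pseudonym": "user_831", "movie": "Vertigo", "date": "2006-04-02"},
    {"pseudonym": "user_492", "movie": "Heat", "date": "2006-05-19"},
]

public_reviews = [  # auxiliary source: the same quasi-identifiers with real names
    {"name": "Alice", "movie": "Brazil", "date": "2006-03-14"},
    {"name": "Alice", "movie": "Vertigo", "date": "2006-04-02"},
    {"name": "Bob", "movie": "Heat", "date": "2006-05-19"},
]

def link(anon, public):
    """Count quasi-identifier matches between each pseudonym and each name."""
    pub_index = {(r["movie"], r["date"]): r["name"] for r in public}
    scores = {}
    for row in anon:
        name = pub_index.get((row["movie"], row["date"]))
        if name is not None:
            key = (row["pseudonym"], name)
            scores[key] = scores.get(key, 0) + 1
    return scores

matches = link(anonymized, public_reviews)
# The pseudonym sharing the most (movie, date) pairs with a named reviewer
# is the most likely identity: user_831 links to Alice via two records.
```

In the real attack the match was probabilistic and tolerant of noise in dates and ratings, but the principle is the same: each individually non-sensitive record narrows the candidate set until only one identity remains.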
For example, Dwork [4], a principal researcher for Microsoft®, advocated a privacy protection method she referred to as differential privacy. Differential privacy does not focus on attempting to prevent all personally identifying information from inclusion in various online databases, which is the more common approach. Instead, it seeks to preserve privacy by altering the methods by which data is analyzed. According to Dwork, achieving "absolute disclosure prevention" is impossible in the practical sense, primarily due to the existence of auxiliary information, which is information residing in different locations. For instance, information residing in a hospital database enjoys the protection of

2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. 978-0-7695-4810-4/12 $26.00 © 2012 IEEE. DOI 10.1109/CyberC.2012.36
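"Altering the methods by which data is analyzed" is commonly illustrated with the Laplace mechanism, the standard construction in the differential privacy literature: rather than scrubbing identifiers from the data, the analyst perturbs each query answer with calibrated noise. The sketch below is ours, not taken from Dwork's paper; the function and parameter names, and the sample records, are invented for illustration.

```python
import random

def dp_count(records, predicate, epsilon):
    """Noisy count satisfying epsilon-differential privacy (illustrative sketch).

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon suffices. The difference of two independent Exp(epsilon)
    draws is exactly Laplace(0, 1/epsilon).
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Invented example: an analyst queries a hospital-style dataset without
# ever seeing exact counts, limiting what a linkage attack can confirm.
patients = [
    {"age": 34, "diagnosis": "flu"},
    {"age": 61, "diagnosis": "flu"},
    {"age": 45, "diagnosis": "cold"},
]
noisy_flu_count = dp_count(patients, lambda r: r["diagnosis"] == "flu", epsilon=0.5)
```

Smaller values of epsilon add more noise and thus give a stronger privacy guarantee at the cost of accuracy; the guarantee holds regardless of what auxiliary information an attacker already possesses, which is precisely the failure mode of the anonymization schemes discussed above.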