The Digital Aggregated Self: A Literature Review
Dr. Lynne Y. Williams
Professor, MSIT, School of Information Science and
Technology
Kaplan University
Los Alamos, New Mexico, USA
lwilliams4@kaplan.edu
Dr. Diane M. Neal
Assistant Professor, Faculty of Information and Media
Studies
The University of Western Ontario
London, Ontario, Canada
dneal2@uwo.ca
Abstract—As the Internet rapidly establishes itself as a
major communications conduit, growing concern exists
about personal privacy issues and the related ownership
of personal data. Privacy and personal data may be
vulnerable to exposure by unauthorized individuals, by
commercial entities wishing to profit from the data, and
even by the individual to whom the data pertains.
Although fragments of data may not present a privacy
issue on their own, data mining and other aggregation
methods quickly assemble data to create a considerably
more sensitive “whole.” This article presents an
examination of aggregated personal data ownership, or
“the digital aggregated self,” using a literature review and
an ethical argument. We propose that while server owners
may possess the disaggregated user data stored on their
servers, individuals should hold the rights to their set of
aggregated data that is stored throughout the entire
network of online servers.
I. INTRODUCTION
A. The state of digital personal data
Fewer and fewer people today live a life without
Internet contact. The ease of online shopping has made
purchasing even the most uncommon items as simple as
clicking a mouse button. Internet users can manage
their bank accounts and credit cards online. They have
access to an unprecedented variety of music, movies,
television programs, and multimedia content of all
kinds. In the course of shopping, managing accounts, or
viewing movies, these millions of users leave a trail of
personal data crumbs behind them, stored in bits and
pieces on every server with which their transactions
have exchanged information.
Given that most of these commercial
servers are generally well secured, does the user have
any reason to be worried? The pieces of information
stored on a single server may be innocuous on their
own, but significant cause for concern arises when all
those pieces are combined using modern data mining
techniques.
In 2007, a pair of researchers at The University of
Texas at Austin revealed the vulnerability of private
data, even data collected in a supposedly anonymous
fashion [1]. Greengard stated that this research
“proved that it was possible to identify individuals
among a half-million participants by using public
reviews published in the Internet Movie Database
(IMDb) to identify movie ratings within Netflix’s data.
In fact, eight ratings along with dates were enough to
provide 99% accuracy, according to the researchers” (p.
17).
The aggregation technique used by the University
of Texas researchers is known as a linkage attack. A
linkage attack combines individually non-sensitive data
with data from other records until a highly accurate
aggregated profile of the user is assembled. Linkage
attacks are not a new phenomenon. Barbaro and Zeller
[2] reported the consequences of releasing half a
million AOL subscribers’ search records. Collating the
searches made it possible to accurately re-assemble
subscribers’ identities, including names, addresses, and
Social Security Numbers.
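The mechanics of a linkage attack can be sketched in a few lines of Python: two individually innocuous datasets are joined on shared quasi-identifiers to attach names to "anonymous" records. The records, field names, and quasi-identifiers below are fabricated for illustration and are not drawn from the Netflix or AOL datasets discussed above.

```python
# Minimal sketch of a linkage attack: joining an "anonymized" release
# with a public record on shared quasi-identifiers re-identifies people.
# All data here is fabricated for illustration.

# An "anonymized" release: names removed, but quasi-identifiers remain.
anonymized = [
    {"zip": "87544", "birth_date": "1970-03-12", "diagnosis": "asthma"},
    {"zip": "87545", "birth_date": "1982-11-02", "diagnosis": "diabetes"},
]

# A hypothetical public record (e.g., a voter roll) pairing the same
# quasi-identifiers with names.
public = [
    {"name": "A. Smith", "zip": "87544", "birth_date": "1970-03-12"},
    {"name": "B. Jones", "zip": "87545", "birth_date": "1982-11-02"},
]

def linkage_attack(anon, pub):
    """Join the two datasets on (zip, birth_date), attaching a name to
    each anonymized record that matches a public record."""
    index = {(p["zip"], p["birth_date"]): p["name"] for p in pub}
    return [
        {"name": index[(a["zip"], a["birth_date"])], **a}
        for a in anon
        if (a["zip"], a["birth_date"]) in index
    ]

for record in linkage_attack(anonymized, public):
    print(record["name"], "->", record["diagnosis"])
```

Real attacks operate on far noisier data, but the principle is the same: each dataset is harmless in isolation, and the join is what produces the sensitive "whole."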
Supposedly legitimate data mining presents no
less of a threat to an individual’s privacy. However,
some attempts have been made to protect an
individual’s aggregated digital self from data mining
exposure. In the United States, this type of safeguard
usually falls within the scope of various governmental
security standards such as HIPAA [3], which protects
the privacy of Americans’ personally identifying health
information. Unfortunately, these types of standards do
not exist in many contexts.
Privacy research also offers an array of possible
safeguards. For example, Dwork [4], a principal
researcher for Microsoft®, advocated a privacy
protection method she referred to as differential
privacy. Differential privacy does not focus on
attempting to prevent all personally identifying
information from inclusion in various online databases,
which is the more common approach. Instead,
differential privacy seeks to preserve privacy by
altering the methods by which data is analyzed.
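The intuition behind altering the analysis rather than the data can be illustrated with the Laplace mechanism, the canonical way of answering a numeric query under differential privacy: noise calibrated to the query's sensitivity is added to the answer, so the output reveals little about any single record. This is a minimal sketch of that standard mechanism, not code from Dwork's work; the records, predicate, and epsilon value are illustrative.

```python
import math
import random

def laplace_sample(scale):
    """Sample from a Laplace(0, scale) distribution via the
    inverse-CDF transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A count changes by at most 1 when one record is added or removed
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Illustrative use: a noisy count over fabricated records.
records = [
    {"diagnosis": "asthma"},
    {"diagnosis": "diabetes"},
    {"diagnosis": "asthma"},
]
noisy = private_count(records, lambda r: r["diagnosis"] == "asthma",
                      epsilon=0.5)
```

Smaller epsilon values add more noise and give stronger privacy; the analyst receives an approximate answer, while no individual record can be confidently inferred from the output.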
According to Dwork, achievement of “absolute
disclosure prevention” is impossible in the practical
sense. This is primarily due to the existence of
auxiliary information, which is information residing in
different locations. For instance, information residing
in a hospital database enjoys the protection of
2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 978-0-7695-4810-4/12 $26.00 © 2012 IEEE, DOI 10.1109/CyberC.2012.36