Dynamic Tessellation of Geographical Regions to Ensure K-anonymity
Hamilton Turner, Danny Guymon, Brian Dougherty, Jules White
Dept. of Electrical and Computer Engineering, Virginia Tech
Email: {hamiltont, dguymon, brianpd, julesw}@vt.edu
Abstract: Smartphone-powered data collection systems are rapidly becoming an effective method of gathering field data. One major challenge of using smartphones to collect data is that smartphone metadata, such as location at a specific time, can be linked back to the user, thereby violating that individual's privacy. A promising approach to helping ensure user privacy is geographical k-anonymity, which attempts to ensure that every gathered data reading is geographically indistinguishable from k other readings. The approach helps prevent an attacker from precisely localizing the user, or from using the user's known location to identify which reported data belongs to that user. This paper presents a dynamic tessellation algorithm for k-anonymity that provides better privacy preservation and data reporting precision than previous static algorithms for k-anonymity. The paper presents empirical results from a real-world data set that demonstrate the improvements in privacy provided by the algorithm.
Index Terms: smartphone, privacy, data collection
I. INTRODUCTION
Emerging Trends and Challenges. The number of deployed smartphone devices has grown rapidly in recent years to over 81 million units in Q3 2010 [1], [2]. Significant interest in using smartphones as distributed data collection systems has emerged because smartphones are widely deployed and have a number of computing capabilities that make them desirable for data collection systems, including: a plethora of sensors, plentiful processing and storage, a regular connection to the Internet, rapid application development through high-level programming languages, and frequent recharging by end-users.
During the recent tragedy of the Gulf Oil Spill [3], multiple smartphone developers created applications that allowed citizens to help collect and report field data, such as images and text descriptions of oil-impacted animals or environments, to researchers and scientists. The success of these 'citizen scientist' applications has raised interest in smartphone-powered data collection, in which data measurements are collected by a number of smartphones dispersed over an environment of interest. Smartphone data collection systems are currently used for a variety of applications, such as health monitoring [4], CO2 emission tracking [5], traffic accident detection [6], [7], traffic flow measurement [8], and cardiac patient monitoring [9]. Additionally, multiple middleware layers are being created to enable rapid creation and deployment of smartphone data collection applications [10]–[13]. Due to the increasing use of smartphones for data collection, multiple research efforts are attempting to quantify various aspects of using human-carried smartphones as sensors [10], [13], [14].
Open Problem ⇒ Location Data from Smartphone-powered Data Collection Systems can be Used to Invade the Personal Privacy of Users. One major challenge of using smartphones for data collection is that an attacker can leverage a user's known location to determine what data that user submitted. For example, in a remote health monitoring system, if each health report includes the location of the user, then an attacker could follow the user and use the user's location to determine which health report belongs to that user. Conversely, if specific information about the user is known, such as hair color, eye color, and weight, the attacker could potentially use this information to filter the data reports and determine the user's exact latitude and longitude.
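The first attack described above can be illustrated with a minimal sketch. It assumes each report carries a precise latitude/longitude and that the attacker has observed the user's position. The `Report` type, the `link_reports` helper, the tolerance value, and the sample readings are all hypothetical and serve only to illustrate the linkage; they are not taken from any deployed system.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Report:
    """Hypothetical data report that includes a precise location."""
    report_id: int
    lat: float
    lon: float
    payload: str  # private reading, e.g. a health measurement


def link_reports(reports, observed_lat, observed_lon, tolerance_deg=0.0005):
    """Return the reports whose coordinates fall within `tolerance_deg`
    (in degrees, a crude planar distance) of the observed location."""
    return [
        r for r in reports
        if hypot(r.lat - observed_lat, r.lon - observed_lon) <= tolerance_deg
    ]


# Illustrative, made-up readings.
reports = [
    Report(1, 37.2296, -80.4139, "heart rate 72"),
    Report(2, 37.2310, -80.4200, "heart rate 110"),
    Report(3, 37.2250, -80.4100, "heart rate 65"),
]

# The attacker saw the user at roughly this position.
matches = link_reports(reports, 37.2296, -80.4140)
print([r.report_id for r in matches])
```

When only one report lies near the observed position, the attacker learns that user's private reading. The attack becomes ambiguous only when several reports share the same reported location.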
One promising approach to protecting users' location data, and to helping prevent location data from being used to find private user data, is geographical k-anonymity [12]. Geographical k-anonymity involves making an informed guess regarding the temporal and spatial distribution of incoming data, and using this assumption to logically break a geographical area into a number of regions [12] that each contain a minimum number of data readings, as shown in Figure 1. Once this tessellation map is shared with end-user smartphone devices, the devices can report region IDs instead of specific latitude/longitude locations. If the real-world incoming data distribution matches the assumption used to generate the regional tiles, then the incoming data will have the property of being k-anonymous, where at least k data readings are indistinguishable from one another [12], [15]. This ambiguity helps to prevent an attacker from determining a user's exact latitude/longitude or discovering the user's private data using that latitude/longitude.
A good k-anonymity tessellation map is critical for preserving user privacy in a smartphone-powered data collection system. A key challenge with current tessellation approaches for k-anonymity is that they use a static, one-time tessellation based on predictions about the data that will enter the system. If the prediction used to generate the tessellation differs too much from the actual incoming data, the algorithms experience quality of service failures where either privacy is not preserved, or data precision is reduced needlessly. Even with an accurate prediction, unpredictable events such as disasters can radically alter the distribution of incoming data,