Is This You? Identifying a Mobile User Using Only Diagnostic Features

Anthony Quattrone, Tanusri Bhattacharya, Lars Kulik, Egemen Tanin, James Bailey
The University of Melbourne
{quattronea,tbhattachary,lkulik,etanin,baileyj}@unimelb.edu.au

ABSTRACT
Mobile smart phones capture a great amount of information about a user across a variety of different data domains. This information can be sensitive and allow for identifying a user profile, thus causing potential threats to a user’s privacy. Our work shows that diagnostic information that is not considered sensitive could be used to identify a user after just three consecutive days of monitoring. We have used the Device Analyzer dataset to determine what features of a mobile device are important in identifying a user. Many mobile games and applications collect diagnostic data as a means of identifying or resolving issues. Diagnostic data is commonly accepted as less sensitive information. Our experimental results demonstrate that using only diagnostic features like hardware statistics and system settings, a user’s device can be identified with an accuracy of 94% with a Naive Bayes classifier.

Categories and Subject Descriptors
I.5 [Pattern Recognition]: Models; H.2.8 [Database Applications]: Data Mining

Keywords
Mobile Privacy; Predictive Modeling; Inference Attacks; Mobile Analytics

1. INTRODUCTION
Modern smart phones capture a great amount of personal information about a user across a variety of different data domains. Because many smart phones have Internet access, extractable diagnostic features can be collected on a regular basis by a large number of mobile apps. Mobile app marketplaces such as Google Play and Apple App Store are convenient for both application developers and mobile users, providing centralized services for downloading third-party applications.
This has led to an explosion in the development and usage of mobile applications [11]. Many of these applications capture raw data from a user’s device and upload it to a remote database in order to deliver certain services. While these applications can provide significant benefit to users, they also risk disclosing sensitive user information.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
MUM ’14, November 25 - 28 2014, Melbourne, VIC, Australia
Copyright 2014 ACM 978-1-4503-3304-7/14/11 ...$15.00.
http://dx.doi.org/10.1145/2677972.2677999

In smart phone applications, location data collected via GPS, Wi-Fi, RFID or Bluetooth sensors is considered the most sensitive information, causing the most severe privacy risks [8, 5, 14, 16]. Sensitive personal data can also be captured through the camera, microphone, and accelerometer sensors installed in smart phones. There is also other, seemingly less sensitive information, such as hardware statistics or system settings, that can be easily accessed. These features can be easily extracted by a mobile application like Device Analyzer [1]. In this paper, we have analyzed the Device Analyzer dataset [15] to see which features, other than the obviously sensitive information, are important for identifying a user’s device.
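
To make the attack setting concrete, the following is a minimal sketch of a categorical Naive Bayes classifier that maps one day of diagnostic features to a device. It is an illustration only, not the implementation evaluated in this paper: the feature names, feature values, device labels, and the add-one smoothing are all assumptions chosen for the example and are not taken from the Device Analyzer dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical daily records: (device_id, features). Names and values are
# invented for illustration; they are not fields of the Device Analyzer dataset.
TRAIN = [
    ("device_a", {"manufacturer": "samsung", "free_storage": "low", "brightness": "auto"}),
    ("device_a", {"manufacturer": "samsung", "free_storage": "low", "brightness": "auto"}),
    ("device_a", {"manufacturer": "samsung", "free_storage": "mid", "brightness": "auto"}),
    ("device_b", {"manufacturer": "htc", "free_storage": "high", "brightness": "manual"}),
    ("device_b", {"manufacturer": "htc", "free_storage": "high", "brightness": "manual"}),
    ("device_b", {"manufacturer": "htc", "free_storage": "mid", "brightness": "manual"}),
]


def train(records):
    """Count how often each feature value occurs for each device."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (device, feature) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of all observed values
    for device, features in records:
        class_counts[device] += 1
        for name, value in features.items():
            value_counts[(device, name)][value] += 1
            vocab[name].add(value)
    return class_counts, value_counts, vocab


def predict(model, features):
    """Return the device with the highest Naive Bayes log-posterior."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_device, best_score = None, -math.inf
    for device, n in class_counts.items():
        score = math.log(n / total)  # log prior of this device
        for name, value in features.items():
            # Add-one smoothing (an assumption of this sketch) so that a value
            # never seen for this device does not zero out the whole product.
            count = value_counts[(device, name)][value]
            score += math.log((count + 1) / (n + len(vocab[name])))
        if score > best_score:
            best_device, best_score = device, score
    return best_device


model = train(TRAIN)
unseen_day = {"manufacturer": "htc", "free_storage": "mid", "brightness": "manual"}
print(predict(model, unseen_day))
```

Each feature contributes independently to the score, which is the Naive Bayes assumption; even individually weak features such as a storage level can therefore combine into a confident device prediction.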
To make the data more accessible for our analysis, we first transformed the raw dataset into an aggregated dataset that provides context about each user at a daily level. A web application has been developed as part of this aggregation process to describe the daily-level context of a Device Analyzer user. We have modeled a Naive Bayes classifier to learn a user’s device using less sensitive features such as hardware statistics. Our experiment shows that using only information like manufacturer name, internal and external memory usage and system settings, a user profile can be predicted with an accuracy of 94%. Only three consecutive days of monitoring diagnostic features are necessary to identify a user profile, a time period that is short for normal app usage. A mobile app has access to direct features that can uniquely identify a device, such as a Wi-Fi MAC address; however, the diagnostic information used in our experiments is less likely to be suspected of posing a privacy threat and is more widely distributed to remote servers. For example, a mobile hardware manufacturer may have access to its customers’ hardware usage statistics for analysing performance.

This finding is a threat to user privacy, as an adversary could learn the identity of a user profile if they have access to an additional dataset that contains the user’s name, or if the user pays for a service and reveals their name to complete the transaction. Once the identity of a user is