Towards Privacy-Preserving Speech Data Publishing

Jianwei Qian*, Feng Han, Jiahui Hou*, Chunhong Zhang§, Yu Wang, Xiang-Yang Li*

*Department of Computer Science, Illinois Institute of Technology
School of Computer Science and Technology, University of Science and Technology of China
Department of Computer Science, University of North Carolina at Charlotte
§School of Information and Communication Engineering, Beijing University of Posts and Telecommunications

Abstract—Privacy-preserving data publishing has been an active research topic over the last decade. Numerous ingenious attacks on users’ privacy and defensive measures have been proposed for the sharing of various data, ranging from relational data, social network data, and spatiotemporal data to images and videos. Speech data publishing, however, remains untouched in the literature. To fill this gap, we study the privacy risk in speech data publishing and explore the possibility of sanitizing speech data to achieve privacy protection while simultaneously preserving data utility. We formulate this optimization problem in a general fashion and present thorough quantifications of privacy and utility. We analyze the complex impacts of possible sanitization methods on privacy and utility, and design a novel method, key term perturbation, for speech content sanitization. A heuristic algorithm is proposed to personalize the sanitization for each speaker so as to limit her privacy leak (the p-leak limit) while minimizing the utility loss. Simulations of linkage attacks and sanitization on real datasets validate the necessity and feasibility of this work.

I. INTRODUCTION

We have witnessed the pervasiveness of voice-based human-computer interaction: voice input keyboards, web search, voice assistants, and voice authentication. These applications have brought numerous benefits to our daily lives.
The big data era has spawned many data trading platforms, as the immense value of big data has become prominently manifest. The sharing or publishing of speech data is likewise an irresistible trend. From search engine giants to telecom companies, holders of speech data not only mine it to improve their services but may also share it with third parties for profit. For instance, Samsung and Apple have admitted to sharing voice data with third parties [1], [2]. Meanwhile, speech data may also be published to foster research on spoken language analysis, e.g., the TIMIT and NIST SRE datasets.

Speech data contain a rich amount of information about the speakers that can be inferred by mining their search history and voice commands, including their demographics, preferences, online behaviors, living habits, and interpersonal relationships. Some of this information may be sensitive to the speakers. For the sake of their privacy, all personally identifiable information (PII) associated with the data to be published has to be removed, including the speakers’ names, phone numbers, and device IDs. However, this is far from enough to prevent malicious third parties (attackers) from undermining the speakers’ privacy. There remains a privacy risk called the linkage attack, i.e., linking an anonymous speech recording to a real person in order to infer her sensitive information. This can be achieved by attackers with some background knowledge and reasoning ability.

Linkage attacks on speech data may compromise a person’s privacy in four ways. First, the content of the voice recording conveys many demographics and life details about the person. Demographics that can be inferred include, but are not limited to, gender, age, education level, ethnicity, geographic region [6], social status [7], and personality [33].
The life details may leak privacy too, such as schedules the person added to Google Calendar, products she purchased on Amazon Alexa, and even text messages and emails she wrote by voice input. From these details, the attacker is able to extract various private attributes and paint an accurate profile of the person. Second, the attacker can learn the person’s demographic categories solely by analyzing her voice. In fact, an intimidating number of attributes can be mined from voice, referred to as voice attributes, such as age, gender, ethnicity, geographic region (accent), height [15], emotion [19], and even health condition [28]. Third, the person’s voiceprint is leaked. The voiceprint, as a type of biometric, is widely applied in emerging systems for authentication. Unlike a password, which can be changed once stolen, a voiceprint is unchangeable. Once it is leaked to a miscreant, we can never again feel safe adopting voice authentication to secure our property, for fear of identity theft. Finally, the attacker learns that the person belongs to this dataset, which itself might be sensitive, for example when the dataset is a collection of utterances of heart disease patients. This is known as membership privacy.

The leak of the voiceprint further results in three security risks. The first is identity theft, as mentioned above: the attacker may commit spoofing attacks against voice authentication systems [32]. Besides, the victim could suffer reputation attacks: the attacker can fabricate recordings that sound like the victim’s voice but contain indecent or illegal content, to damage her reputation or frame her, e.g., the fake Obama speech.^1 Moreover, the victim may experience fraud attacks: the attacker may use her voice to agree to some terms, sign up for a paid service, and authorize bogus charges on a credit card.^2

We seek to answer four questions:
1) What is the potential risk of privacy leak in speech data publishing?
2) How to design sanitization methods suitable for speech data?
3) How to quantify their influence on privacy level and data utility?

^1 Fake Obama speech, https://goo.gl/pnR3VK
^2 The ‘Can you hear me?’ fraud, https://goo.gl/Wy3e7u
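To make the linkage-attack threat model concrete, the following minimal sketch links an “anonymous” recording to an enrolled speaker by comparing voiceprint embeddings under cosine similarity. This is purely illustrative, not the attack simulated in this paper: the 4-dimensional embeddings, speaker names, and the 0.7 threshold are all hypothetical stand-ins for the output of a real speaker-embedding model.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def link_recording(anon_emb, known_speakers, threshold=0.7):
    """Return the enrolled speaker whose voiceprint is most similar to the
    anonymous recording's embedding, or None if no match clears the threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, emb in known_speakers.items():
        score = cosine_sim(anon_emb, emb)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score

# Toy 4-dimensional "voiceprints" for two enrolled speakers (hypothetical values).
known = {
    "alice": [0.9, 0.1, 0.0, 0.2],
    "bob":   [0.1, 0.8, 0.3, 0.0],
}
anon = [0.85, 0.15, 0.05, 0.18]  # embedding of an unlabeled, "anonymized" recording

print(link_recording(anon, known))
```

The sketch shows why removing PII alone is insufficient: as long as the published audio still carries a stable voiceprint, an attacker with enrolled reference voices can re-identify the speaker by nearest-neighbor matching alone.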