XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE

Enhancing Cyber Security Using Audio Techniques: A Public Key Infrastructure for Sound

Anthony Phipps, Cyber Security Research Centre, London Metropolitan University, London, UK, arp0264@my.londonmet.ac.uk
Karim Ouazzane, Cyber Security Research Centre, London Metropolitan University, London, UK, k.ouazzane@londonmet.ac.uk
Vassil Vassilev, Cyber Security Research Centre, London Metropolitan University, London, UK, v.vassilev@londonmet.ac.uk

Abstract - This paper details research into using audio signal processing methods to provide authentication and identification services, with the aim of enhancing cyber security in voice applications. Audio is a growing domain for cyber security technology. It is envisaged that over the next decade, voice will become the primary interface for issuing commands to consumer internet-enabled devices. Increasingly, devices such as desktop computers, smart speakers, cars, TVs, phones and Internet of Things (IoT) devices have built-in voice assistants and voice-activated features. This research outlines an approach to securely identify and authenticate users of audio and voice-operated systems. The approach uses audio steganography, in a method comparable to a PKI for sound, together with existing cryptographic methods, whilst retaining the usability associated with audio and voice-driven systems.

Keywords - Authentication, Steganography, Two-factor Authentication, Cyber Security, Audio Security

I. INTRODUCTION

Audio is a growing domain for cyber security technology. It is envisaged that over the next decade, voice will become the primary interface for issuing commands to consumer internet-enabled devices. Increasingly, devices such as desktop computers, smart speakers, cars, TVs, phones and Internet of Things (IoT) devices have built-in voice assistants and voice-activated features.
Already, in the context of digital services, 50% of all adults have used voice for internet search and there are over a billion voice searches per month. [1] Powerful drivers such as accessibility, increased accuracy, device design (screenless and keyboardless devices), convenience and speed of communication will continue the trend towards increased use of audio and voice as a channel for interacting with information technology. [2] Additionally, speaking is much faster than typing for the average person, and recent advances in machine learning for voice recognition and biometrics have improved the accuracy of this technology. However, significant challenges remain before it can be used in high-security and high-reliability environments. Although voice and audio methods of authentication are relatively inexpensive and non-intrusive, they still perform relatively poorly in noisy environments. [3]

At the outset, the purpose of this research was to investigate new methods of identifying and authenticating users accessing IT systems based on audio processing, in a model that can combine factors such as something you have, something you know and something you are with contextual information (such as user ID, device, location, background sounds, health information and emotional state) and cryptographic information.

II. MOTIVATION AND RATIONALE

With voice increasingly becoming the interface of choice for users of information systems, security techniques must evolve. Many of today's authentication and identification solutions for voice channels (such as voice biometrics) have serious limitations in terms of security and usability. [4] New forms of attack are emerging that are inaudible or incomprehensible to the human owners of voice-controlled systems and assistants, yet allow malicious actors to gain covert access to such systems.
[5]

Figure II-1 Increasing Application for Voice Control in Cars

As can be seen in Figure II-1, voice control in cars can range from trivial requests, such as asking for music to play, to safety-related queries such as "are my brake pads still ok?" Research has drawn attention to serious limitations of voice-only interactions with smart speakers and phones, and to the lack of command confirmation, voice authentication and any additional authentication factors. [6] [7] In addition, recent research has shown that it is possible to use light to remotely inject inaudible and invisible malicious commands into voice-controlled devices such as smart speakers, tablets and phones, across large distances, through glass windows and even from adjacent buildings. [8]

Imagine this scenario: a remote laser triggers your car to come off autopilot whilst you are driving on the motorway. Even if a vocal confirmation prompt were added, the attacker could easily anticipate the prompt and inject the confirmation as well.

Figure II-2 Setup for low-cost laser attack on Google Home (Sugawara et al. 2019)