Enhancing Cyber Security Using Audio Techniques:
A Public Key Infrastructure for Sound
Anthony Phipps
Cyber Security Research Centre
London Metropolitan University
London, UK
arp0264@my.londonmet.ac.uk
Karim Ouazzane
Cyber Security Research Centre
London Metropolitan University
London, UK
k.ouazzane@londonmet.ac.uk
Vassil Vassilev
Cyber Security Research Centre
London Metropolitan University
London, UK
v.vassilev@londonmet.ac.uk
Abstract - This paper details research into using audio
signal processing methods to provide authentication and
identification services for the purpose of enhancing cyber
security in voice applications. Audio is a growing domain for
cyber security technology. It is envisaged that over the next
decade, the primary interface for issuing commands to
consumer internet-enabled devices will be voice. Increasingly,
devices such as desktop computers, smart speakers, cars, TVs,
phones and Internet of Things (IoT) devices all have built-in
voice assistants and voice-activated features. This research
outlines an approach to securely identify and authenticate users
of audio and voice-operated systems. The approach combines audio
steganography, in a method comparable to a public key
infrastructure (PKI) for sound, with existing cryptographic
methods, whilst retaining the usability associated with audio
and voice-driven systems.
Keywords - Authentication, Steganography, Two-factor
Authentication, Cyber Security, Audio Security
I. INTRODUCTION
Audio is a growing domain for cyber security technology.
It is envisaged that over the next decade, the primary interface
for issuing commands to consumer internet-enabled devices
will be voice. Increasingly, devices such as desktop
computers, smart speakers, cars, TVs, phones and Internet of
Things (IoT) devices all have built-in voice assistants and
voice-activated features. Already in the context of digital
services, 50% of all adults have used voice for internet search
and there are over a billion voice searches per month. [1]
Powerful drivers such as accessibility, increased accuracy,
device design (screenless and keyboardless devices),
convenience and speed of communication will drive the trend
for increased use of audio and voice as a channel to interact
with information technology. [2] Additionally, verbal
communication is much faster than the typing speed of the
average person, and recent advances in machine learning for
voice recognition and biometrics have improved the accuracy
of this technology. However, significant challenges remain
before it can be deployed in high-security and high-reliability
environments. Despite the relatively inexpensive and
non-intrusive nature of voice and audio methods of
authentication, they still perform relatively poorly in noisy
environments. [3] At the outset, the purpose of this research
was to investigate new methods of identification and
authentication of users accessing IT systems based on audio
processing, in a model that can combine factors such as
something you have, something you know and something you are
with contextual information such as user ID, device, location,
background sounds, health information and emotional state,
together with cryptographic information.
II. MOTIVATION AND RATIONALE
With voice increasingly becoming the interface of choice
for users of information systems, security techniques must
evolve. Many of today’s authentication and identification
solutions for voice channels (such as voice biometrics) have
serious limitations in terms of security and usability. [4] New
forms of attack are emerging that allow malicious actors to
gain covert access to voice-controlled systems and assistants
in ways that are inaudible or incomprehensible to the human
owners of such systems. [5]
Figure II-1 Increasing Application for Voice Control in Cars
As can be seen in Figure II-1, voice control can range from
requests as trivial as asking for music to play to more
safety-related queries such as “are my brake pads still ok?”
Research has drawn attention to serious limitations of
voice-only interactions with smart speakers and phones, and to
the lack of command confirmation, voice authentication and any
additional authentication factors.
[6] [7] In addition, recent research has shown that it is
possible to use light to remotely inject inaudible and invisible
malicious commands into voice-controlled devices such as smart
speakers, tablets and phones across large distances, even
through glass windows and from adjacent buildings.
[8] Imagine this scenario: a remote laser triggers your car
to come off autopilot whilst you are driving on the motorway.
Even if a vocal confirmation prompt were added, the attacker
could easily anticipate it and inject the confirmation as well.
Figure II-2 Setup for low-cost laser attack on Google Home
(Sugawara et al. 2019)