978-1-7281-5761-0/20/$31.00 ©2020 IEEE
Voice Antispoofing System Vulnerabilities Research
Aleksandr M. Sinitca¹, Nikita V. Efimchik,
Evgeniy D. Shalugin, Vladimir A. Toropov
Faculty of Computer Science and Technology
Saint Petersburg Electrotechnical University "LETI"
St. Petersburg, Russia
¹amsinitca@etu.ru
Konstantin Simonchik
ID R&D Inc.
New York, USA
simonchik@idrnd.net
Abstract— Recently, the problem of protecting information
systems from various types of spoofing has been gaining relevance. This
article presents a study of a voice anti-spoofing system aimed at
finding vulnerabilities to text-to-speech attacks. As part of the
study, a new test dataset was created for the voice anti-spoofing
system; it includes about 150,000 audio recordings of more than
15,000 phrases in 25 languages, synthesized by 8 TTS engines. The
study showed that detection quality varies with the voice of the
text-to-speech converter and that the system is vulnerable to signal
noise; both findings reveal characteristics of the detector. The
results will help improve the detection of text-to-speech converters.
Keywords— antispoofing; vulnerabilities; text-to-speech
I. INTRODUCTION
Technologies for speech synthesis and modeling are developing
very quickly and can now produce voice recordings that are
almost indistinguishable from real ones. Services that do this
are called Text-to-Speech (TTS). Protecting systems from attacks
based on synthesized speech is therefore one of the most
pressing problems today. Many researchers are developing
algorithms that can distinguish a synthesized machine voice from
a real one. These algorithms need to be tested thoroughly to
make sure that the system really works, and the quality and
diversity of the tests matter as much as the tests themselves.
To turn text into speech, 6 products were used: IBM Cloud API,
Google Cloud Platform, Baidu TTS, Amazon Polly, Yandex
SpeechKit, and MaryTTS. In IBM Cloud API and Google Cloud
Platform, deep neural network engines were used in addition to
the conventional speech engines.
II. SYSTEM UNDER TEST
No mature, ready-made solutions for protecting voice systems
from spoofing are openly available. There are so-called
"anti-fraud" solutions that analyze audio, voice, behavior and
metadata to produce risk assessments of calls and customer
credentials. While such solutions imply the possible existence
of anti-spoofing components, nowhere is it stated that these
components actually exist or, if they do, how they work.
Moreover, such solutions are largely based on telephony methods.
There are many commercial anti-spoofing solutions, such as those
from Microsoft, Nuance, STC and many others, but there is no
open access to their systems. There are also freely available
articles describing solutions from various competitions, such as
the ASVspoof competition [1]. However, finished products based
on these solutions are not publicly available.
The study tested a system that uses the most common
techniques and methods of building anti-spoofing systems. The
system that showed the best results at the ASVspoof 2019
competition [2] was chosen as the target for the research of the
vulnerabilities of anti-spoofing systems. The Python API version
of the system was used.
III. TESTING METHODOLOGY
Testing was carried out in 3 stages. First, a corpus of texts
in N different languages was collected. In the next step, all the
prepared phrases were converted into audio recordings (in WAV
format) using the TTS services mentioned earlier. These audio
recordings were then processed by the system under test, which
determined which were human speech and which were
artificially synthesized speech. At the next stage, a number of
recordings that did not pass the test (those the system
classified as artificially synthesized) were selected and white
noise was added to them. The augmented audio recordings were
again processed by the system under test. Finally, an SDK was
developed to automate the testing of anti-spoofing systems, and,
on the basis of the tests conducted, the attack resistance of the
system under study, which is built on the most popular
anti-spoofing methods, was analyzed.
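
The noise stage described above can be illustrated with a minimal sketch. It is not the SDK developed in the study. It assumes Gaussian white noise mixed in at a chosen signal-to-noise ratio, and the names add_white_noise, retest_with_noise, the snr_db parameter, and the is_spoof callable (a stand-in for the system under test) are all hypothetical.

```python
import math
import random


def add_white_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with Gaussian white noise at `snr_db` dB SNR."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def retest_with_noise(recordings, is_spoof, snr_db, seed=0):
    """Re-score recordings that were flagged as synthetic after adding noise.

    `recordings` maps a name to a list of float samples. `is_spoof` is a
    hypothetical callable standing in for the system under test; it returns
    True when a recording is judged to be synthesized speech.
    Returns the names whose verdict flips from synthetic to genuine.
    """
    rng = random.Random(seed)
    flagged = {name: x for name, x in recordings.items() if is_spoof(x)}
    flipped = []
    for name, samples in flagged.items():
        noisy = add_white_noise(samples, snr_db, rng)
        if not is_spoof(noisy):
            flipped.append(name)
    return flipped
```

Any recording returned by retest_with_noise was detected as synthesized when clean but accepted as genuine once noise was added, which is the kind of vulnerability this stage looks for. Sweeping snr_db over several values would show how much noise is needed to change the verdict.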
To search for vulnerabilities of the anti-spoofing system to the
text-to-speech attack vector, the approach chosen was to build a
custom dataset. For a set of main languages it is based on
publicly available texts in those languages; the rest of the
corpus is inherited from the translation dataset built from
materials of the European Parliament [3], which contains 20
language pairs. In addition, for the Chinese language, the
Data-Baker's TTS data dataset is used.
Taking into account the languages and voices available in the
TTS services considered, the following dataset was obtained:
• IBM Cloud API: de: 540, en: 2496, es: 1508, fr: 784, it:
542, ja: 97, pt: 748. Total: 6715
• Google Cloud API: ar-XA: 600, cs-CZ: 468, da-DK:
522, de-DE: 1560, el-GR: 532, en-GB: 2496, en-US:
3120, es-ES: 377, fi-FI: 528, fr-FR: 3136, hu-HU: 462,
it-IT: 2168, ja-JP: 776, nl-NL: 2670, pl-PL: 2390, pt-