Real-time and Non-real-time Voice Conversion Systems with Web Interfaces
Elias Azarov, Maxim Vashkevich, Denis Likhachov, Alexander Petrovsky
Computer engineering department, Belarusian State University of Informatics and Radioelectronics,
6, P.Brovky str., 220013, Minsk, Belarus
{azarov, vashkevich, likhachov, palex}@bsuir.by
Abstract
Two speech processing systems have been developed for real-time and non-real-time voice conversion. Using real-time processing, the user can apply conversion during voice over IP (VoIP) calls, imitating the identity of a specified target speaker. The non-real-time system converts prerecorded audio books read by a professional reader, imitating the voice of the user. Both systems require some speech samples of the user for training. The training procedures are similar for both systems; however, the user is treated as the source speaker in the first case and as the target speaker in the second. For parametric representation of speech we use a model based on instantaneous harmonic parameters with multicomponent sinusoidal excitation. The voice conversion itself is performed using artificial neural networks (ANNs) with rectified linear units. Here we demonstrate implementations of the voice conversion systems with dedicated web interfaces and an iPhone application.
Index Terms: voice conversion, VoIP, instantaneous speech
parameters, neural networks
1. Introduction
In this paper we present a voice conversion technique that has been implemented in two versions: for real-time (referred to as 'CloneVoice') and non-real-time (referred to as 'CloneAudioBook') speech processing. CloneVoice is intended for VoIP communications and allows the user of the system to speak in somebody else's voice. The current implementation establishes a VoIP-to-GSM connection through a voice conversion server, as shown in Figure 1. A dedicated iPhone application provides access to the voice conversion server.
Figure 1: Schematic representation of real-time voice
conversion using VoIP
CloneAudioBook is applied to prerecorded audio books stored in a database. The audio book chosen by the user is processed by the voice conversion server and can then be downloaded through a web interface, as shown in Figure 2. The aim of the conversion is to change the voice of the original reader to the voice of the user.
Figure 2: Schematic representation of non-real-time
voice conversion for audio books
Before conversion can be performed, the user is asked to utter a set of phrases that are used to train the voice conversion function.
Both the CloneVoice and CloneAudioBook systems are to become publicly available in August 2013 at http://clonevoice.com/en.
The development of these voice conversion applications was inspired by the recent success of neural networks applied to voice conversion [1] and by recent advances in speech morphing models [2].
2. Implementation
The system is divided into two main stages, training and conversion, as shown in Figure 3. For training, parallel utterances of the source and target speakers are used: they are aligned in time, and the conversion function is then trained as an ANN. The conversion function maps features of the source speaker to those of the target speaker. The training core is implemented in MATLAB and compiled into executables using the built-in compiler.
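The mapping can be sketched as a small feed-forward network with ReLU hidden units, trained by gradient descent on time-aligned source/target feature pairs. The sketch below is illustrative only: the dimensions, learning rate, iteration count, and synthetic data are assumptions, not the paper's configuration (the actual training core is in MATLAB).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 24-dim source/target feature vectors, 64 hidden units.
D_IN, D_HID, D_OUT = 24, 64, 24
N = 500  # number of time-aligned frame pairs

# Synthetic stand-ins for time-aligned source (X) and target (Y) features.
X = rng.standard_normal((N, D_IN))
Y = 0.1 * (X @ rng.standard_normal((D_IN, D_OUT)))

W1 = 0.1 * rng.standard_normal((D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = 0.1 * rng.standard_normal((D_HID, D_OUT)); b2 = np.zeros(D_OUT)

lr, losses = 0.05, []
for _ in range(200):
    H = np.maximum(X @ W1 + b1, 0.0)   # ReLU hidden layer
    P = H @ W2 + b2                    # predicted target features
    err = P - Y
    losses.append(float(np.mean(err ** 2)))
    dP = 2.0 * err / N                 # gradient of the mean squared error
    dW2, db2 = H.T @ dP, dP.sum(axis=0)
    dH = dP @ W2.T
    dH[H <= 0.0] = 0.0                 # ReLU gradient mask
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

With the trained weights, a converted feature frame is simply `np.maximum(x @ W1 + b1, 0.0) @ W2 + b2` for a source frame `x`.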
Figure 3: Schematic representation of the voice
conversion system
During conversion, the conversion function is applied to the speech features, and the waveform of the output speech is then synthesized. Since the conversion stage is time critical, it is implemented in C++.
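The per-frame data flow of the conversion stage can be sketched as follows. This is a minimal Python illustration only (the deployed stage is C++), and `analyze`, `convert`, and `synthesize` are hypothetical stand-ins for the analysis, trained ANN mapping, and synthesis steps:

```python
def convert_stream(frames, analyze, convert, synthesize):
    """Apply analysis, conversion, and synthesis to each incoming frame.

    frames: iterable of raw audio frames (e.g. 5 ms chunks);
    analyze/convert/synthesize: stand-ins for the three processing stages.
    """
    for frame in frames:
        features = analyze(frame)       # extract envelope, pitch, excitation type
        converted = convert(features)   # trained source-to-target feature mapping
        yield synthesize(converted)     # re-synthesize the output waveform frame
```

Because each frame is processed independently as it arrives, the loop itself adds no latency beyond that of the analysis window.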
2.1. Feature extraction and synthesis
2.1.1. Feature extraction
For training and conversion, a parametric representation of speech is used: the instantaneous spectral envelope, pitch, and excitation type (voiced, unvoiced, or mixed) are extracted for each 5 ms of the signal.
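As a rough illustration of such frame-based analysis, pitch and a voiced/unvoiced decision can be estimated per 5 ms hop with a simple autocorrelation sketch. This is not the paper's method (which uses instantaneous harmonic parameters from a filter bank); the window length, search range, and voicing threshold below are illustrative assumptions:

```python
import numpy as np

def frame_features(signal, sr, hop_ms=5.0, win_ms=30.0):
    """Toy frame-based analysis: autocorrelation pitch estimate and a crude
    voiced/unvoiced decision for every 5 ms hop (illustrative only)."""
    hop = int(sr * hop_ms / 1000)
    win = int(sr * win_ms / 1000)
    feats = []
    for start in range(0, len(signal) - win, hop):
        frame = signal[start:start + win]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode='full')[win - 1:]  # lags 0..win-1
        ac = ac / (ac[0] + 1e-12)                               # normalize
        lo, hi = int(sr / 400), int(sr / 60)                    # search 60-400 Hz
        lag = lo + int(np.argmax(ac[lo:hi]))
        voiced = ac[lag] > 0.3                                  # ad-hoc threshold
        pitch = sr / lag if voiced else 0.0
        feats.append((pitch, 'voiced' if voiced else 'unvoiced'))
    return feats
```

For a clean 150 Hz sinusoid at 16 kHz, every frame is classified as voiced with an estimated pitch close to 150 Hz.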
Deterministic/stochastic separation of the signal is performed in order to obtain an accurate estimate of the spectral envelopes. First, the instantaneous spectral envelope is estimated from instantaneous harmonic parameters, which are extracted using a DFT-modulated filter bank. Based on the calculated parameters, the subband signals are classified as periodic or stochastic. From periodic subbands a harmonic part of the
Copyright © 2013 ISCA 25 - 29 August 2013, Lyon, France
INTERSPEECH 2013: Show & Tell Contribution