Real-time and Non-real-time Voice Conversion Systems with Web Interfaces
Elias Azarov, Maxim Vashkevich, Denis Likhachov, Alexander Petrovsky
Computer engineering department, Belarusian State University of Informatics and Radioelectronics,
6, P.Brovky str., 220013, Minsk, Belarus
{azarov, vashkevich, likhachov, palex}@bsuir.by
Abstract
Two speech processing systems have been developed for real-time and non-real-time voice conversion. Using real-time processing, the user can apply conversion during voice over IP (VoIP) calls, imitating the identity of a specified target speaker. The non-real-time system converts prerecorded audio books read by a professional reader, imitating the voice of the user. Both systems require some speech samples of the user for training. The training procedures are similar for both systems; however, the user is treated as the source speaker in the first case and as the target speaker in the second. For parametric representation of speech we use a model based on instantaneous harmonic parameters with multicomponent sinusoidal excitation. The voice conversion itself is performed using artificial neural networks (ANNs) with rectified linear units. Here we demonstrate implementations of the voice conversion systems with dedicated web interfaces and an iPhone application.
Index Terms: voice conversion, VoIP, instantaneous speech
parameters, neural networks
1. Introduction
In this paper we present a voice conversion technique that has been implemented in two versions: for real-time (referred to as 'CloneVoice') and non-real-time (referred to as 'CloneAudioBook') speech processing. CloneVoice is intended for VoIP communications and allows the user of the system to speak in somebody else's voice. The current implementation establishes a VoIP-to-GSM connection through a voice conversion server, as shown in Figure 1. A dedicated iPhone application provides access to the voice conversion server.
Figure 1: Schematic representation of real-time voice
conversion using VoIP
CloneAudioBook is applied to prerecorded audio books stored in a database. The audio book chosen by the user is processed by the voice conversion server and can then be downloaded through a web interface, as shown in Figure 2. The aim of the conversion is to change the voice of the original reader to the voice of the user.
Figure 2: Schematic representation of non-real-time
voice conversion for audio books
Before conversion can be performed, the user is asked to utter a set of phrases that are used to train the voice conversion function.
Both the CloneVoice and CloneAudioBook systems are to become publicly available in August 2013 at http://clonevoice.com/en.
The development of these voice conversion applications was inspired by the recent success of neural networks applied to voice conversion [1] and by recent advances in speech morphing models [2].
2. Implementation
The system is divided into two main stages, training and conversion, as shown in Figure 3. For training, parallel utterances of the source and target speakers are used: they are aligned in time, and the conversion function is then trained as an ANN. The conversion function maps features of the source speaker to those of the target speaker. The training core is implemented in MATLAB and compiled into executables using the built-in compiler.
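The mapping can be sketched as a small feed-forward network with ReLU hidden units, trained by gradient descent on time-aligned source/target feature pairs. The sketch below is illustrative only: the dimensions, learning rate, iteration count, and synthetic data are assumptions, not the paper's configuration (the actual training core is in MATLAB).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 24-dim source/target feature vectors, 64 hidden units.
D_IN, D_HID, D_OUT = 24, 64, 24
N = 500  # number of time-aligned frame pairs

# Synthetic stand-ins for time-aligned source (X) and target (Y) features.
X = rng.standard_normal((N, D_IN))
Y = 0.1 * (X @ rng.standard_normal((D_IN, D_OUT)))

W1 = 0.1 * rng.standard_normal((D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = 0.1 * rng.standard_normal((D_HID, D_OUT)); b2 = np.zeros(D_OUT)

lr, losses = 0.05, []
for _ in range(200):
    H = np.maximum(X @ W1 + b1, 0.0)   # ReLU hidden layer
    P = H @ W2 + b2                    # predicted target features
    err = P - Y
    losses.append(float(np.mean(err ** 2)))
    dP = 2.0 * err / N                 # gradient of the mean squared error
    dW2, db2 = H.T @ dP, dP.sum(axis=0)
    dH = dP @ W2.T
    dH[H <= 0.0] = 0.0                 # ReLU gradient mask
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

With the trained weights, a converted feature frame is simply `np.maximum(x @ W1 + b1, 0.0) @ W2 + b2` for a source frame `x`.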
Figure 3: Schematic representation of the voice
conversion system
During conversion, the conversion function is applied to the speech features, and the waveform of the output speech is then synthesized. Since the conversion stage is time critical, it is implemented in C++.
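The per-frame data flow of the conversion stage can be sketched as follows. This is a minimal Python illustration only (the deployed stage is C++), and `analyze`, `convert`, and `synthesize` are hypothetical stand-ins for the analysis, trained ANN mapping, and synthesis steps:

```python
def convert_stream(frames, analyze, convert, synthesize):
    """Apply analysis, conversion, and synthesis to each incoming frame.

    frames: iterable of raw audio frames (e.g. 5 ms chunks);
    analyze/convert/synthesize: stand-ins for the three processing stages.
    """
    for frame in frames:
        features = analyze(frame)       # extract envelope, pitch, excitation type
        converted = convert(features)   # trained source-to-target feature mapping
        yield synthesize(converted)     # re-synthesize the output waveform frame
```

Because each frame is processed independently as it arrives, the loop itself adds no latency beyond that of the analysis window.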
2.1. Feature extraction and synthesis
2.1.1. Feature extraction
For training and conversion, a parametric representation of speech is used: the instantaneous spectral envelope, pitch, and excitation type (voiced, unvoiced, or mixed) are extracted for each 5 ms of the signal.
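As a rough illustration of such frame-based analysis, pitch and a voiced/unvoiced decision can be estimated per 5 ms hop with a simple autocorrelation sketch. This is not the paper's method (which uses instantaneous harmonic parameters from a filter bank); the window length, search range, and voicing threshold below are illustrative assumptions:

```python
import numpy as np

def frame_features(signal, sr, hop_ms=5.0, win_ms=30.0):
    """Toy frame-based analysis: autocorrelation pitch estimate and a crude
    voiced/unvoiced decision for every 5 ms hop (illustrative only)."""
    hop = int(sr * hop_ms / 1000)
    win = int(sr * win_ms / 1000)
    feats = []
    for start in range(0, len(signal) - win, hop):
        frame = signal[start:start + win]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode='full')[win - 1:]  # lags 0..win-1
        ac = ac / (ac[0] + 1e-12)                               # normalize
        lo, hi = int(sr / 400), int(sr / 60)                    # search 60-400 Hz
        lag = lo + int(np.argmax(ac[lo:hi]))
        voiced = ac[lag] > 0.3                                  # ad-hoc threshold
        pitch = sr / lag if voiced else 0.0
        feats.append((pitch, 'voiced' if voiced else 'unvoiced'))
    return feats
```

For a clean 150 Hz sinusoid at 16 kHz, every frame is classified as voiced with an estimated pitch close to 150 Hz.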
Deterministic/stochastic separation of the signal is performed in order to obtain an accurate estimate of the spectral envelopes. First, the instantaneous spectral envelope is estimated from instantaneous harmonic parameters, which are extracted using a DFT-modulated filter bank. Based on the calculated parameters, the subband signals are classified as periodic or stochastic. From periodic subbands a harmonic part of the
Copyright © 2013 ISCA 25 - 29 August 2013, Lyon, France
INTERSPEECH 2013: Show & Tell Contribution