Novel approach of MFCC based alignment and WD-residual modification for voice conversion using RBF

Jagannath Nirmal a,*, Mukesh Zaveri b, Suprava Patnaik c, Pramod Kachare d

a Department of Electronics Engineering, K.J. Somaiya College of Engineering, Mumbai 400077, India
b Department of Computer Engineering, S.V. National Institute of Technology, Surat 395007, India
c Department of Electronics Engineering, S.V. National Institute of Technology, Surat 395007, India
d Department of Electronics Engineering, Veermata Jeejabai Institute of Technology, Mumbai 400031, India

Article info

Article history: Received 7 March 2015; Received in revised form 30 May 2016; Accepted 31 July 2016; Communicated by R. Capobianco Guido

Keywords: Dynamic time warping; Gaussian mixture model; LP-residual; Line spectral frequencies; Mel frequency cepstrum coefficient; Radial basis function; Residual selection method; Wavelet packet transform

Abstract

The voice conversion system modifies the speaker specific characteristics of the source speaker to those of the target speaker, so that the converted speech is perceived as spoken by the target speaker. The speaker specific characteristics of the speech signal are reflected at different levels, such as the shape of the vocal tract, the shape of the glottal excitation and long term prosody. The shape of the vocal tract is represented by Line Spectral Frequencies (LSF) and the shape of the glottal excitation by Linear Predictive (LP) residuals. In this paper, a fourth-level wavelet packet transform is applied to the LP-residual to generate sixteen sub-bands. This approach not only reduces the computational complexity but also offers a genuine transformation model compared with state-of-the-art statistical prediction methods. In voice conversion, alignment is an essential process that aligns the features of the source and target speakers.
In this paper, a Mel Frequency Cepstrum Coefficient (MFCC) based warping path is proposed to align the LSF and LP-residual sub-bands using the proposed constant source and constant target alignment. The conventional alignment technique is compared with the two proposed approaches, namely constant source and constant target. Analysis shows that constant source alignment using the MFCC warping path performs slightly better than both constant target alignment and the state-of-the-art alignment approach. Generalized mapping models are developed for each sub-band using a Radial Basis Function (RBF) neural network and are compared with the Gaussian Mixture Model (GMM) based mapping and the residual selection approach. Various subjective and objective evaluation measures indicate a significant performance improvement of the RBF based residual mapping approach over the state-of-the-art approaches.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

The voice conversion system aims to adapt the acoustic characteristics of a given (i.e. source) speaker to those of a particular (i.e. target) speaker [1]. It typically involves two stages: (i) training and (ii) transformation. In the training phase, the voice conversion system identifies and extracts the speaker specific features from the utterances of both the source and the target speaker. These source and target features are used to formulate the mapping function that captures the nonlinear relations between the speaker specific features. Afterwards, the transformation phase employs the trained mapping function to modify the features of the source speaker so that they are perceptually similar to those of the target speaker [2–4].

The training phase of voice conversion involves acoustic modelling, feature alignment and acoustic mapping. Acoustic modelling characterizes the shape of the vocal tract, the shape of the glottal excitation and the long term prosodic parameters [5–7].
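The idea stated in the abstract, computing a warping path on MFCC frames and reusing it to align other feature streams, can be sketched as follows. This is only a minimal illustration using classic dynamic time warping; the function names dtw_path and align_with_path, the feature dimensions and the synthetic arrays are assumptions made for this example. The constant source and constant target variants are not specified in this part of the paper and are not implemented here.

```python
import numpy as np


def dtw_path(src, tgt):
    """Classic DTW between two feature sequences (hypothetical helper).

    src: (n_src, dim) array, tgt: (n_tgt, dim) array.
    Returns a list of (source_frame, target_frame) index pairs.
    """
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distances between all source and target frames.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    # Accumulated cost, padded with an infinite border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last frame pair to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def align_with_path(src_feat, tgt_feat, path):
    """Reuse a warping path to pair frames of another feature stream."""
    src_idx, tgt_idx = zip(*path)
    return src_feat[list(src_idx)], tgt_feat[list(tgt_idx)]


# Synthetic stand-ins for real features (assumption: 13 MFCCs, 10 LSFs).
rng = np.random.default_rng(0)
src_mfcc = rng.normal(size=(6, 13))
tgt_mfcc = np.repeat(src_mfcc, 2, axis=0)  # target = time-stretched source
src_lsf = rng.uniform(size=(6, 10))
tgt_lsf = np.repeat(src_lsf, 2, axis=0)

path = dtw_path(src_mfcc, tgt_mfcc)            # path found on MFCCs only
src_al, tgt_al = align_with_path(src_lsf, tgt_lsf, path)
print(len(path), src_al.shape, tgt_al.shape)
```

In this sketch the path is computed once on the MFCC stream and then applied unchanged to the LSF stream; the same path could be applied to each LP-residual sub-band, provided all streams share the same frame rate.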
Among these acoustic parameters, the vocal tract parameters are relatively more prominent for identifying speaker uniqueness than the source excitation parameters [6,8].

Various feature extraction methods have been proposed in the literature to characterize the vocal tract parameters of a speech frame, namely formant frequency [5], formant bandwidth [5,9], Linear Prediction Coding (LPC) [10], cepstrum coefficients [11], Mel Cepstrum Envelope (MCEP) [12], Mel Generalized Cepstrum (MGC) [1] and Line Spectral Frequencies (LSFs) [13–15]. Among these feature representations, LSFs yield better speech quality than the other features [15]. The glottal excitation signal is another important parameter that conveys essential information about speaker identity [8].

In a high quality voice conversion system, the alignment of the source and target samples is of utmost importance for obtaining parallel data prior to the estimation of the mapping functions.

Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
http://dx.doi.org/10.1016/j.neucom.2016.07.048
0925-2312/© 2016 Elsevier B.V. All rights reserved.
* Corresponding author.
E-mail addresses: jhnirmal@somaiya.edu (J. Nirmal), mazaveri@gmail.com (M. Zaveri), suprava_patnaik@yahoo.com (S. Patnaik), pramod_1991@yahoo.com (P. Kachare).

Time