Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks

Akihiro Kato, Tomi Kinnunen
University of Eastern Finland
akihiro.kato@uef.fi, tomi.kinnunen@uef.fi

Abstract

The fundamental frequency (F0) represents pitch in speech, determines its prosodic characteristics, and is needed in various speech analysis and synthesis tasks. Despite decades of research on this topic, F0 estimation at low signal-to-noise ratios (SNRs) in unexpected noise conditions remains difficult. This work proposes a new approach to noise-robust F0 estimation using a recurrent neural network (RNN) trained in a supervised manner. Recent studies employ deep neural networks (DNNs) for F0 tracking as a frame-by-frame classification task into quantised frequency states, but we propose waveform-to-sinusoid regression instead to achieve both noise robustness and accurate estimation with increased frequency resolution. Experimental results on the PTDB-TUG corpus contaminated by additive noise (NOISEX-92) demonstrate that the proposed method improves the gross pitch error (GPE) rate and fine pitch error (FPE) by more than 35% at SNRs between -10 dB and +10 dB compared with a well-known noise-robust F0 tracker, PEFAC. Furthermore, the proposed method also outperforms state-of-the-art DNN-based approaches by more than 15% in terms of both FPE and GPE rate over the same SNR range.

Index Terms: F0 estimation, pitch estimation, prosody analysis, voice activity detection, recurrent neural networks

1. Introduction

The fundamental frequency (F0) is the lowest frequency of a quasi-periodic signal. It represents pitch in speech, which determines its prosodic characteristics. F0 is therefore one of the key features of speech, and F0 estimation is vital for many applications, e.g.
voice conversion [1], speaker and language identification [2, 3], prosody analysis [4], speech coding [5], speech synthesis [6] and speech enhancement [7, 8].

Over the past decades, various approaches to F0 estimation have been proposed. In particular, the robust algorithm for pitch tracking (RAPT) [9] and YIN [10], which track F0 from time-domain signals, have been widely used in many applications and show high accuracy [11]. These methods, however, do not attain satisfactory performance under noisy conditions [12]. Thus, several more noise-robust methods have been proposed. For instance, the pitch estimation filter with amplitude compression (PEFAC) [13] tends to outperform both RAPT and YIN in terms of noise robustness. It analyses noisy signals in the log-frequency domain with a matched filter and normalisation by the universal long-term average speech spectrum. Nonetheless, it remains challenging to obtain satisfactory estimates of F0 at low signal-to-noise ratios (SNRs) such as 0 dB and below.

In addition to such real-time digital signal processing (DSP) methods, various machine learning approaches, for example those using Gaussian mixture models (GMMs) and hidden Markov models (HMMs) [14, 15], have been developed for noise-robust F0 estimation. Furthermore, recent research has successfully applied deep neural networks (DNNs) and their variants, e.g. convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to improve F0 estimation in severe noise conditions [6, 16, 17]. DNNs can represent arbitrarily complex mapping functions as long as their hidden layers comprise a sufficient number of units. Consequently, they enable statistical models to deal with higher-dimensional, more strongly correlated input features than the preceding approaches.
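To make the contrast with the learning-based methods concrete, the core idea behind time-domain trackers such as YIN can be illustrated in a few lines. The sketch below implements YIN's cumulative mean normalised difference function with its absolute-threshold step; the search range, threshold and frame length are illustrative assumptions, and refinements such as parabolic interpolation are omitted.

```python
import numpy as np

def yin_f0(frame, fs, fmin=60.0, fmax=400.0, threshold=0.1):
    """Estimate F0 of one frame via YIN's cumulative mean normalised
    difference function (illustrative sketch, integer-lag resolution)."""
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    lags = np.arange(1, tau_max + 1)
    # Difference function d(tau) = sum_t (x[t] - x[t+tau])^2
    d = np.array([np.sum((frame[:-tau] - frame[tau:]) ** 2) for tau in lags])
    # Cumulative mean normalised difference d'(tau)
    cmnd = d * lags / np.cumsum(d)
    tau = tau_min
    while tau < tau_max and cmnd[tau - 1] >= threshold:
        tau += 1                      # first lag below the threshold
    while tau < tau_max and cmnd[tau] < cmnd[tau - 1]:
        tau += 1                      # walk down to the local minimum
    return fs / tau

# A clean 200 Hz sinusoid yields an estimate at the 80-sample lag.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs    # one 40 ms frame
print(yin_f0(np.sin(2 * np.pi * 200 * t), fs))   # prints 200.0
```

On clean periodic signals such a tracker is accurate, but the difference function degrades quickly once additive noise dominates the waveform, which motivates the noise-robust methods discussed above.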
Recently, another technical trend in acoustic modelling has emerged since the remarkable achievement of WaveNet [18], which analyses time-domain waveforms directly instead of extracting spectral or cepstral features from speech. This has contributed not only to advances in speech synthesis but also to end-to-end modelling for various speech applications that do not require traditional Fourier analysis [18]. Direct analysis of waveforms is also beneficial for speech denoising, which usually combines noisy phase spectra with enhanced magnitude spectra to reconstruct clean speech [19].

In fact, the latest research has applied direct time-domain waveform analysis to F0 estimation with DNN-based [20] and CNN-based [21] approaches, showing improved noise robustness over both conventional real-time signal processing and the recent DNN-based spectral analysis. These state-of-the-art time-domain F0 estimators, however, still have a problem to be solved: they employ DNNs or CNNs to form a frame-by-frame classification model that decides a state corresponding to a quantised frequency. Although it is convenient to treat F0 tracking as a classification task, in the same manner as the alignment of senones in speech recognition, the resultant estimates of F0 contours have a limited frequency resolution determined by the number of quantised frequency states. This is a potential drawback in terms of F0 estimation accuracy.

This work is an extension of our recent preliminary study [22], in which we successfully employed an RNN regression model that maps a spectral sequence directly onto F0 values to overcome the above disadvantage of existing classification approaches. Relative to that preliminary study, the present paper introduces the following four major changes. First, we employ direct waveform inputs instead of a spectral sequence.
Second, we propose a novel encoding of the F0 information as a simple sinusoid oscillating at the ground-truth value of F0. This encoding enables our model to map raw speech waveforms to raw sinusoids without requiring either pre-processing or post-processing. Next, we extend our experiments to include very recent competitive methods that are also based on waveform input schemes [20, 21]. Finally, we augment the noise conditions of the experiments in order to examine noise robustness against a wider variety of noise types: the known-noise condition grows from six to eight noise types, while the unknown-noise condition grows from two to four types.

arXiv:1807.00752v1 [eess.AS] 2 Jul 2018
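The sinusoid encoding of a ground-truth F0 contour can be sketched as follows. This is an illustrative reconstruction, not the paper's exact recipe: the unvoiced-region handling, amplitude and per-sample interpolation of the contour are assumptions. The key property is that the target is a continuous-valued waveform, so, unlike a classification target, its frequency is not restricted to a grid of quantised states.

```python
import numpy as np

def f0_to_sinusoid(f0, fs):
    """Encode a per-sample F0 contour (Hz) as a single sinusoid by
    accumulating instantaneous phase: sin(2*pi * cumsum(f0) / fs).
    Unvoiced samples (f0 == 0) are emitted as silence here; this
    handling is an assumption for illustration."""
    phase = 2.0 * np.pi * np.cumsum(f0) / fs   # integrate frequency
    target = np.sin(phase)
    target[f0 == 0] = 0.0                      # assumed unvoiced handling
    return target

# Toy ground-truth contour: a glide from 150 Hz to 250 Hz over 0.5 s.
fs = 16000
f0 = np.linspace(150.0, 250.0, int(0.5 * fs))
target = f0_to_sinusoid(f0, fs)   # regression target for the network
```

A regression model trained against such targets can, in principle, output any frequency within the speech range, whereas a classifier's best-case error is floored at half the spacing between adjacent quantised frequency states.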