2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY
JOINT SINGING PITCH ESTIMATION AND VOICE SEPARATION BASED ON A NEURAL HARMONIC STRUCTURE RENDERER

Tomoyasu Nakano¹, Kazuyoshi Yoshii², Yiming Wu², Ryo Nishikimi², Kin Wah Edward Lin¹, Masataka Goto¹

¹ National Institute of Advanced Industrial Science and Technology (AIST), Japan
{t.nakano, edward.lin, m.goto}@aist.go.jp
² Kyoto University, Japan
{yoshii, wu, nishikimi}@sap.ist.i.kyoto-u.ac.jp
ABSTRACT

This paper describes a multi-task learning approach to the joint extraction (fundamental frequency (F0) estimation) and separation of singing voices from music signals. While deep neural networks have been used successfully for each task, the two tasks have not been dealt with simultaneously in the context of deep learning. Since vocal extraction and separation are considered to have a mutually beneficial relationship, we propose a unified network that consists of a deep convolutional neural network for vocal F0 saliency estimation and a U-Net with an encoder shared by two decoders specialized for separating the vocal and accompaniment parts, respectively. Between these two networks we introduce a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of the harmonic partials of a singing voice. The physical meaning of harmonic structure is thus reflected in the network architecture. The harmonic masks are then effectively used as scaffolds for estimating fine-structured masks, thanks to the excellent capability of the U-Net for domain-preserving conversion (e.g., image-to-image conversion). The whole network can be trained jointly by backpropagation. Experimental results showed that the proposed unified network outperformed the conventional independent networks for vocal extraction and separation.

Index Terms— Melody extraction, F0 estimation, singing voice separation, deep learning, multi-task learning
1. INTRODUCTION

A singing voice is one of the most influential elements of music [1]. Accordingly, the estimation of its fundamental frequency (F0) (a.k.a. vocal extraction or melody extraction) [2] and singing voice separation (a.k.a. vocal separation) [3] have been actively investigated in the field of music information retrieval (MIR). State-of-the-art studies have successfully used deep neural networks (DNNs) for vocal extraction [4–7] and separation [8–16]. Bittner et al. [5], for example, proposed a multi-F0 estimation method based on a deep convolutional neural network (CNN) that estimates an F0 saliency spectrogram from a music spectrogram in the constant-Q transform (CQT) domain, and they applied that method to vocal extraction. Jansson et al. [12] used a deep CNN variant with skip connections, called a U-Net [17], to estimate a soft mask spectrogram for separating a vocal spectrogram from a music spectrogram in the short-time Fourier transform (STFT) domain.
This work was supported in part by JST ACCEL Grant Number JPMJAC1602 and JSPS KAKENHI Grant Numbers JP17K12721 and 19H04137.
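As background, soft-mask separation in the STFT domain amounts to elementwise weighting of the mixture magnitude spectrogram. The following is a minimal NumPy sketch of that masking step only (not the U-Net of [12], whose mask-estimating architecture is described in the cited papers); the function name and the complementary-mask convention are assumptions for illustration:

```python
import numpy as np

def apply_soft_mask(mix_mag, vocal_mask):
    """Separate vocals by elementwise masking of the mixture magnitude.

    mix_mag:    (freq, time) magnitude spectrogram of the mixture.
    vocal_mask: (freq, time) soft mask in [0, 1], e.g. a network output.
    Returns estimated vocal and accompaniment magnitudes. The mixture
    phase is typically reused when inverting the STFT.
    """
    vocal = vocal_mask * mix_mag
    accomp = (1.0 - vocal_mask) * mix_mag  # complementary mask (assumption)
    return vocal, accomp
```

With complementary masks, the two estimates sum back to the mixture magnitude by construction.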
[Figure 1 diagram: block labels include an HCQT input and an STFT input (2048 frequency bins × 512 frames), an F0 salience estimation network producing an F0 saliency map (360 × 512), a neural harmonic structure renderer producing a harmonic structure representation, and a vocal/accompaniment separation network producing a vocal mask and an accompaniment mask (2048 × 512 each).]
Figure 1: Our multi-task learning architecture consisting of a CNN for vocal extraction and another CNN for vocal separation, between which a neural harmonic structure renderer converts an estimated F0 saliency spectrogram into a harmonic spectrogram in a differentiable manner for guiding vocal separation.
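The core idea of such a renderer can be sketched as follows: on a log-frequency axis, the h-th harmonic of an F0 lies log2(h) octaves above it, so saliency can be spread onto harmonic bins by fixed index shifts, which keeps the operation differentiable. This is a minimal NumPy illustration under assumed parameters (60 bins per octave, 5 partials, hard clipping), not the paper's exact formulation:

```python
import numpy as np

BINS_PER_OCTAVE = 60   # assumed log-frequency resolution (360 bins over 6 octaves)
N_BINS = 360
N_HARMONICS = 5        # number of partials to render (assumption)

def render_harmonic_mask(saliency):
    """Spread each F0 saliency bin onto the bins of its harmonic partials.

    saliency: (N_BINS, n_frames) array of F0 saliency in [0, 1].
    Returns a harmonic mask of the same shape. Since the operation is a
    sum of fixed index shifts, gradients flow back to the saliency.
    """
    mask = np.zeros_like(saliency)
    for h in range(1, N_HARMONICS + 1):
        # harmonic h sits log2(h) octaves above the F0 on the log-frequency axis
        offset = int(round(BINS_PER_OCTAVE * np.log2(h)))
        if offset < N_BINS:
            mask[offset:] += saliency[: N_BINS - offset]
    return np.clip(mask, 0.0, 1.0)
```

A subsequent (learned) frequency-scale conversion would still be needed to map such a log-frequency mask onto the linear STFT bins used by the separation network.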
The mutually beneficial relationship between vocal extraction and separation has recently been leveraged for improving the performances of both tasks. Cabañas-Molero et al. [18], for example, proposed a three-step method that performs rough vocal separation based on stereo information, autocorrelation-based vocal extraction, and F0-based vocal separation. Hsu et al. [19] proposed a tandem algorithm that iterates vocal extraction and separation based on signal processing techniques. To mitigate the error propagation problem of such a cascading approach, Durrieu et al. [20] took a machine-learning approach to joint vocal extraction and separation based on source-filter nonnegative matrix factorization (NMF). Mutually beneficial integration of DNN-based vocal extraction and separation, however, has not been achieved yet.
In this paper we propose a unified DNN that effectively combines the deep CNN [5] with the U-Net [12] for joint vocal extraction and separation (Fig. 1). A basic way of connecting these two networks is to warp the frequency scale of an F0 saliency spectrogram estimated by the CNN, stack it on a mixture spectrogram, and feed the two-channel spectrogram into the U-Net. This approach, however, does not incorporate the physical meaning of an F0, i.e., the fundamental knowledge that an F0 indicates an interval between equally spaced harmonic partials, into the unified DNN. An essential research question here is how to parameterize such knowledge
978-1-7281-1123-0/19/$31.00 ©2019 IEEE