2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2019, New Paltz, NY
JOINT SINGING PITCH ESTIMATION AND VOICE SEPARATION BASED ON A NEURAL HARMONIC STRUCTURE RENDERER

Tomoyasu Nakano¹, Kazuyoshi Yoshii², Yiming Wu², Ryo Nishikimi², Kin Wah Edward Lin¹, Masataka Goto¹

¹ National Institute of Advanced Industrial Science and Technology (AIST), Japan
{t.nakano, edward.lin, m.goto}@aist.go.jp
² Kyoto University, Japan
{yoshii, wu, nishikimi}@sap.ist.i.kyoto-u.ac.jp
ABSTRACT

This paper describes a multi-task learning approach to the joint extraction (fundamental frequency (F0) estimation) and separation of singing voices from music signals. While deep neural networks have been used successfully for each task, the two tasks have not been dealt with simultaneously in the context of deep learning. Since vocal extraction and separation are considered to have a mutually beneficial relationship, we propose a unified network that consists of a deep convolutional neural network for vocal F0 saliency estimation and a U-Net with an encoder shared by two decoders specialized for separating the vocal and accompaniment parts, respectively. Between these two networks we introduce a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of the harmonic partials of a singing voice. The physical meaning of harmonic structure is thus reflected in the network architecture. The harmonic masks are then effectively used as scaffolds for estimating fine-structured masks, thanks to the excellent capability of the U-Net for domain-preserving conversion (e.g., image-to-image conversion). The whole network can be trained jointly by backpropagation. Experimental results showed that the proposed unified network outperformed the conventional independent networks for vocal extraction and separation.

Index Terms— Melody extraction, F0 estimation, singing voice separation, deep learning, multi-task learning
1. INTRODUCTION

A singing voice is one of the most influential elements of music [1]. Accordingly, the estimation of its fundamental frequency (F0) (a.k.a. vocal extraction or melody extraction) [2] and singing voice separation (a.k.a. vocal separation) [3] have been actively investigated in the field of music information retrieval (MIR). State-of-the-art studies have successfully used deep neural networks (DNNs) for vocal extraction [4–7] and separation [8–16]. Bittner et al. [5], for example, proposed a multi-F0 estimation method based on a deep convolutional neural network (CNN) that estimates an F0 saliency spectrogram from a music spectrogram in the constant-Q transform (CQT) domain, and they applied that method to vocal extraction. Jansson et al. [12] used a deep CNN variant with skip connections, called a U-Net [17], to estimate a soft mask spectrogram for separating a vocal spectrogram from a music spectrogram in the short-time Fourier transform (STFT) domain.
This work was supported in part by JST ACCEL Grant Number JPMJAC1602 and JSPS KAKENHI Grant Numbers JP17K12721 and 19H04137.
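As background, soft-mask separation in the STFT domain amounts to elementwise weighting of the mixture magnitude spectrogram. The following is a minimal NumPy sketch of that masking step only (not the U-Net of [12], whose mask-estimating architecture is described in the cited papers); the function name and the complementary-mask convention are assumptions for illustration:

```python
import numpy as np

def apply_soft_mask(mix_mag, vocal_mask):
    """Separate vocals by elementwise masking of the mixture magnitude.

    mix_mag:    (freq, time) magnitude spectrogram of the mixture.
    vocal_mask: (freq, time) soft mask in [0, 1], e.g. a network output.
    Returns estimated vocal and accompaniment magnitudes. The mixture
    phase is typically reused when inverting the STFT.
    """
    vocal = vocal_mask * mix_mag
    accomp = (1.0 - vocal_mask) * mix_mag  # complementary mask (assumption)
    return vocal, accomp
```

With complementary masks, the two estimates sum back to the mixture magnitude by construction.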
[Figure 1 diagram: block labels include an HCQT input and an STFT input (2048 frequency bins × 512 frames), an F0 salience estimation network producing an F0 saliency map (360 × 512), a neural harmonic structure renderer producing a harmonic structure representation, and a vocal/accompaniment separation network producing a vocal mask and an accompaniment mask (2048 × 512 each).]
Figure 1: Our multi-task learning architecture consisting of a CNN for vocal extraction and another CNN for vocal separation, between which a neural harmonic structure renderer converts an estimated F0 saliency spectrogram into a harmonic spectrogram in a differentiable manner for guiding vocal separation.
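The core idea of such a renderer can be sketched as follows: on a log-frequency axis, the h-th harmonic of an F0 lies log2(h) octaves above it, so saliency can be spread onto harmonic bins by fixed index shifts, which keeps the operation differentiable. This is a minimal NumPy illustration under assumed parameters (60 bins per octave, 5 partials, hard clipping), not the paper's exact formulation:

```python
import numpy as np

BINS_PER_OCTAVE = 60   # assumed log-frequency resolution (360 bins over 6 octaves)
N_BINS = 360
N_HARMONICS = 5        # number of partials to render (assumption)

def render_harmonic_mask(saliency):
    """Spread each F0 saliency bin onto the bins of its harmonic partials.

    saliency: (N_BINS, n_frames) array of F0 saliency in [0, 1].
    Returns a harmonic mask of the same shape. Since the operation is a
    sum of fixed index shifts, gradients flow back to the saliency.
    """
    mask = np.zeros_like(saliency)
    for h in range(1, N_HARMONICS + 1):
        # harmonic h sits log2(h) octaves above the F0 on the log-frequency axis
        offset = int(round(BINS_PER_OCTAVE * np.log2(h)))
        if offset < N_BINS:
            mask[offset:] += saliency[: N_BINS - offset]
    return np.clip(mask, 0.0, 1.0)
```

A subsequent (learned) frequency-scale conversion would still be needed to map such a log-frequency mask onto the linear STFT bins used by the separation network.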
The mutually beneficial relationship between vocal extraction and separation has recently been leveraged for improving the performances of both tasks. Cabañas-Molero et al. [18], for example, proposed a three-step method that performs rough vocal separation based on stereo information, autocorrelation-based vocal extraction, and F0-based vocal separation. Hsu et al. [19] proposed a tandem algorithm that iterates vocal extraction and separation based on signal processing techniques. To mitigate the error propagation problem of such a cascading approach, Durrieu et al. [20] took a machine-learning approach to joint vocal extraction and separation based on source-filter nonnegative matrix factorization (NMF). Mutually beneficial integration of DNN-based vocal extraction and separation, however, has not been achieved yet.
In this paper we propose a unified DNN that effectively combines the deep CNN [5] with the U-Net [12] for joint vocal extraction and separation (Fig. 1). A basic way of connecting these two networks is to warp the frequency scale of an F0 saliency spectrogram estimated by the CNN, stack it on a mixture spectrogram, and feed the two-channel spectrogram into the U-Net. This approach, however, does not incorporate the physical meaning of an F0, i.e., the fundamental knowledge that an F0 indicates an interval between equally spaced harmonic partials, into the unified DNN. An essential research question here is how to parameterize such knowledge
978-1-7281-1123-0/19/$31.00 ©2019 IEEE