2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 17-20, 2021, New Paltz, NY

DF-CONFORMER: INTEGRATED ARCHITECTURE OF CONV-TASNET AND CONFORMER USING LINEAR COMPLEXITY SELF-ATTENTION FOR SPEECH ENHANCEMENT

Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani
Google Research

ABSTRACT

Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly determined by the structure of the mask prediction network. In this study, we aim to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network. To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers. We trained the model on 3,396 hours of noisy speech data, and show that (i) the use of linear complexity attention avoids high computational complexity, and (ii) our model achieves a higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.

Index Terms— Speech enhancement, Conv-TasNet, Conformer, dilated convolution, self-attention.

1. INTRODUCTION

Speech enhancement (SE) is the task of recovering target speech from a noisy signal [1]. In addition to its applications in telephony and video conferencing [2], single-channel SE is a basic component in larger systems, such as multi-channel SE [3, 4], multi-modal SE [5-8], and automatic speech recognition (ASR) [9-11] systems. Therefore, it is important to improve both the denoising performance and the computational efficiency of single-channel SE.
In recent years, rapid progress has been made on SE using deep neural networks (DNNs) [1]. Conv-TasNet [12] is a powerful model for SE that uses a combination of trainable analysis/synthesis filterbanks [13] and a mask prediction network built from stacked 1-D dilated depthwise convolution (1D-DDC) layers. Since the denoising performance and computational efficiency are mainly affected by the mask prediction network, one of the main research topics in SE is improving the mask prediction architecture [14-20]. For example, the improved time-dilated convolution network (TDCN++) [14, 15] extended Conv-TasNet to improve SE performance.

A promising candidate for improving mask prediction networks is the Conformer architecture. The Conformer [21] has been shown to be effective in ASR [21], diarization [22], and sound event detection [23, 24]. Conformer is derived from the Transformer [25] architecture by including 1-D depthwise convolution layers to enable more effective sequential modeling.

In this paper, we combine Conformer layers with the dilated convolution layers of the TDCN++ architecture. However, this introduces two critical problems related to the short window and hop sizes used in trainable analysis/synthesis filterbanks. The first problem is large computational cost, because the time-complexity of the multi-head self-attention (MHSA) in the Conformer has a quadratic dependence on sequence length. Second, the small hop size between neighboring time-frames reduces the temporal reach of sequential modeling when using temporal convolution layers. To make the model computationally feasible, we use a linear-complexity variant of self-attention in the Conformer, known as fast attention via positive orthogonal random features (FAVOR+), as used in Performer [26].
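To make the linear-complexity idea concrete, the following is a minimal numpy sketch of FAVOR+-style attention: softmax attention is approximated with positive random features phi(x) = exp(w·x - ||x||²/2)/sqrt(m), so that attention is computed as phi(Q)(phi(K)ᵀV) in time linear in the sequence length N rather than quadratic. This is an illustrative sketch, not the Performer implementation; in particular, the orthogonalization of the random projection rows used in the actual FAVOR+ mechanism is omitted here for brevity.

```python
import numpy as np

def favor_plus_attention(Q, K, V, num_features=64, seed=0):
    """Linear-complexity approximation of softmax attention (FAVOR+ style).

    Uses positive random features phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m),
    so softmax(Q K^T / sqrt(d)) V is approximated without ever forming the
    N x N attention matrix. Cost is O(N * m * d) instead of O(N^2 * d).
    """
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    # Random projection rows; the Performer paper additionally
    # orthogonalizes these (omitted in this sketch).
    W = rng.standard_normal((num_features, d))
    scale = d ** -0.25  # fold the 1/sqrt(d) attention scaling into Q and K

    def phi(X):
        X = X * scale
        proj = X @ W.T  # (N, m)
        return np.exp(proj - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) \
            / np.sqrt(num_features)

    Qf, Kf = phi(Q), phi(K)          # (N, m) feature maps
    KV = Kf.T @ V                    # (m, d): linear in N
    Z = Qf @ Kf.sum(axis=0)          # (N,): per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

One sanity check: when all keys are identical, exact softmax attention assigns uniform weights, and this approximation also returns exactly the mean of the value rows, independent of the random features.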
These ideas are partly inspired by the local-global network for speaker diarization using a time-dilated convolution network (TDCN) [22], which shows that the combination of linear complexity self-attention and a TDCN improves both local and global sequential modeling. We show in experiments below that the resulting model, which we call the dilated FAVOR Conformer (DF-Conformer), achieves better enhancement fidelity than a TDCN++ of comparable complexity.

2. PRELIMINARIES

2.1. Conv-TasNet and its extensions for speech enhancement

Let the T-sample time-domain observation x ∈ R^T be a mixture of a target speech s and noise n, as x = s + n, where n is assumed to be environmental noise and does not include interfering speech signals. The goal of SE is to recover s from x.

In mask-based SE, a mask is estimated by a mask prediction network and applied to the representation of x produced by an encoder; the estimated signal y ∈ R^T is then re-synthesized by a decoder. The enhancement procedure can be written as

    y = Dec(Enc(x) ⊙ M(Enc(x)))    (1)

where Enc : R^T → R^{N×De} and Dec : R^{N×De} → R^T are the signal encoder and decoder, respectively, De is the encoder output dimension, ⊙ is element-wise multiplication, and M : R^{N×De} → [0, 1]^{N×De} is the mask prediction network. Early studies used the short-time Fourier transform (STFT) and the inverse STFT (iSTFT) as encoder and decoder [9, 27], respectively. More recent studies use a trainable encoder/decoder [12], which are often called trainable "filterbanks" [28], e.g. in Asteroid [13].

One of the main research topics in SE is the design of the network architecture of M, because the performance and computational efficiency of SE are mainly determined by the structure of M. Conv-TasNet [12] is a powerful model for speech separation and SE whose M consists of stacked 1D-DDC layers. TDCN++ [14, 15] is an extension of Conv-TasNet.
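The pipeline of Eq. (1) can be sketched end to end in numpy. The sketch below is a toy stand-in, not a trained system: the learned filterbank is replaced by a random linear basis B applied to overlapping frames, the decoder by the pseudo-inverse of B with overlap-add, and the mask network M by a caller-supplied function. With an all-ones mask the chain reduces to Dec(Enc(x)), which reconstructs x when the basis has full row rank.

```python
import numpy as np

def frame(x, win, hop):
    """Slice x into overlapping frames of length `win` with step `hop`."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop: i * hop + win] for i in range(n)])  # (N, win)

def overlap_add(frames, hop, length):
    """Reassemble frames, averaging samples covered by several frames."""
    win = frames.shape[1]
    y = np.zeros(length)
    count = np.zeros(length)
    for i, f in enumerate(frames):
        y[i * hop: i * hop + win] += f
        count[i * hop: i * hop + win] += 1.0
    return y / np.maximum(count, 1e-8)

def enhance(x, B, mask_fn, win=16, hop=8):
    """y = Dec(Enc(x) * M(Enc(x))), Eq. (1), with a linear basis B as a
    stand-in for the trainable filterbank (B: (win, De), De >= win)."""
    F = frame(x, win, hop)            # framed signal, (N, win)
    E = F @ B                         # Enc(x): (N, De)
    M = mask_fn(E)                    # mask in [0, 1]^{N x De}
    D = (E * M) @ np.linalg.pinv(B)   # Dec via the pseudo-inverse basis
    return overlap_add(D, hop, len(x))
```

For example, with a random Gaussian B of shape (16, 32) and mask_fn returning all ones, `enhance` returns the input signal up to numerical precision, which checks that the encoder/decoder pair is consistent before any mask network is trained.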
The main differences between TDCN++ and Conv-TasNet are the use of instance norm instead of global layer norm and the addition of explicit scale parameters after each dense layer. The pseudo-code for M in the

arXiv:2106.15813v2 [eess.AS] 5 Aug 2021
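The 1D-DDC layers that make up M in Conv-TasNet and TDCN++, and their temporal reach, can be illustrated with a short numpy sketch. This is not the actual Conv-TasNet/TDCN++ implementation (which adds pointwise convolutions, nonlinearities, normalization, and residual connections around each depthwise layer); it shows only the dilated depthwise convolution itself and the receptive-field formula 1 + (K - 1) * sum(d_i) for a stack with kernel size K and dilations d_i, which quantifies how small hop sizes limit the temporal reach discussed above.

```python
import numpy as np

def dilated_depthwise_conv1d(X, kernels, dilation):
    """Depthwise 1-D convolution with 'same' padding.

    X:       (C, T) input, one time series per channel.
    kernels: (C, K) one kernel per channel (depthwise: no channel mixing).
    """
    C, T = X.shape
    K = kernels.shape[1]
    pad = dilation * (K - 1) // 2            # centered, K assumed odd
    Xp = np.pad(X, ((0, 0), (pad, pad)))
    out = np.zeros_like(X)
    for k in range(K):
        # Tap k reaches dilation*(k - (K-1)/2) samples away from center.
        out += kernels[:, k:k + 1] * Xp[:, k * dilation: k * dilation + T]
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of a stack of dilated conv layers."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With kernel size 3 and dilations 1, 2, 4, ..., 128 (a typical exponential schedule), the stack sees 511 frames; at a small filterbank hop this corresponds to a short absolute time span, which is one motivation for adding self-attention for global context.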