Neural Comb Filtering using Sliding Window Attention Network for Speech Enhancement

P Venkatesh*, A Sivaganesh and K Sri Rama Murty

Abstract—In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN) based speech enhancement. We propose a sliding-window attention network to regress the spectral magnitude mask (SMM) from the noisy speech signal. Even though the network parameters can be estimated by minimizing the mask loss alone, this does not restore the pitch harmonics, especially at higher frequencies. In this paper, we propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. The network parameters are estimated using a combination of the mask loss and the cepstral loss. The proposed network architecture functions like an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The proposed approach achieves performance comparable to state-of-the-art methods at much lower computational complexity¹.

Index Terms—Spectral magnitude mask, transformer, pitch harmonics, cepstral pitch peak, production-related loss.

I. INTRODUCTION

Prolonged exposure to noisy speech signals causes severe fatigue to listeners [1]. Speech enhancement (SE) algorithms aim to improve the quality and intelligibility of a noisy speech signal that has been degraded by background noise [2]. SE plays an important role in human-human communication over mobile/radio channels, in hearing aids, and in cochlear implants [3]. It is also a crucial first step for voice assistants in domestic environments with multiple noise sources such as televisions, microwave ovens, and competing speakers. Even though several methods have been proposed in the literature, SE remains a challenging problem, especially for unseen speakers and noises.
SE algorithms can be broadly classified into two categories:

• Methods that estimate the noise characteristics in order to suppress the noise.
• Methods that estimate the speech characteristics in order to emphasize the speech.

Most statistical SE algorithms rely on adaptively estimating the noise component and subtracting it from the noisy signal [4]–[6]. Deep neural network (DNN) approaches, on the other hand, rely on learning the structure of the speech signal through a non-linear mapping between noisy and clean speech signals [7]. As DNN approaches pose SE as a supervised learning task, a variety of noises at different signal-to-noise ratios (SNRs) can be used during training. Hence, DNN approaches perform much better than statistical approaches under unseen noises and low-SNR conditions [8], [9]. However, the superior performance of DNNs comes at the expense of huge computational complexity. There is a need to develop efficient low-complexity DNN architectures for low-power applications [10].

The authors are with the Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Hyderabad - 502285, Telangana, India. Emails: parvathalavenkatesh123@gmail.com, ee18resch11020@iith.ac.in and ksrm@ee.iith.ac.in

¹Audio samples: https://siplab-iith.github.io/SWAN

The network architecture and loss function play an important role in efficiently capturing the signal characteristics. Convolutional and recurrent architectures have been commonly used in both the time and frequency domains for SE [11]–[13]. However, convolutional networks require more layers to increase the receptive field, while recurrent networks are not amenable to parallel processing. Recent advances in transformer architectures offer compact models that capture long-term dependencies through an explicit attention mechanism [14]. Kim et al. proposed a transformer with Gaussian-weighted self-attention (TGSA) for SE [15]. Subsequently, Wang et al.
proposed a two-stage transformer neural network (TSTNN) in the time domain for SE [16]. Although TSTNN is a compact model, it requires substantial computation because it operates on every sample in the time domain. In this paper, we propose a sliding window attention network (SWAN) to estimate the spectral mask in the frequency domain. As the proposed architecture operates on frames of the speech signal, it requires significantly less computation than TSTNN.

Frequency-domain approaches rely on estimating a spectral mask from the noisy speech spectrum. The network parameters are updated using a combination of a mask approximation loss and a signal approximation loss [17]. State-of-the-art frequency-domain approaches to SE use perception-related loss functions for signal approximation [18]–[20]. Although perception-related loss functions, operating in the Mel/Bark domains, are good at approximating the lower frequency bands, they do not perform well in the high-frequency bands. In this paper, we propose a production-related loss function for signal approximation.

The speech signal exhibits high-SNR regions in both the time and frequency domains. The regions around the glottal closure instants (GCIs) exhibit higher amplitudes in the time domain, and hence they are less vulnerable to degradation [21]. As the periodicity of the GCIs transforms into pitch harmonics in the frequency domain, adaptive comb filtering has been used in the literature to enhance them [22], [23]. In this work, we propose a novel loss function to better approximate the pitch harmonics in the spectral mask estimated using a DNN. The estimated spectral mask resembles a comb filter and enhances the noisy speech spectrum across the frequency bands.

The rest of the paper is organized as follows: Section II presents the proposed transformer architecture
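To make the two production-related quantities above concrete, the following sketch illustrates an ideal spectral magnitude mask for one frame and the cepstral pitch peak of a voiced frame. This is a minimal NumPy illustration, not the paper's implementation; the frame length, FFT size, quefrency search range, and the synthetic 100 Hz voiced frame are assumptions made for the example:

```python
import numpy as np

def spectral_magnitude_mask(clean_frame, noisy_frame, n_fft=512):
    """Ideal SMM for one frame: |S(f)| / |Y(f)|, clipped to [0, 1]."""
    S = np.abs(np.fft.rfft(clean_frame, n_fft))
    Y = np.abs(np.fft.rfft(noisy_frame, n_fft))
    return np.clip(S / (Y + 1e-8), 0.0, 1.0)

def cepstral_pitch_peak(frame, fs, n_fft=512, fmin=60.0, fmax=400.0):
    """Real cepstrum of a windowed frame; return (pitch in Hz, peak quefrency).

    For voiced speech, the pitch shows up as a peak in the quefrency
    range fs/fmax .. fs/fmin samples.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    cep = np.fft.irfft(np.log(spec + 1e-8))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    peak = qmin + np.argmax(cep[qmin:qmax])
    return fs / peak, peak

# Synthetic voiced frame: a stack of harmonics of 100 Hz at fs = 8 kHz,
# so the cepstral peak is expected near quefrency fs/100 = 80 samples.
fs = 8000
t = np.arange(512) / fs
frame = sum(np.cos(2 * np.pi * 100 * k * t) for k in range(1, 20))
pitch, quefrency = cepstral_pitch_peak(frame, fs)
```

Minimizing a loss around this cepstral peak, as proposed above, penalizes exactly the harmonic structure that a plain mask loss tends to miss at higher frequencies.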