Received 14 September 2023, accepted 7 October 2023, date of publication 12 October 2023, date of current version 19 October 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3324210

Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement

NASIR SALEEM 1,2, TEDDY SURYA GUNAWAN 2,3 (Senior Member, IEEE), MUHAMMAD SHAFI 4, SAMI BOUROUIS 5, AND AYMEN TRIGUI 6

1 Department of Electrical Engineering, Faculty of Engineering and Technology, Gomal University, Dera Ismail Khan 29050, Pakistan
2 Information Systems Department, International Islamic University Malaysia (IIUM), Kuala Lumpur 53100, Malaysia
3 School of Electrical Engineering, Telkom University, Bandung 40257, Indonesia
4 Faculty of Computing and Information Technology, Sohar University, Sohar 311, Oman
5 Department of Information Technology, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
6 Department of Computer Science, College of Computer Science, King Khalid University, Abha 61421, Saudi Arabia

Corresponding author: Nasir Saleem (nasirsaleem@gu.edu.pk)

This work was supported by the Deanship of Scientific Research, Taif University.

The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.

ABSTRACT The convolutional encoder-decoder (CED) has emerged as a powerful architecture, particularly in speech enhancement (SE), which aims to improve the intelligibility and quality of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. CED architectures usually use a gated recurrent unit (GRU) or long short-term memory (LSTM) as a bottleneck to capture temporal dependencies, enabling an SE model to learn the dynamics and long-term temporal dependencies of the speech signal. However, Transformer neural networks with self-attention capture long-term temporal dependencies more effectively. This study proposes a multi-attention bottleneck (MAB) comprising a self-attention Transformer powered by a time-frequency attention (TFA) module followed by a channel attention module (CAM) to focus on the important features. The proposed bottleneck is integrated into a CED architecture, and the resulting model is named MAB-CED. The MAB-CED uses an encoder-decoder structure with a shared encoder and two decoders, where one decoder is dedicated to spectral masking and the other to spectral mapping. Convolutional gated linear units (ConvGLU) and deconvolutional gated linear units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are coupled by coherent averaging to synthesize the enhanced speech signal. The proposed speech enhancement method is evaluated on two databases, VoiceBank+DEMAND and LibriSpeech. The results show that the proposed method outperforms the benchmarks in terms of intelligibility and quality at various input SNRs: MAB-CED improves the average PESQ by 0.55 (22.85%) with VoiceBank+DEMAND and by 0.58 (23.79%) with LibriSpeech, and the average STOI by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.

INDEX TERMS Multi-attention, time-frequency attention, channel attention, transformer, speech enhancement, gated convolutional encoder-decoder.
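To fix the ideas named in the abstract before the detailed description, the following minimal PyTorch sketch illustrates a ConvGLU layer and the coherent averaging that couples the masking and mapping decoder outputs. This is an illustrative sketch, not the authors' implementation: the kernel size, stride, and helper names are assumptions, the two decoder outputs are simulated with random tensors, and coherent averaging of complex spectra is approximated here with real-valued tensors for brevity.

```python
# Illustrative sketch only, not the authors' implementation. Layer
# hyperparameters (kernel, stride) and helper names are assumptions.
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """2-D convolution with a gated linear unit: features * sigmoid(gate)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 1), padding=(1, 1)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)  # feature path
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)  # gating path

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))

def coherent_average(masked_spec, mapped_spec):
    """Couple the two decoder outputs by averaging their spectra."""
    return 0.5 * (masked_spec + mapped_spec)

# Toy forward pass on a (batch, channel, frequency, time) noisy spectrogram.
noisy = torch.randn(4, 1, 161, 100)
features = ConvGLU(1, 16)(noisy)                # shared-encoder features
# Stand-ins for the two decoder outputs (real decoders would use DeconvGLU):
masked = torch.sigmoid(torch.randn_like(noisy)) * noisy   # masking decoder
mapped = torch.randn_like(noisy)                          # mapping decoder
enhanced = coherent_average(masked, mapped)
print(features.shape, enhanced.shape)
```

The GLU gating lets the network learn to suppress uninformative time-frequency regions in the encoder features, which is one motivation for preferring ConvGLU/DeconvGLU blocks over plain convolutions in SE front-ends.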
I. INTRODUCTION
Speech enhancement (SE) is a research area in audio signal processing that improves the intelligibility and quality of speech deteriorated by various noise sources and interference. With the increasing demand for high-quality communication systems in diverse environments, such as voice assistants, hearing aids, teleconferencing, and telecommunications, speech enhancement improves overall speech intelligibility and the user experience. The speech enhancement process involves advanced algorithms and signal processing techniques, which leverage spectral [1], [2] and statistical