International Journal of Speech Technology (2022) 25:79–88
https://doi.org/10.1007/s10772-022-09961-0

Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture

Fady K. Fahmy · Hazem M. Abbas · Mahmoud I. Khalil

Received: 3 November 2020 / Accepted: 4 January 2022 / Published online: 8 February 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
End-to-end speech synthesis methods have achieved nearly natural, human-like speech, yet they remain prone to synthesis errors such as missing or repeated words and incomplete synthesis. We argue this is mainly due to the local information preference between the text input and the learned acoustic features of a conditional autoregressive (CAR) model. The local information preference prevents the model from depending on the text input when predicting acoustic features, and it contributes to synthesis errors at inference time. In this work, we compare two modified architectures based on Tacotron2 for generating Arabic speech. The first architecture replaces the WaveNet vocoder with a flow-based implementation, WaveGlow. The second architecture, influenced by InfoGAN, maximizes the mutual information between the text input and the predicted acoustic features (mel-spectrogram) to eliminate the local information preference. The training objective is also changed by adding a CTC loss term, which can be considered a metric of the local information preference between the text input and the predicted acoustic features. We carried out the experiments on Nawar Halabi's dataset (http://en.arabicspeechcorpus.com/), which contains about 2.41 h of Arabic speech.
Our experiments show that maximizing the mutual information between the predicted acoustic features and the conditional text input, together with the change to the training objective, enhances the subjective quality of the generated speech and reduces the utterance error rate.

Keywords Tacotron 2 · WaveGlow · InfoGAN · Arabic text-to-speech · Speech synthesis · Deep learning · Neural networks

1 Introduction

Speech synthesis remains a hard task despite several decades of work. Conventional text-to-speech (TTS) systems are often complex, made up of several components connected in a pipeline that may include text analysis front-ends, acoustic models, and speech synthesis models. Building each component requires considerable labour and domain expertise. In addition, the components are trained separately, so the errors of each component may cascade to later stages and compound at the final stages. Take concatenative speech synthesis as an example: building such a system requires (a) text cleaning and normalization; (b) syllabification and lexical stress prediction, by dividing a word's phonetic representation into syllables and marking lexical stress; (c) statistical and semantic analysis; (d) defining a unit size, which may vary from one to three units, and automatically aligning and pre-segmenting a recorded voice database into basic building units; (e) joining the selected units while taking care to reduce perceptual acoustic artifacts; and (f) ensuring that the resulting output waveform reflects the pitch, duration, and energy values of the prosody targets. End-to-end neural network architectures alleviate much of the labour needed to synthesize speech. Tacotron1 (Wang et al., 2017) is an example of an end-to-end speech synthesis architecture.
It is a generative encoder-decoder architecture with attention (Sutskever et al., 2014), taking a sequence of characters as input and generating an audio waveform. Tacotron1 uses the content-based attention mechanism described in Bahdanau et al. (2015). Tacotron2 (Shen et al., 2017) is a natural evolution of Tacotron1. It offers a unified, purely neural network approach to address the limitations of Tacotron1 and enhance the subjective quality of

* Fady K. Fahmy
Hazem M. Abbas
hazem.abbas@eng.asu.edu.eg
Mahmoud I. Khalil
mahmoud.khalil@eng.asu.edu.eg
1 Department of Computer and Systems Engineering, Ain Shams University, Cairo, Egypt
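As background for the content-based attention mechanism referenced above, the following is a minimal NumPy sketch of additive (Bahdanau-style) attention: the decoder state is scored against every encoder output, the scores are softmax-normalized over time, and the encoder outputs are averaged with those weights into a context vector. The function and parameter names (`bahdanau_attention`, `W_q`, `W_k`, `v`) are illustrative, not taken from the paper or any specific implementation.

```python
import numpy as np

def bahdanau_attention(query, keys, W_q, W_k, v):
    """Content-based (additive) attention in the style of Bahdanau et al. (2015).

    query: decoder state, shape (d_dec,)
    keys:  encoder outputs, shape (T, d_enc)
    W_q:   (d_dec, d_att), W_k: (d_enc, d_att), v: (d_att,) - learned in practice
    Returns (context, weights) with shapes (d_enc,) and (T,).
    """
    # One scalar score per encoder time step, comparing it to the decoder state.
    scores = np.tanh(query @ W_q + keys @ W_k) @ v      # (T,)
    # Softmax over the time axis (shifted by the max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of encoder outputs.
    context = weights @ keys                            # (d_enc,)
    return context, weights
```

In Tacotron-style models this alignment is computed at every decoder step, so the weights should gradually sweep monotonically across the input characters as speech is generated.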