Topics in Intelligent Computing and Industry Design (ICID) 2(2) (2020) 113-116
DOI: 10.26480/etit.02.2020.113.116

REVIEW OF TEXT TO SPEECH USING DEEP CONVOLUTION

Aditya Pandya, Abhishek Bhole, Arnav Shrivastava, Mrs. Vineeta Rathore
Medi-Caps University, Indore (M.P.), India
*Corresponding Author Email: rafebarber7203@gmail.com

This is an open access article distributed under the Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ARTICLE DETAILS
Article History: Received 25 October 2020; Accepted 26 November 2020; Available online 03 December 2020

ABSTRACT
Speech is one of the oldest and most natural media of human interaction, so it is understandable that there is a need today to make our computers interact using speech as well. In recent times, recurrent neural networks (RNNs) have become the dominant technique for modeling sequential data. However, training an RNN is time-consuming: a typical RNN can take anywhere from a few days to a few weeks to train, and it demands notoriously powerful hardware; the weaker the hardware, the longer training takes. According to recent studies, convolutional neural networks (CNNs) are easier and cheaper to train because they are highly parallelizable.
This paper aims at analyzing and reviewing a novel TTS technique that uses a CNN instead of an RNN and can alleviate the massive economic cost of training RNN-based models. In this experiment, an ordinary gaming PC was used to train the network. Training took considerably less time than its recurrent counterpart, and the quality of the resulting speech was acceptable.

KEYWORDS
TTS, Mel Spectrogram, FT (Fourier Transform), STFT (Short-Time Fourier Transform)

1. INTRODUCTION

TTS, or Text-To-Speech, is a growing necessity in a world that is moving rapidly towards a new and technologically better future. To promote the further development of TTS, it is necessary to create reliable, maintainable, and flexible TTS components that are accessible to people without a technical background, to business individuals, and to small organizations that do not have computers capable of performing complex operations for long periods of time.

Traditional TTS systems are not very user friendly, as they are often an integration of several task-specific modules that are not easy for a novice to operate. For example, a traditional TTS system would comprise a module for text analysis, a module that generates spectrograms, and another module that uses all this information to generate a waveform as output.

Deep learning largely helps us unite these building blocks into a single model (or a few models, depending on the functionality) that directly maps input to output; this is called 'end-to-end' learning. The idea is to demonstrate the usefulness of and need for a novel, concise, and efficient neural TTS that is fully convolutional. The architecture of this TTS is largely similar to Tacotron but is based on a fully convolutional sequence-to-sequence learning model (Barron, 2017). We therefore aim to show that this handy TTS works reasonably well without demanding high resources.
The basis of this is two-fold: reviewing an already proposed fully CNN-based TTS system that can be trained much faster than an RNN-based state-of-the-art neural TTS system, while the sound quality remains acceptable (Efficiently Trainable, 2017).

2. PRELIMINARY

To understand how this works, you first need to know how sound is represented in the world of data science and what kinds of operations we can perform on this representation to obtain suitable outputs. Some basic concepts are mentioned below.

2.1 Fourier Transform (FT)

Before we start with what the Fourier Transform is, let us first understand some terminology that we will need in order to understand it better (Yi-Wen, 2015).

Time-Domain Representation or Waveform - A waveform, or audio signal, is the raw representation of a sound wave as a function of time and its intensities. This representation is also known as the time-domain representation or the time-amplitude representation (Yi-Wen, 2015).

Figure 1: Time-Domain Representation or a wave form

This paper was presented at the International Conference on Contemporary Issues in Computing (ICCIC-2020) - Virtual, IETE Sector V, Salt Lake, Kolkata, 25th-26th July 2020.
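As an illustration of these two representations (a sketch for this review, not code from the reviewed system; the sample rate and tone frequency are arbitrary choices), the time-domain waveform of a pure tone can be synthesized with NumPy, and a discrete Fourier Transform then recovers the frequency content from it:

```python
import numpy as np

# Time-domain representation: a 1-second, 440 Hz sine tone sampled at
# 16 kHz is just amplitude as a function of time.
sr = 16000                      # sample rate (Hz), an arbitrary choice
t = np.arange(sr) / sr          # 1 second of timestamps
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# The discrete Fourier Transform maps the time-domain signal into the
# frequency domain, revealing which frequencies are present and how strongly.
spectrum = np.fft.rfft(waveform)
freqs = np.fft.rfftfreq(len(waveform), d=1 / sr)

# The frequency bin with the largest magnitude should sit at the tone's
# frequency, 440 Hz.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(peak_freq)  # -> 440.0
```

With a 1-second signal the FFT bins fall on whole hertz, so the 440 Hz tone lands exactly on one bin; for real speech, the Short-Time Fourier Transform (STFT) applies this same transform to short overlapping windows so that frequency content can be tracked over time.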