Speech rate estimation using representations learned from speech with convolutional neural network

Renuka Mannem, IISc Bangalore; Hima Jyothi, RGUKT, Kadapa; Aravind Illa, IISc Bangalore; Prasanta Kumar Ghosh, IISc Bangalore

Abstract—With advancements in machine learning techniques, several speech-related applications deploy end-to-end models to learn relevant features from the raw speech signal. In this work, we focus on the speech rate estimation task, using an end-to-end model to learn representations from raw speech in a data-driven manner. We propose an end-to-end model that comprises a 1-d convolutional layer to extract representations from raw speech and a convolutional dense neural network (CDNN) to predict speech rate from these representations. The primary aim of this work is to understand the nature of the representations learned by the end-to-end model for the speech rate estimation task. Experiments are performed using the TIMIT corpus, in seen and unseen subject conditions. Experimental results reveal that the frequency responses of the learned 1-d CNN filters are low-pass in nature, and that the center frequencies of the majority of the filters lie below 1000 Hz. Comparing the proposed end-to-end system with the baseline MFCC-based approach, we find that the performance of the features learned with the CNN is on par with that of MFCC.

I. INTRODUCTION

Signal processing and machine learning techniques play an important role in understanding speech for human-computer interaction, with applications including speech-to-text, speaker recognition, and diagnostic tools in speech pathology. Apart from linguistic information, humans embed in speech a variety of cues such as intonation, stress, and speech rate variations, known as para-linguistic information, which convey the emotional state [1], nativity [2], gender, and health condition of a speaker [3], [4].
These factors affect the performance of tools developed for speech technologies [5], [6]. Recent trends in speech research focus on characterizing the para-linguistic information encoded in speech over long durations (supra-segments). The objective of this work is the automatic estimation of speech rate from raw speech. The speech rate of a given speech segment is computed as the number of syllables per second [5], [7].

Speech rate is an important factor in various applications including automatic speech recognition (ASR) [5], where variations in speech rate impact ASR performance. Speech rate is also used in speech modification [8], to observe variations in emotion [1], and in the assessment of non-nativeness [2]. In clinical applications, speech rate is considered an important cue in assessing the decline of speech and articulatory movements in dysarthric patients [3], [4].

To estimate the speech rate, various methods have been proposed in the literature. Hidden Markov models (HMMs) are used to estimate speech rate in [2], [9]–[11]; these require a reference transcription, which may not always be available in practice [7]. Several methods have been proposed that do not require transcriptions but only speech acoustics [12]–[17]. In this regard, there are both unsupervised and supervised approaches. Among unsupervised approaches, a peak detection strategy [16] using a convex weighting criterion was used for speech rate estimation. A temporal correlation and selected sub-band correlation (TCSSBC) based feature contour was utilized in [7], [17] to estimate speech rate, in which peak detection was performed with smoothing and thresholding operations. Among supervised approaches, a Gaussian mixture model (GMM) based method was proposed to classify speech into slow, medium, and fast rate classes, and these class probabilities were used to estimate the speech rate.
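The unsupervised contour-based approaches above share a common recipe: smooth an energy-like feature contour, threshold it, and count the surviving local maxima as syllable nuclei, so that the count divided by the segment duration gives syllables per second. The sketch below illustrates that recipe only; the function name, smoothing and threshold parameters, and the synthetic contour are hypothetical, and this is not the TCSSBC method of [7], [17] itself.

```python
import numpy as np

def estimate_speech_rate(energy, fs_frames, threshold_ratio=0.5, smooth_len=5):
    """Estimate speech rate (syllables/s) from an energy contour by
    smoothing, thresholding, and counting local maxima (syllable nuclei).
    Illustrative sketch; parameters are hypothetical."""
    # Moving-average smoothing of the contour
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(energy, kernel, mode="same")
    # Threshold relative to the contour's maximum
    thr = threshold_ratio * smoothed.max()
    # Count local maxima above the threshold
    peaks = 0
    for i in range(1, len(smoothed) - 1):
        if smoothed[i] > thr and smoothed[i] > smoothed[i - 1] \
                and smoothed[i] >= smoothed[i + 1]:
            peaks += 1
    duration_s = len(energy) / fs_frames
    return peaks / duration_s

# Synthetic contour: 6 syllable-like bumps over 2 s at 100 frames/s
fs_frames = 100
t = np.arange(200) / fs_frames
contour = np.zeros_like(t)
for c in np.linspace(0.2, 1.8, 6):
    contour += np.exp(-((t - c) ** 2) / (2 * 0.03 ** 2))
rate = estimate_speech_rate(contour, fs_frames)
print(round(rate, 1))  # ~3.0 syllables/s (6 syllables in 2 s)
```

In practice the contour would come from speech (e.g., a sub-band energy or correlation contour), and the smoothing and thresholding choices are exactly where methods such as [7], [16], [17] differ.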
Recently, using neural networks, syllable rate estimation has been formulated as a regression problem, in which the mean squared error (MSE) loss between the estimated and original speech rate is optimized to train a convolutional dense neural network (CDNN) [18].

In all the above approaches, the features are either knowledge-based and hand-crafted or adapted from other speech tasks. Since speech rate variations change several characteristics of the acoustics as well as the articulatory movements, features have been proposed based on various acoustic and articulatory characteristics [18], [19]. Recent advances in machine learning enable learning relevant features for a given task from the raw waveform using an end-to-end network. Such networks have been applied to speech recognition [20], [21], speaker verification [22], [23], and several other speech-related applications [24], [25].

In this work, we investigate learning representations from the raw waveform for the speech rate estimation task. To learn representations from the raw waveform, we deploy 1-d CNN filters as a first layer, following the works on other speech tasks in [20], [24]. Using this framework, we hypothesize that the representations learned from the raw waveform will differ from the mel-scale representations observed in [20], [24]. This is because, unlike the speech recognition task, where the learned representations have to distinguish different phonemes, speech rate estimation does not require discriminating phonemes. For estimating the number of syllables, we hypothesize that learning filters that emphasize specific spectral regions could benefit the speech rate estimation task. Thereby, the learned CNN filters optimized for the speech rate estimation task could differ from the mel-scale and may not necessarily span the complete frequency spectrum.
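The filter analysis behind claims such as "center frequencies of the majority of the filters lie below 1000 Hz" can be sketched as follows: compute each learned 1-d filter's magnitude response with an FFT and take the frequency of its peak as the center frequency. The function below and the two synthetic "learned" filters (a moving-average low-pass and a windowed sinusoid near 500 Hz) are illustrative assumptions, not the paper's actual trained weights.

```python
import numpy as np

def filter_center_frequencies(filters, fs=16000, n_fft=512):
    """Given a bank of 1-d CNN filters (shape: n_filters x n_taps),
    return each filter's center frequency, taken here as the frequency
    of its peak magnitude response. Illustrative analysis sketch."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Zero-padded FFT of each filter's impulse response
    mags = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    return freqs[np.argmax(mags, axis=1)]

# Two synthetic filters standing in for learned ones (hypothetical values)
fs = 16000
taps = np.arange(64)
lowpass = np.ones(64) / 64.0                                  # DC-centered
bandpass = np.sin(2 * np.pi * 500 * taps / fs) * np.hanning(64)
cf = filter_center_frequencies(np.stack([lowpass, bandpass]), fs=fs)
print(cf[0], cf[1])  # low-pass peaks at 0 Hz; band-pass near 500 Hz
```

Applied to the first-layer weights of a trained end-to-end model, the same analysis would yield the distribution of center frequencies that the experiments in this paper examine.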