Continuity Metric for Unit Selection based Text-to-Speech Synthesis

Vikram Ramesh Lakkavalli, Arulmozhi P and A G Ramakrishnan
Medical Intelligence and Language Engineering (MILE) Laboratory
Department of Electrical Engineering
Indian Institute of Science, Bangalore, 560012, INDIA
Email: {vikram.ckm, p.arulmozhi}@gmail.com, ramkiag@ee.iisc.ernet.in

Abstract: A new unit selection methodology based on a unit continuity metric (UCM) is proposed to improve the optimality of unit selection. UCM employs two features, the pitch continuity metric (PCM) and the spectral continuity metric (SCM). It has been implemented and tested on our test bed, MILE-TTS, which is available as a web demo. The algorithm is evaluated through the mean opinion score (MOS) given by native users of the language and through a self-selection test. The proposed method gives a self-selection accuracy of 100% for a set of 10 text files. A non-UCM based algorithm, which uses Mel frequency cepstral coefficients (MFCC) as the feature at the unit concatenation boundary, returns an average of 83% for the same 10 input files. Performance comparison shows that the MOS scores, naturalness and comprehension are better with the UCM based algorithm than with non-UCM based ones. To further enhance the naturalness of the TTS output, a new rule based algorithm for predicting pauses between words is developed for the Tamil language, based on the part of speech (POS) information of the input text.

Index Terms: unit selection, MFCC, unit continuity metric, pitch continuity metric, spectral continuity metric, MILE-TTS, part of speech (POS), pause model, PCM, SCM.

I. INTRODUCTION

Text-to-speech (TTS) synthesis involves the production of a speech signal corresponding to a given input text. It lets people listen to information rather than read it from a screen or paper. In a multilingual country like India, TTS helps people who know more than one language but cannot read the script of some of them, since they can still understand the spoken message.
Also, TTS is a boon to blind people and others with visual disorders, for reading text from a computer screen or printed books.

TTS synthesis can be broadly classified into i) parametric and ii) non-parametric or unit based TTS synthesis. Each kind of TTS system has its own advantages and disadvantages. Parametric synthesis is built on model based concepts, as in HMM based synthesis, formant based synthesis and sinusoidal model based synthesis. Unit selection based TTS systems, in contrast, use an annotated speech corpus and concatenate the corresponding units to produce the output speech. Depending on the choice of basic speech unit, TTS can further be categorized into demisyllable based, diphone based [4], syllable based [2] and polyphone based [14], [15] TTS systems. An appropriate choice of speech unit type not only has the potential to improve the quality of the synthesized speech, but also determines the amount and kinds of signal processing challenges.

The quality of a TTS system is determined by the intelligibility of the synthesized speech and by its naturalness, a quality that indicates closeness to a human voice. Natural speech has good prosody, where prosody is defined as the collection of the dynamic features of speech such as pitch, formant, duration, pause and stress. There has been much work on statistical prosody models to make TTS synthesis more natural. In this paper, we employ prosodic information for unit selection, together with rule based estimation of pauses between words; both are explained in the subsequent sections.

For Indian languages, there has not been much progress in TTS technology. Some of the reasons for this are:
1) Lack of a good prosody model for Indian languages.
2) Lack of concerted efforts to build good annotated speech corpora.
3) Absence of research and study in computational linguistics.

We have developed a test bed for developing TTS, named MILE-TTS, which uses variable length polyphones as basic units.
MILE-TTS currently supports Kannada and Tamil speech synthesis.

II. DESCRIPTION OF MILE-TTS

MILE-TTS is a text-to-speech synthesis system that uses a variable length polyphonic unit as the basic unit for concatenation. The length of each polyphonic unit is selected depending on the word and the sequence of phonetic units. The MILE-TTS engine employs polyphonic unit based concatenation, where appropriate speech segments are selected from a manually segmented, annotated speech database. The Kannada database has 8 hours of speech data with 1110 phonetically rich sentences, recorded by a professional Kannada male speaker and stored at a 16 kHz sampling frequency. The Tamil database contains 5 hours of speech data with 1027 phonetically rich sentences, stored at the same rate. The databases are segmented and labeled at the phoneme level. There are a total of 64841 basic polyphonic units in the Kannada database and 42012 in the Tamil database. The MILE-TTS test bed is language independent, although at present only Kannada and Tamil speech synthesis is supported.
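
To make the idea of variable length polyphonic units concrete, the following is a minimal Python sketch of how a target phoneme sequence could be split into units that exist in a phoneme-labeled database. It is an illustration only: the greedy longest-match strategy, the function names `build_inventory` and `segment_longest_match`, the `max_len` parameter and the toy single-letter "phonemes" are all assumptions made here, not the procedure used by MILE-TTS, whose rule for choosing the unit length is not specified in this section.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical inventory type: phoneme tuple -> list of (sentence id, start index).
Inventory = Dict[Tuple[str, ...], List[Tuple[int, int]]]


def build_inventory(corpus: Sequence[Sequence[str]], max_len: int) -> Inventory:
    """Index every phoneme run of length 1..max_len found in the labeled corpus."""
    inventory: Inventory = {}
    for sent_id, phones in enumerate(corpus):
        for start in range(len(phones)):
            for n in range(1, max_len + 1):
                if start + n > len(phones):
                    break
                key = tuple(phones[start:start + n])
                inventory.setdefault(key, []).append((sent_id, start))
    return inventory


def segment_longest_match(
    target: Sequence[str], inventory: Inventory, max_len: int
) -> List[Tuple[str, ...]]:
    """Split the target into units, always taking the longest unit in the inventory.

    Assumption: greedy left-to-right matching; a real system would also weigh
    the continuity between neighbouring candidates.
    """
    units: List[Tuple[str, ...]] = []
    i = 0
    while i < len(target):
        for n in range(min(max_len, len(target) - i), 0, -1):
            candidate = tuple(target[i:i + n])
            if candidate in inventory:
                units.append(candidate)
                i += n
                break
        else:
            # Not even the single phoneme is available in the database.
            raise KeyError(f"phoneme {target[i]!r} not found in inventory")
    return units


if __name__ == "__main__":
    # Toy corpus: each letter stands in for one phoneme label.
    toy_corpus = [list("kannada"), list("tamil")]
    inv = build_inventory(toy_corpus, max_len=3)
    print(segment_longest_match(list("kanada"), inv, max_len=3))
```

Longer units reduce the number of concatenation points, which is one plausible motivation for preferring them. When several database instances exist for the same unit, a selection criterion such as a continuity measure at the unit boundaries is still needed to choose among them.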