© 2015, IJARCSSE All Rights Reserved Page | 475
Volume 5, Issue 10, October-2015 ISSN: 2277 128X
International Journal of Advanced Research in
Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com
A Review of Unit Selection Speech Synthesis
Sangramsing Kayte
Department of Computer Science &
Information Technology
Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad, India
Monica Mundada
Department of Computer Science &
Information Technology
Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad, India
Dr. Charansing Kayte
Assistant Professor
Department of Digital and Cyber
Forensic, Aurangabad
Maharashtra, India
Abstract— Speech is used to express information, emotions, and feelings. Speech synthesis is the technique of
converting given input text into synthetic speech. Speech synthesis can be used to read text aloud, as in SMS messages, newspapers, and website
information, and can be used by blind people. Speech synthesis has been widely researched over the last four decades.
The quality and intelligibility of the synthetic speech produced are remarkably good for most applications. This
paper reviews four widely researched methods of speech synthesis, viz. articulatory, concatenative,
formant, and quasi-articulatory synthesis. The main focus is on the concatenative synthesis method, and
some issues with this method are discussed. Articulatory synthesis is based on a model of human speech production. The
synthetic speech produced by this model is the most natural, but it is also the most difficult method. Concatenative synthesis
uses prerecorded words and phrases and concatenates them to produce speech. It is the simplest method and yields
high-quality speech, but it is limited by the memory required to store in advance all the words and phrases to be
produced. Formant synthesis is based on the acoustic model of the human speech production system. It models the
sound source and the resonances of the vocal tract, and is the most commonly used model. Quasi-articulatory synthesis is a
hybrid of the articulatory and acoustic models of speech production. Synthetic speech produced by this model sounds more
natural and can be easily customized to meet the requirements of different applications and individual users.
Keywords— Unit selection Speech synthesis, articulatory synthesizer, formant synthesizer, concatenative synthesizer.
I. INTRODUCTION
Unit selection synthesis is also referred to as corpus-based synthesis. It uses a large speech database. During database creation, each
recorded utterance is segmented into units such as individual phones, syllables, morphemes, words, phrases, and sentences. An
index of the units in the speech database is then built from the segmentation and from acoustic parameters such as
fundamental frequency, pitch, duration, the status of the syllable, and the previous and next phones. This method provides
greater naturalness in the output speech than other techniques. Speech synthesis is a process of automatic generation of
speech by machines/computers. The goal of speech synthesis is to develop a machine having an intelligible, natural
sounding voice for conveying information to a user in a desired accent, language, and voice. Unit selection synthesis
shown in Fig. 1 is a type of concatenative synthesis in which the largest matching sound files available in the speech
corpus are concatenated to synthesize the target speech. It is capable of managing a large number of units [1] and also imparts
prosody beyond the role of F0. It is necessary to make a clear distinction between F0 and pitch: F0 is the
actual frequency generated by the vocal folds (vocal cords), while pitch is the listener's perception of that frequency.
Hence the two are not necessarily equal. This synthesis technique also retains the naturalness of the speech sounds
being generated. Choosing the unit length is an important task in concatenative speech synthesis. A shorter unit length
requires less space, but sample collection and labeling become more difficult and complex. A longer unit length gives
more naturalness [2], a better coarticulation effect, and fewer concatenation points, but requires more memory space. Choices
of unit for TTS are phonemes, diphones, triphones, demisyllables, syllables, and words [3][4].
Fig. 1 Unit Selection Synthesis system
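The idea of indexing units and concatenating the largest matching unit can be illustrated with a minimal Python sketch. It is not the implementation of any particular system: the Unit record, the build_index and select_units helpers, the phone labels, the file names, and the numeric values are all hypothetical and chosen only for illustration. Selection here is a simple greedy longest match with no ranking of competing candidates.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Unit:
    """One recorded segment in a hypothetical speech database."""
    phones: Tuple[str, ...]  # phone labels covered by this unit
    f0_hz: float             # mean fundamental frequency (illustrative value)
    duration_ms: float       # duration of the recording (illustrative value)
    source: str              # illustrative file name, not a real corpus path


def build_index(units: List[Unit]) -> Dict[Tuple[str, ...], List[Unit]]:
    """Index the units by the phone sequence they cover."""
    index: Dict[Tuple[str, ...], List[Unit]] = {}
    for unit in units:
        index.setdefault(unit.phones, []).append(unit)
    return index


def select_units(target: List[str],
                 index: Dict[Tuple[str, ...], List[Unit]]) -> List[Unit]:
    """Greedy selection: at each position take the longest indexed unit
    whose phones match the upcoming target phones."""
    longest = max((len(key) for key in index), default=0)
    selected: List[Unit] = []
    pos = 0
    while pos < len(target):
        for length in range(min(longest, len(target) - pos), 0, -1):
            key = tuple(target[pos:pos + length])
            if key in index:
                # Assumption: take the first candidate; no cost-based ranking.
                selected.append(index[key][0])
                pos += length
                break
        else:
            raise LookupError(f"no unit covers phone {target[pos]!r} at position {pos}")
    return selected


# Toy inventories: two word-sized units and seven phone-sized units.
WORD_UNITS = [
    Unit(("h", "e", "l", "o"), 120.0, 410.0, "hello.wav"),
    Unit(("w", "er", "l", "d"), 115.0, 430.0, "world.wav"),
]
PHONE_UNITS = [Unit((p,), 118.0, 90.0, f"{p}.wav")
               for p in ("h", "e", "l", "o", "w", "er", "d")]

target = ["h", "e", "l", "o", "w", "er", "l", "d"]
long_units = select_units(target, build_index(WORD_UNITS + PHONE_UNITS))
short_units = select_units(target, build_index(PHONE_UNITS))
print(len(long_units) - 1, "concatenation point(s) with word-sized units")
print(len(short_units) - 1, "concatenation point(s) with phone-sized units")
```

With this toy data the word-sized inventory covers the target with two units and one concatenation point, while the phone-only inventory needs eight units and seven concatenation points. This mirrors the trade-off described above: longer units mean fewer joins but a larger inventory to store.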
Unit-selection speech synthesis has become increasingly popular due to its enhanced prosodic quality and naturalness
when compared to parametric or diphone synthesizers. The principle is based on the concatenation of naturally-produced