International Journal of Computer Applications (0975 – 8887) Volume 53 – No.16, September 2012

Removal of Spectral Discontinuity in Concatenated Speech Waveform

Deepika Singh
Department of Computer Science and Engineering, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India

Parminder Singh
Associate Professor, Department of Computer Science and Engineering, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India

ABSTRACT
Speech synthesis systems which involve concatenation of recorded speech units are currently very popular. These systems are known for producing high quality, natural-sounding speech as they generate speech by joining together waveforms of different speech units. This method of speech generation is quite practical. However, the speech units that are being concatenated may have different spectra on either side of the concatenation points. Such mismatches are spectral in nature and give rise to spectral discontinuity in concatenated speech waveforms. The presence of such discontinuities can be very distracting to the listener and degrades the overall quality of the output speech. This paper proposes a speech signal processing technique that deals with the problem of spectral discontinuity in the context of concatenated waveform synthesis. It involves post-processing of the synthesized speech waveform in the time domain. The technique is implemented on different single-channel Punjabi wave audio files which were created by concatenating different Punjabi syllables. A listening test was conducted to evaluate the proposed technique, and it was observed that the spectral discontinuity is reduced to a large extent and the output speech sounds more natural with the reduction of audible noise.

General Terms
Technique for speech signal processing

Keywords
Speech waveform, Concatenative speech synthesis, Spectral discontinuity

1. INTRODUCTION
Speech is the primary form of communication used by human beings to express their thoughts, feelings and ideas. Speech production involves a series of complex movements that alter and mould the basic tone created by the human voice into specific sounds [1]. The mechanism for generating the human voice can be subdivided into three parts: the lungs, the vocal folds within the larynx, and the articulators (the parts of the vocal tract above the larynx, consisting of the tongue, palate, cheeks, lips, nose and teeth). Speech sounds are created when air pumped from the lungs causes vibratory activity in the human vocal tract. These vibrations can be represented by speech waveforms. Figure 1 shows a visual representation of vibrations typical of those in human speech: a speech waveform for the Punjabi word “ਦਰ”.

Figure 1: Example Speech Waveform for Punjabi word “ਦਰ”

A computer system with the ability to convert written text into speech is known as a Text-To-Speech (TTS) synthesis system. The quality of a speech synthesizer is judged by naturalness, which refers to the similarity of the generated speech to the real human voice, and intelligibility, which refers to how easily the generated speech can be understood. The main goal of researchers and linguists is to create ideal speech synthesis systems which are both natural and intelligible.

Three types of methods are mainly used for synthesizing artificial speech: articulatory synthesis, formant synthesis and concatenative synthesis [2]. Articulatory and formant synthesis are rule-based methods, whereas concatenative synthesis is a database-driven method. Articulatory synthesis uses a physical model of the human speech production organs and articulators. Formant synthesis models the frequencies of the speech signal based on the source-filter model.
In formant synthesis, parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a speech waveform based on certain rules. Concatenative synthesis generates speech by concatenating recorded speech units and is described in more detail in Section 2.

The remainder of this paper is organized as follows. Section 2 presents an overview of concatenative speech synthesis. In Section 3, the problem of spectral discontinuity in the context of concatenative speech synthesis is discussed. Section 4 explains the stages of the technique proposed to remove audible spectral discontinuities in concatenated speech waveforms. Section 5 evaluates the results of the proposed technique. Finally, Section 6 presents conclusions and future work.
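As an aside to the description of formant synthesis given earlier in this section, the source-filter idea can be illustrated with a minimal sketch. It is not taken from this paper: it assumes a textbook arrangement in which an excitation signal (an impulse train at the fundamental frequency for voiced sounds, plus optional noise) is passed through a cascade of second-order resonators, one per formant. The function names, the 16 kHz sample rate, and the formant frequencies and bandwidths are all illustrative assumptions, and the parameters are held constant over the frame rather than varied over time by rules.

```python
import math
import random


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator modelling one formant.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with coefficients chosen so that the gain at 0 Hz is 1.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def excitation(num_samples, f0_hz, voiced, noise_level, sample_rate, seed=0):
    """Source signal: impulse train at f0 (if voiced) plus uniform noise."""
    rng = random.Random(seed)
    period = max(1, round(sample_rate / f0_hz))  # samples per pitch period
    source = []
    for n in range(num_samples):
        pulse = 1.0 if (voiced and n % period == 0) else 0.0
        noise = noise_level * rng.uniform(-1.0, 1.0)
        source.append(pulse + noise)
    return source


def synthesize(num_samples, f0_hz, formants, voiced=True,
               noise_level=0.0, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = excitation(num_samples, f0_hz, voiced, noise_level, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative values only (assumed, roughly an open vowel): 0.1 s at 16 kHz.
vowel = synthesize(1600, 120.0, [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

In a rule-based synthesizer the fundamental frequency, voicing flag, noise level and formant values would be updated from frame to frame; the sketch keeps them fixed only to keep the example short.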