International Journal of Engineering and Technical Research (IJETR), ISSN: 2321-0869, Volume-2, Issue-11, November 2014, www.erpublication.org

Abstract—Automatic speech recognition has immense applications and has been an active research area for several decades. The Phonetic Engine (PE) is the first stage of Automatic Speech Recognition (ASR). It converts input speech into a sequence of phonetic symbols and is followed by a language model that incorporates lexical knowledge of the given language. The phonetic engine uses a pattern recognition approach to recognize the phonemes present in the acoustic signal. Because Malayalam has a large number of phonemes, the phoneme classes become more confusable, and the performance of the developed phonetic engine is therefore inadequate. To improve the performance of the real-time phonetic engine, we have developed a front end that automatically segments long test utterances into short segments. This is done by detecting pauses automatically using a feedforward neural network designed for speech/non-speech classification. The phonetic engine with this segmentation front end performs better than the baseline phonetic engine.

Index Terms—ANN classifier, automatic speech recognition, phonetic engine, prosody, segmentation

I. INTRODUCTION

Recognition of speech signals has immense applications, and researchers have been trying to build a state-of-the-art speech recognition engine for decades [1]. A system that converts a speech signal to text is termed an Automatic Speech Recognition (ASR) system. An ASR system is usually built in two stages. The phonetic engine is the first stage, and it converts the speech signal to phonetic symbols. It uses the acoustic-phonetic information present in the speech signal, represented by features such as Mel Frequency Cepstral Coefficients (MFCC). The phonetic engine is followed by a language model that incorporates lexical knowledge of the given language into the ASR system. Figure 1 shows the block diagram of an ASR system.
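As an illustration of the pause-based segmentation idea summarized in the abstract, the following minimal Python sketch splits an utterance wherever a run of non-speech frames is long enough to count as a pause. It assumes that frame-level speech/non-speech decisions are already available (in this paper they come from a feedforward neural network, which is not reproduced here). The function name `segment_at_pauses`, the 10 ms frame shift, and the 300 ms minimum pause duration are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple


def segment_at_pauses(
    is_speech: List[bool],
    frame_shift_s: float = 0.010,  # assumed frame shift (10 ms), not from the paper
    min_pause_s: float = 0.300,    # assumed minimum pause duration, not from the paper
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of the speech segments.

    Non-speech runs shorter than min_pause_s are kept inside the
    surrounding segment; longer runs are treated as pauses and
    split the utterance.
    """
    min_pause_frames = round(min_pause_s / frame_shift_s)
    segments: List[Tuple[float, float]] = []
    start = None        # first frame index of the current segment
    last_speech = None  # most recent frame index labelled as speech

    for i, speech in enumerate(is_speech):
        if not speech:
            continue
        if start is None:
            # First speech frame of the utterance opens a segment.
            start = i
        elif i - last_speech - 1 >= min_pause_frames:
            # The non-speech gap since the last speech frame is a pause:
            # close the current segment and open a new one here.
            segments.append((start * frame_shift_s,
                             (last_speech + 1) * frame_shift_s))
            start = i
        last_speech = i

    if start is not None:
        # Close the final segment at the end of its last speech frame.
        segments.append((start * frame_shift_s,
                         (last_speech + 1) * frame_shift_s))
    return segments
```

With this rule, short non-speech runs such as the closure of a plosive stay inside a segment, and only longer pauses produce a cut. Each returned segment could then be passed to the phonetic engine separately.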
Malayalam is spoken mostly by the people of Kerala and Lakshadweep. It is one of the scheduled languages of India and also has classical language status. Implementing an automatic speech recognition engine for Malayalam is therefore of considerable cultural and economic significance. Malayalam has fifteen vowels, forty-one consonants and six special phonemes called 'chillu'. Of these sixty-two letters, we have considered the forty frequently occurring phonemes, which are necessary for creating the phonemic classes in Malayalam.

Manuscript received November 22, 2014.
Deekshitha G, Electronics and Communication, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, India.
Jubin James Thennattil, Electronics and Communication, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, India.
Leena Mary, Electronics and Communication, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, India.

There are many issues in creating a real-time, large-vocabulary phonetic engine for continuous speech. The speech signal may not be recorded in a studio environment, and the silence regions may contain noise or energy due to background disturbances. Phonemes such as plosives and fricatives have lower energy than vowels and may be misclassified as silence or background noise. Hence the phonetic engine makes insertion and substitution errors, owing to the testing environment and the large number of classes.

Prosody is of interest to ASR researchers, as it is important for human speech recognition [2, 3]. In all languages, prosody is used to convey structural, semantic, and functional information. Prosody provides valuable information that is often not available from text alone, for example information on phrasing, disfluencies, and emotion. A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries.
Such locations are obvious in text (via punctuation, capitalization, and formatting) but are absent or hidden in speech output [2]. Prosody in the form of long pauses helps humans parse longer utterances into shorter ones. This has motivated us to use a prosodic characteristic, the pause, to automatically segment test utterances into shorter phrases. This segmentation helps to reduce the misclassification of silence as other phonemes.

The paper is organized as follows. Section II describes the baseline phonetic engine and the issues faced. The modified phonetic engine is explained in Section III, and automatic segmentation is described in Subsection III.A. The performance of the proposed system is evaluated in Section IV. Finally, Section V concludes the paper and outlines the scope for future work.

Figure 1: Block schematic of an ASR (Speech → Phonetic Engine → Phoneme → Language Model → Text)

Implementation of Automatic Segmentation of Speech Signal for Phonetic Engine in Malayalam
Deekshitha G, Jubin James Thennattil, Leena Mary

II. IMPLEMENTATION OF PHONETIC ENGINE

Automatic speech recognition consists of transforming the speech signal into a sequence of symbols corresponding to the subword units of speech, and converting the symbol sequence into text. Typically, continuous speech recognition is performed in two steps: (1) speech signal-to-symbol (phonetic/syllabic) transformation, and (2) symbol-to-text conversion. Speech signal-to-symbol transformation is performed by a phonetic engine as shown in