International Journal of Computer Trends and Technology (IJCTT) – Volume 42 Number 2 – December 2016 ISSN: 2231-2803 http://www.ijcttjournal.org Page 102

Large Vocabulary in Continuous Speech Recognition Using HMM and Normal Fit

Hemakumar G #1, Punithavalli M *2, Thippeswamy K #3
1# Research Scholar, Bharathiar University, Coimbatore, Tamil Nadu, India and Government College for Women (Autonomous), Mandya, Karnataka.
2* Department of Computer Application, Bharathiar University, Coimbatore, Tamil Nadu, India.
3# Department of Computer Science, Visvesvaraya Technological University, Mysore Regional Centre, Mysuru, Karnataka, India.

Abstract— This paper addresses the problem of large-vocabulary, speaker-independent continuous speech recognition using phonemes, the Hidden Markov Model (HMM) and the normal fit method. We first detect the voiced parts of the speech signal by computing a dynamic threshold in each frame. Real cepstrum coefficients are then extracted as features from the voiced frames, and the Baum–Welch algorithm is applied to train on those features. Finally, the normal fit technique is applied, and the output values are labelled with the corresponding phoneme or syllable. The model is tested on five languages: English, Kannada, Hindi, Tamil and Telugu. Automatic segmentation of the speech signal achieves an average accuracy of 95.42% with a miss rate of about 4.58%. For the large vocabulary, the average Word Recognition Rate (WRR) is 85.16% and the average Word Error Rate (WER) is 14.84%. All computations are done using MATLAB.

Keywords — Automatic Speech Recognition (ASR), Speech Enhancement, Speech Perception, HMM and Normal fit method.

I. INTRODUCTION

Automatic Speech Recognition is a computerized process in which a machine receives a speech recording as input and produces a transcription as output.
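The front end described in the abstract (dynamic-threshold voiced-frame detection followed by real cepstrum feature extraction) can be sketched as follows. The paper's computations are done in MATLAB, so this NumPy version is only illustrative; in particular, the short-time energy statistic used for the dynamic threshold is an assumption, since the exact statistic is not specified here.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split signal x into overlapping frames (one frame per row)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def voiced_frames(frames):
    """Keep frames whose short-time energy exceeds a dynamic threshold.
    Using the utterance's mean frame energy as the threshold is an
    illustrative assumption, not the paper's stated method."""
    energy = np.sum(frames ** 2, axis=1)
    threshold = energy.mean()
    return frames[energy > threshold]

def real_cepstrum(frame, n_coeff=13):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    A small constant avoids log(0) for silent bins."""
    spectrum = np.abs(np.fft.fft(frame)) + 1e-12
    return np.fft.ifft(np.log(spectrum)).real[:n_coeff]
```

For example, on a signal whose first half is silence and whose second half is a tone, `voiced_frames` retains only the tonal frames, and `real_cepstrum` then yields a short feature vector per retained frame.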
The main aim of an ASR system is to accurately and efficiently convert a speech signal into a text transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker's accent, or the acoustic environment in which the speaker is located (e.g., office, noisy room, outdoors). The major problem that complicates ASR implementation is speaker variability. Because ASR systems are intended for general use, they must support multiple speakers and adapt to all the variation this introduces. Variations in speaking style, pitch and anatomy make each speaker unique. Background noise, utterances and dialects can also negatively affect the interpretation of speech, and even words that sound alike can create problems for ASR systems [1]. Performance problems arise when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist HMM or Artificial Neural Network (ANN) systems in the context of large-vocabulary continuous speech recognition (LVCSR) [2]. ASR performance may also suffer when the speech signal is degraded by noise, such as Gaussian white, pink, red and grey noise introduced at recording time. Noise from multiple speakers during recording is the most challenging to handle. Such noise should be reduced before signal segmentation and feature extraction. Speech enhancement is required to improve the intelligibility and overall perceptual quality of a degraded speech signal using audio signal processing techniques. Enhancement of a speech signal corrupted by noise is commonly performed in the short-time discrete Fourier transform domain [3].
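Short-time Fourier-domain enhancement of the kind cited above can be illustrated with a basic magnitude spectral subtraction sketch. The frame length, overlap, spectral floor, and the assumption that the first few frames contain noise only are hypothetical choices for illustration; the cited works use more elaborate estimators.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, floor=0.01):
    """Basic magnitude spectral subtraction in the short-time DFT domain.
    The noise magnitude spectrum is estimated from the first few frames,
    which are assumed to contain noise only (an illustrative assumption)."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    n = 1 + (len(noisy) - frame_len) // hop
    stft = np.stack([np.fft.rfft(window * noisy[i * hop : i * hop + frame_len])
                     for i in range(n)])
    noise_mag = np.abs(stft[:noise_frames]).mean(axis=0)  # noise estimate
    mag = np.abs(stft) - noise_mag                        # subtract noise
    mag = np.maximum(mag, floor * np.abs(stft))           # spectral floor
    clean_stft = mag * np.exp(1j * np.angle(stft))        # keep noisy phase
    # Overlap-add resynthesis back to the time domain.
    out = np.zeros(len(noisy))
    for i in range(n):
        out[i * hop : i * hop + frame_len] += np.fft.irfft(clean_stft[i], frame_len)
    return out
```

Note that only the magnitude is modified; the noisy phase is reused, consistent with the relative insensitivity of hearing to short-time phase discussed below.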
A Bayesian algorithm for speech enhancement under stochastic and deterministic speech models makes provision for the inclusion of a priori information by considering a non-zero mean [4]. Subspace filters allow explicit control of the trade-off between noise reduction and speech distortion via the chosen rank of the signal subspace [5]. Paper [6] discusses the measurement of enhancement over a wide range of distortions introduced by four types of real-world noise at two signal-to-noise ratio levels, for four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based, and Wiener algorithms. Problems in designing an ASR system may also arise when selecting the frame size. In an ASR system, windowing is performed using short-time frequency analysis. In practice, however, it has been concluded that human hearing is relatively insensitive to short-time phase distortion of the speech signal, so there is no apparent reason to use symmetric windows, which give a linear phase response [7]. Paper [8] discusses the statistical paradigm in speech recognition for phonetic and phonological knowledge sources, covering computational phonology and mathematical models such as Bayesian analysis, statistical estimation theory, non-stationary time series, dynamic system theory and nonlinear function approximation theory.
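The short-time frequency analysis underlying the frame-size and windowing discussion can be sketched as follows. The 25 ms frame, 10 ms hop and 8 kHz sampling rate are typical values assumed for illustration, not taken from the paper; a symmetric Hamming window is used, and only the magnitude spectrum is retained, reflecting the phase insensitivity noted in [7].

```python
import numpy as np

def short_time_magnitude(x, fs=8000, frame_ms=25, hop_ms=10):
    """Short-time magnitude spectra with a symmetric Hamming window.
    Since hearing is relatively insensitive to short-time phase,
    only |X(k)| is kept; frame/hop sizes are illustrative defaults."""
    frame_len = int(fs * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)           # 80 samples at 8 kHz
    window = np.hamming(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([np.abs(np.fft.rfft(window * x[i * hop : i * hop + frame_len]))
                     for i in range(n)])
```

For a pure 1 kHz tone sampled at 8 kHz, each frame's spectrum peaks at the DFT bin nearest 1 kHz (bin 25 for a 200-point frame, where bins are spaced 40 Hz apart).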