IJITEE, Vol. 3, No. 2, June 2019

Real-Time Indonesian Language Speech Recognition with MFCC Algorithms and Python-Based SVM

Wening Mustikarini 1, Risanuri Hidayat 2, Agus Bejo 3

Abstract—Automatic Speech Recognition (ASR) is a technology that uses machines to process and recognize the human voice. One way to increase the recognition rate is to use a model of the language to be recognized. In this paper, a speech recognition application is introduced that recognizes the words "atas" (up), "bawah" (down), "kanan" (right), and "kiri" (left). This research used 400 speech samples: 75 samples of each word as training data and 25 samples of each word as test data. The speech recognition system was designed using 13 Mel Frequency Cepstral Coefficients (MFCC) as features and a Support Vector Machine (SVM) as the classifier. The system was tested with linear and RBF kernels, various cost values, and three sample sizes (n = 25, 50, 75). The best average accuracy was obtained from the SVM with a linear kernel, a cost value of 100, and a data set consisting of 75 samples from each class. During the training phase, the system achieved an F1-score (the harmonic mean of precision and recall) of 80% for the word "atas", 86% for the word "bawah", 81% for the word "kanan", and 100% for the word "kiri". In the testing phase, using 25 new samples per class, the F1-score was 76% for the "atas" class, 54% for the "bawah" class, 44% for the "kanan" class, and 100% for the "kiri" class.

Keywords—Automatic Speech Recognition, Indonesian Language, MFCC, SVM.

I. INTRODUCTION

Research on ASR has been carried out for more than 40 years, yet studies continue to seek an ASR that can recognize speech on any subject and in various languages. One way to increase the recognition rate is to use a model of the language that needs to be recognized.
Indonesian is a non-mainstream language that, unlike other languages, does not yet have a speech corpus [1], [2], which affects the level of speech recognition for Indonesian. In addition, devices that use voice input have been limited to languages other than Indonesian. For example, the S Voice virtual assistant application from Samsung only accepts input in English, Mandarin, and Korean, and Alexa on the Amazon Echo accepts input in English, French, Japanese, Spanish, and Italian. For a voice recognition system to work better in Indonesian, the machine first needs to be taught using an Indonesian speech corpus. However, automatic speech recognition requires a model of the language in question, and not all languages have such a model. Building a language model is a complicated process because it requires a large amount of speech from many speakers. Based on this explanation, it can be concluded that a speech recognition system using an Indonesian language model is needed to improve performance in recognizing Indonesian words.

II. AUTOMATIC SPEECH RECOGNITION (ASR)

Automatic Speech Recognition (ASR) is commonly used to convert speech into text. ASR is also used for biometric authentication, which authenticates users by their voice. Following the identification or recognition process, ASR can be used to perform a task based on the recognized instructions. To work properly, ASR requires a configuration or voice that has been saved from the user. Users need to train the ASR, or machine, by storing speech patterns (features) in the system. To obtain these features, data processing (feature extraction) is needed to produce values that represent the information contained in the data. In addition, a method is needed to recognize those features.

A. Feature Extraction with Mel Frequency Cepstral Coefficient (MFCC)

In the training phase, the system goes through a learning process in order to recognize words. To do this, the system requires information in the form of word patterns obtained from feature extraction. Feature extraction in speech recognition is a computation on the voice signal that produces a feature vector representing the signal. These features are then compared with the features of the test data. In this paper, Mel Frequency Cepstral Coefficient (MFCC) was used to extract 13 coefficients as features. At present, MFCC is the most commonly used feature in speech recognition and speaker verification because it works well on inputs with a high level of correlation, i.e., it removes information that is not needed. In addition, MFCC can represent human voice and music signals well because it uses the mel frequency [3]. The mel scale relates the frequency (pitch) heard or perceived by humans to the actual measured frequency [4]. Equation (1) defines the relationship between the mel frequency and an actual frequency f. Conversely, (2) is used to obtain a frequency value from a mel-scale value m.

$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right) \qquad (1)$$

$$M^{-1}(m) = 700 \left(e^{m/1125} - 1\right) \qquad (2)$$

1 Department of Electrical Engineering and Information Technology, Faculty of Engineering, Gadjah Mada University, Jl. Grafika No.2, Yogyakarta, 55281, INDONESIA (tel/fax: 0274-552305/547506; e-mail: wening.mustikarini@mail.ugm.ac.id)

2, 3 Lecturer, Department of Electrical Engineering and Information Technology, Faculty of Engineering, Gadjah Mada University, Jl. Grafika No.2, Yogyakarta, 55281, INDONESIA (tel/fax: 0274-552305/54750)

Wening Mustikarini: Real-Time Indonesian Language ... ISSN 2550 – 0554 (Online) 55
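The mel-scale conversions in (1) and (2) can be illustrated with a short Python sketch. It is only an illustration and not the implementation used in this paper: the function names `hz_to_mel`, `mel_to_hz`, and `mel_spaced_frequencies` are hypothetical, and the last helper (placing points at equal mel intervals, as is commonly done when laying out a mel filterbank) is an assumption added for demonstration.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Equation (1): convert a frequency in Hz to the mel scale."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Equation (2): convert a mel value back to a frequency in Hz."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


def mel_spaced_frequencies(f_low_hz: float, f_high_hz: float, n_points: int) -> list:
    """Return n_points frequencies (Hz) equally spaced on the mel scale.

    Hypothetical helper: illustrates that equal steps in mel correspond
    to increasingly wide steps in Hz.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    m_low = hz_to_mel(f_low_hz)
    m_high = hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (n_points - 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_points)]


if __name__ == "__main__":
    # Round trip: applying (1) and then (2) should recover the input frequency.
    for f in (100.0, 1000.0, 4000.0):
        m = hz_to_mel(f)
        print(f"{f:7.1f} Hz -> {m:8.2f} mel -> {mel_to_hz(m):7.1f} Hz")

    # Points equally spaced in mel between 0 Hz and 8000 Hz (assumed range).
    print([round(f, 1) for f in mel_spaced_frequencies(0.0, 8000.0, 6)])
```

Because (2) is the exact inverse of (1), the round trip recovers the original frequency up to floating-point error. The widening gaps between the mel-spaced points in Hz reflect the coarser frequency resolution of human hearing at higher frequencies.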