Open Access. © 2019 Naima Zerari et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 License.
Open Comput. Sci. 2019; 9:92–102
Research Article Open Access
Naima Zerari*, Samir Abdelhamid, Hassen Bouzgou, and Christian Raymond
Bidirectional deep architecture for Arabic speech
recognition
https://doi.org/10.1515/comp-2019-0004
Received July 20, 2018; accepted March 4, 2019
Abstract: Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of sensory organs. The voice is one of the human senses that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural voice into computer text, as well as to perform actions based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a Neural Network (Multi-Layer Perceptron: MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances produced by two feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC) (static and dynamic features), and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via a classification technique. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniformity of the sequence lengths. Then, a deep architecture based on a recurrent LSTM or GRU (Gated Recurrent Unit) network is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
Keywords: Arabic ASR, digits, TV commands, speech recognition, MFCC, delta-delta, FB, deep learning, LSTM, GRU, MLP
*Corresponding Author: Naima Zerari: Laboratory of Automation
and Manufacturing, Department of Industrial Engineering, Univer-
sity of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria;
E-mail: n.zerari@univ-batna2.dz
1 Introduction
Speech is one of the most direct means of information exchange used by human beings. This advantage has given rise to several developments whose aim is the design of systems that recognize spoken words. Automatic Speech Recognition (ASR) is an active area of study enabling communication between humans and machines; it is the process of understanding human speech by a computer. In this context, Automatic Digit/Command Recognition is considered one of the most challenging domains in ASR. The growing importance of digit/command recognition is mainly due to the increasing demand for applications that deal with human-machine interaction through natural language, such as command systems via pronounced digits [1, 2].
The implementation of these kinds of systems requires a particular processing of the speech signal to provide reliable features from which the input spoken words can be recognized properly. Therefore, a wide range of techniques has been proposed in the literature to represent the speech signal [3]. The most commonly used one is the Mel-Frequency Cepstral Coefficients (MFCC), a popular technique that attempts to mimic some aspects of human speech perception and speech production [4].
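To illustrate the kind of features involved, the following is a minimal NumPy/SciPy sketch of standard MFCC extraction (framing, Hamming window, power spectrum, triangular mel filter bank, log compression, DCT). All parameter values (sampling rate, frame length, number of filters and coefficients) are common illustrative defaults, not necessarily the exact configuration used in this paper.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_mels=26, n_ceps=13):
    """Compute static MFCC features for a 1-D speech signal (illustrative)."""
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Log filter-bank energies (the "FB" features), then DCT to decorrelate
    log_fb = np.log(power @ fbank.T + 1e-10)
    return dct(log_fb, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Keeping `log_fb` directly (without the final DCT) yields the Filter Bank (FB) representation, the second feature type compared in the paper.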
In the present study, the MFCC coefficients obtained from the spoken utterances are fed to a Long Short-Term Memory (LSTM) architecture [5], which handles general sequence-to-sequence problems. The idea is to use a bidirectional LSTM layer within the deep architecture to encode the sequence as a fixed-size vector; this vector is then fed to a Multi-Layer Perceptron (MLP)
Samir Abdelhamid: Laboratory of Automation and Manufacturing,
Department of Industrial Engineering, University of Batna 2 Mostefa
Ben Boulaid, Batna, 05000, Algeria; E-mail: s.abdelhamid@univ-
batna2.dz
Hassen Bouzgou: Department of Industrial Engineering, University
of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria;
E-mail: h.bouzgou@univ-batna2.dz
Christian Raymond: INSA Rennes, IRISA/INRIA, Rennes, France;
E-mail: christian.raymond@irisa.fr
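The encoding stage described above (bidirectional recurrence producing a fixed-size vector, followed by an MLP classifier) can be sketched in plain NumPy as follows. This is a minimal, untrained illustration with random placeholder weights and arbitrary dimensions; it is not the paper's implementation, which would be trained end-to-end in a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell with random (untrained) placeholder weights."""
    def __init__(self, d_in, d_h, rng):
        # One stacked weight matrix for the input, forget, output and cell gates
        self.W = rng.standard_normal((4 * d_h, d_in + d_h)) * 0.1
        self.b = np.zeros(4 * d_h)
        self.d_h = d_h

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g          # new cell state
        return np.tanh(c) * o, c   # new hidden state, new cell state

def encode_bidirectional(seq, fwd, bwd):
    """Run forward and backward LSTMs over seq (shape T x d_in) and
    concatenate the two final hidden states into one fixed-size vector."""
    h_f = c_f = np.zeros(fwd.d_h)
    for x in seq:
        h_f, c_f = fwd.step(x, h_f, c_f)
    h_b = c_b = np.zeros(bwd.d_h)
    for x in seq[::-1]:
        h_b, c_b = bwd.step(x, h_b, c_b)
    return np.concatenate([h_f, h_b])  # same size for any sequence length T

def mlp_classify(v, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a softmax over the word classes."""
    h = np.maximum(W1 @ v + b1, 0.0)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

The key property exploited by the architecture is visible in `encode_bidirectional`: utterances of different lengths (after feature extraction and padding) are mapped to vectors of identical dimension, so a standard MLP can classify them.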