Open Access. © 2019 Naima Zerari et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 License.
Open Comput. Sci. 2019; 9:92–102
Research Article Open Access
Naima Zerari*, Samir Abdelhamid, Hassen Bouzgou, and Christian Raymond
Bidirectional deep architecture for Arabic speech
recognition
https://doi.org/10.1515/comp-2019-0004
Received July 20, 2018; accepted March 4, 2019
Abstract: Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of sensory organs. The voice is one of the human senses that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural voice into computer text, as well as to perform actions based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a Neural Network (Multi-Layer Perceptron: MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances produced by two feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC) (static and dynamic features), and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via a classification technique. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniformity of the sequence lengths. Then, a deep architecture based on a recurrent LSTM or GRU (Gated Recurrent Unit) network is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
Keywords: Arabic ASR, digits, TV commands, speech recognition, MFCC, delta-delta, FB, deep learning, LSTM, GRU, MLP
*Corresponding Author: Naima Zerari: Laboratory of Automation
and Manufacturing, Department of Industrial Engineering, Univer-
sity of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria;
E-mail: n.zerari@univ-batna2.dz
1 Introduction
Speech is one of the most direct means of information exchange used by human beings. This advantage has given rise to several developments whose aim is the design of systems that recognize spoken words. Automatic Speech Recognition (ASR) is an active area of study enabling communication between humans and machines; it is the process of understanding human speech by a computer. In this context, Automatic Digit/Command Recognition is considered one of the most challenging domains in ASR. The growing importance of digit/command recognition is mainly due to the increasing demand for applications that deal with human-machine interaction through natural language, such as command systems via pronounced digits [1, 2].
The implementation of these kinds of systems requires a particular processing of the speech signal to provide reliable features from which the input spoken words can be recognized properly. Therefore, a wide range of techniques has been proposed in the literature to represent the speech signal [3]. The most commonly used one is the Mel-Frequency Cepstral Coefficients (MFCC), a popular technique that attempts to mimic some aspects of human speech perception and speech production [4].
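To illustrate the kind of features involved, the following is a minimal NumPy/SciPy sketch of standard MFCC extraction (framing, Hamming window, power spectrum, triangular mel filter bank, log compression, DCT). All parameter values (sampling rate, frame length, number of filters and coefficients) are common illustrative defaults, not necessarily the exact configuration used in this paper.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_mels=26, n_ceps=13):
    """Compute static MFCC features for a 1-D speech signal (illustrative)."""
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Log filter-bank energies (the "FB" features), then DCT to decorrelate
    log_fb = np.log(power @ fbank.T + 1e-10)
    return dct(log_fb, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Keeping `log_fb` directly (without the final DCT) yields the Filter Bank (FB) representation, the second feature type compared in the paper.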
In the present study, the MFCC coefficients obtained from the spoken utterances are fed to a Long Short-Term Memory (LSTM) architecture [5], which handles general sequence-to-sequence problems. The idea is to use a bidirectional LSTM layer within the deep architecture to encode the sequence as a fixed-size vector; this vector is then fed to a Multi-Layer Perceptron (MLP)
Samir Abdelhamid: Laboratory of Automation and Manufacturing,
Department of Industrial Engineering, University of Batna 2 Mostefa
Ben Boulaid, Batna, 05000, Algeria; E-mail: s.abdelhamid@univ-
batna2.dz
Hassen Bouzgou: Department of Industrial Engineering, University
of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria;
E-mail: h.bouzgou@univ-batna2.dz
Christian Raymond: INSA Rennes, IRISA/INRIA, Rennes, France;
E-mail: christian.raymond@irisa.fr
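The encoding stage described above (bidirectional recurrence producing a fixed-size vector, followed by an MLP classifier) can be sketched in plain NumPy as follows. This is a minimal, untrained illustration with random placeholder weights and arbitrary dimensions; it is not the paper's implementation, which would be trained end-to-end in a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell with random (untrained) placeholder weights."""
    def __init__(self, d_in, d_h, rng):
        # One stacked weight matrix for the input, forget, output and cell gates
        self.W = rng.standard_normal((4 * d_h, d_in + d_h)) * 0.1
        self.b = np.zeros(4 * d_h)
        self.d_h = d_h

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g          # new cell state
        return np.tanh(c) * o, c   # new hidden state, new cell state

def encode_bidirectional(seq, fwd, bwd):
    """Run forward and backward LSTMs over seq (shape T x d_in) and
    concatenate the two final hidden states into one fixed-size vector."""
    h_f = c_f = np.zeros(fwd.d_h)
    for x in seq:
        h_f, c_f = fwd.step(x, h_f, c_f)
    h_b = c_b = np.zeros(bwd.d_h)
    for x in seq[::-1]:
        h_b, c_b = bwd.step(x, h_b, c_b)
    return np.concatenate([h_f, h_b])  # same size for any sequence length T

def mlp_classify(v, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a softmax over the word classes."""
    h = np.maximum(W1 @ v + b1, 0.0)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

The key property exploited by the architecture is visible in `encode_bidirectional`: utterances of different lengths (after feature extraction and padding) are mapped to vectors of identical dimension, so a standard MLP can classify them.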