Int. J. Signal and Imaging Systems Engineering, Vol. 9, Nos. 4/5, 2016
Copyright © 2016 Inderscience Enterprises Ltd.
A microphone array beamforming-based system for
multi-talker speech separation
Adel Hidri* and Hamid Amiri
Laboratoire de Recherche: Signal,
Image et Technologie de l’Information (LR-SITI),
École Nationale d’Ingénieurs de Tunis (ENIT),
BP 37, le Belvédère 1002 Tunis, Tunisia
Email: hidri_adel@yahoo.fr
Email: hamidlamiri@gmail.com
*Corresponding author
Abstract: This paper presents a Multichannel Speech Separation System (MCSS) based on a new
frequency-domain beamforming method. The beamformer exploits the spatial properties of the
source signals using a microphone array, and therefore relies on prior knowledge of the
position of the speakers relative to the array. The proposed beamformer is defined by two
processing steps: the first keeps a unit gain for the desired signal, and the second blocks
the wanted signal and minimises the output power of the interferences within a single step.
To separate multiple speakers, multiple beamformers are used simultaneously: a
beamformer is computed for each source, treating the remaining sources as interferers. We
test and evaluate the proposed MCSS on real recorded mixtures extracted from the ‘Multichannel
In-Car Speech Database’. The experimental results demonstrate the effectiveness of the proposed
system for speech separation, and the speech quality is improved compared with the
state of the art.
Keywords: speech signal; beamforming; microphone arrays; multichannel speech separation;
optimal filtering; spatial filter.
Reference to this paper should be made as follows: Hidri, A. and Amiri, H. (2016) ‘A microphone
array beamforming-based system for multi-talker speech separation’, Int. J. Signal and Imaging
Systems Engineering, Vol. 9, Nos. 4/5, pp.209–217.
Biographical notes: Adel Hidri received his PhD degree in Electrical Engineering in 2014 from the
National Engineering School of Tunis (ENIT), Tunisia. He received the Master's degree in
Electrical Engineering, option: Industrial Automation Compute, from the Higher School of
Science and Technology of Tunis in 2001 and the DEA in Electrical Engineering, Automation and
Digital Signal Processing, from the National Engineering School of Tunis in 2004. He has been a
teacher at the Higher Institute of Multimedia and Arts of Manouba, Tunisia, since 2004. He is currently a
member of LR-SITI. His current research interests include spatial filtering (beamforming),
microphone arrays, source separation, source extraction and noise reduction.
Hamid Amiri received the Diploma of Electrotechnics, Information Technique in 1978 and the
PhD degree in 1983 from the TU Braunschweig, Germany. He obtained the Doctorate’s Sciences in
1993. He was a Professor at the National Engineering School of Tunis, Tunisia, from 1987 to
2001. From 2001 to 2009, he was at the Riyadh College of Telecom and Information. Currently,
he is again at ENIT, where he is the Head of LR-SITI (Research Laboratory: Signal, Image and
Information Technology). His research is focused on image processing, speech processing,
document processing and natural language processing.
1 Introduction
In a multiple-speaker environment, the speech of the speaker of
interest is contaminated by interference and noise, leading
to low intelligibility both for human listeners and for further
advanced processing such as automatic speech recognition
(Sharma and Atkins, 2014; Kacur and Chudy, 2014).
Moreover, practical performance suffers from the
non-stationarity of human speech and the reverberation of the
recording environment. One major challenge in this area is to
extract or separate concurrent speech. To address this issue,
a number of approaches have been proposed (Makino et al.,
2007; Naik and Wang, 2014). They can be classified into
Blind Source Separation (BSS) methods and beamforming
methods (Hidri et al., 2012b). Beamforming techniques have
theoretically shown great potential for extracting the speech
signal of interest (Cohen et al., 2010). The concept of
‘beamforming’ refers to multichannel signal processing