International Journal of Speech Technology
https://doi.org/10.1007/s10772-019-09630-9
Thorough evaluation of TIMIT database speaker identification
performance under noise with and without the G.712 type handset
Musab T. S. Al-Kaltakchi¹ · Raid Rafi Omar Al-Nima² · Mohammed A. M. Abdullah³ · Hikmat N. Abdullah⁴
Received: 19 March 2019 / Accepted: 28 August 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
In this work, a speaker identification system is proposed which employs two feature extraction models, namely the power
normalized cepstral coefficients and the mel frequency cepstral coefficients. Both features are subjected to acoustic modeling
using a Gaussian mixture model–universal background model. The purpose of this work is to provide a thorough evaluation
of the effect of different types of noise on the speaker identification accuracy (SIA), thereby providing benchmark figures
for future comparative studies. In particular, additive white Gaussian noise and eight non-stationary noise types (with
and without the G.712 type handset) are tested at various signal-to-noise ratios. Three late fusion methods are also
employed: maximum, weighted sum, and mean fusion. Speech from 120 randomly selected speakers of the TIMIT database
is used, and the SIA is used to measure the system performance. The weighted
sum fusion gave the best performance in terms of SIA with noisy speech. The proposed model and
its related analysis pave the way for further work in this important area.
Keywords Speaker identification · TIMIT database · Stationary and non-stationary background noise · G.712 type handset
1 Introduction
Several biometric systems have been proposed employing
various traits (Chaki et al. 2019), such as speech (Sun
et al. 2019), fingerprint (Rajeswari et al. 2017), finger
texture (Al-Nima et al. 2017), face (Sghaier et al. 2018),
signature (Morales et al. 2017), human ear and palmprint (Hezil
and Boukrouche 2017), sclera (Alkassar et al. 2015) and iris
pattern (Abdullah et al. 2015).
An important application in biometrics and forensics
is identifying speakers based on their unique voice patterns,
which is known as speaker recognition (Togneri and Pullella
2011). This technique can be applied in many security and
investigation settings, including forensics, remote access
control, web services and online banking (El-Ouahabi et al. 2019).
Traditionally, speaker recognition systems were developed
and tested in a clean speech environment. However, in
many applications of speaker recognition, the speech samples
provided to the system may suffer from different types
of noise. To achieve robust speaker identification,
the effect of noise should be investigated, as noise can
severely degrade the performance of a speaker recognition
system (Ming et al. 2007). According to Verma and Das (2015),
feature extraction within speaker identification should be
less influenced by noise or the person's health.
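To make the notion of a noise condition concrete, the following minimal Python sketch shows one common way to corrupt a clean signal with white Gaussian noise at a chosen signal-to-noise ratio. It is an illustration under stated assumptions, not the procedure used in this work: the function names (`signal_power`, `add_awgn`, `measured_snr_db`), the synthetic 440 Hz test tone and the 16 kHz sampling rate are all hypothetical choices made for the example.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_awgn(clean, snr_db, rng=None):
    """Return `clean` plus white Gaussian noise scaled to the target SNR (dB)."""
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # SNR_dB = 10 * log10(P_signal / P_noise), so the required noise power is
    # P_noise = P_signal / 10 ** (SNR_dB / 10).
    target_noise_power = signal_power(clean) / (10.0 ** (snr_db / 10.0))
    # Rescale the drawn noise so that its empirical power matches the target.
    scale = math.sqrt(target_noise_power / signal_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, recovered from the clean and noisy signals."""
    noise = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))


# Hypothetical test signal: one second of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_awgn(clean, snr_db=10.0, rng=random.Random(0))
```

The same power-based scaling could in principle be applied to a recorded non-stationary noise segment in place of the generated Gaussian samples; any handset filtering would be a separate step that this sketch does not model.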
In this work, we present a thorough evaluation on the
TIMIT database under a wide range of environmental noise
conditions, thereby providing benchmark evaluations for
other researchers working in the speaker identification field.
In summary, our contributions are as follows.
• Eight non-stationary noise (NSN) types, as well as additive
white Gaussian noise (AWGN), are investigated with and without
the G.712 type handset.
• The relation between the speaker identification accuracies
(SIAs) for the eight NSN types and AWGN and the signal-to-noise
ratios (SNRs) is measured.
* Musab T. S. Al-Kaltakchi
musab.tahseen@gmail.com
¹ Department of Electrical Engineering, College of Engineering, Mustansiriyah University, Baghdad, Iraq
² Technical Engineering College of Mosul, Northern Technical University, Mosul, Iraq
³ Computer and Information Engineering Department, College of Electronics Engineering, Ninevah University, Mosul, Iraq
⁴ College of Information Engineering, Al-Nahrain University, Baghdad, Iraq