Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright
law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is
paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For reprint or republication permission, email to IEEE Copyrights
Manager at pubs-permissions@ieee.org. All rights reserved. Copyright ©2020 by IEEE.
Speaker Identification Using a Hybrid CNN-MFCC Approach
Aweem Ashar
Department of Computer Science
Comsats University
Lahore, Pakistan
sp16-bse-118@cuilahore.edu.pk
Muhammad Shahid Bhatti
Department of Computer Science
Comsats University
Lahore, Pakistan
msbhatti@cuilahore.edu.pk
Usama Mushtaq
Department of Computer Science
Comsats University
Lahore, Pakistan
sp16-bse-093@cuilahore.edu.pk
Abstract—In this paper, a novel architecture is proposed
using a convolutional neural network (CNN) and mel
frequency cepstral coefficient (MFCC) to identify the
speaker in a noisy environment. This architecture is used in
a text-independent setting. The most important task in any
text-independent speaker identification system is learning
features that are useful for classification. We use a hybrid
feature extraction technique in which a CNN acts as a
feature extractor and its features are combined with MFCC
features into a single set. For classification, we use a deep
neural network, which shows very promising results in
classifying speakers. We created our own dataset of 60
speakers, with 4 voice samples per speaker. Our best
hybrid model achieved an accuracy of 87.5%. To verify the
effectiveness of this hybrid architecture, we evaluate it
using accuracy and precision.
Keywords—Convolutional Neural Network, Mel Frequency
Cepstral Coefficients, Feature Extraction, Text Independent,
Speaker Identification, Deep Neural Network
I. INTRODUCTION
Voice is a fundamental part of everyday human life. The
question addressed in this paper is: who is speaking?
Speaker identification is the process of automatically
identifying the person speaking in a voice sample. It is a
topic that has gained considerable attention in the research
community. It is a challenging task because every speaker
differs in accent, speaking style, word frequency, and
vocal tract characteristics. The presence of noise,
background chatter, and music makes the task even more
difficult [1]. Conditions such as a faulty recording device
also affect classification accuracy. In a closed-set,
text-independent setting, the voice must come from an
enrolled speaker, and identification does not depend on the
words spoken.
The main approaches include i-vectors [2], [3], hidden
Markov models [4], Gaussian mixture models (GMM) with
a universal background model (UBM) [5], [6], vector
quantization [7], neural networks [8-11], and support
vector machines [12]. These approaches use various types
of datasets for the speaker identification task. Some
datasets are recorded in a laboratory environment with no
noise, while others contain some noise or chatter. Recent
advances use convolutional neural networks for speaker
identification because they handle noisy datasets well and
require no manual feature engineering, since both feature
extraction and classification can be done by the CNN.
Different classification and feature extraction techniques
have been used for speaker identification; one approach
uses CNN for both feature extraction and classification
[13]. Some researchers used CNN only as a feature
extractor, with other classifiers for classification [14].
Researchers have also used MFCC features for speaker
identification, but these are unreliable in noisy
environments [15].
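To make the MFCC pipeline concrete, the following is a minimal numpy-only sketch of MFCC extraction (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II). The parameter values (16 kHz sampling, 25 ms frames with a 10 ms hop, 26 mel filters, 13 coefficients) are common defaults assumed for illustration, not the exact configuration used in this paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum (rfft zero-pads each frame to n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies -> log -> DCT-II
    log_e = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                  (2 * n + 1) / (2 * n_filters)))
    return log_e @ dct.T  # shape: (n_frames, n_coeffs)
```

In practice a library such as librosa or python_speech_features would be used instead; the sketch only shows why MFCCs degrade under noise: additive noise perturbs the power spectrum directly, and the log compression propagates that distortion into every coefficient.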
Speaker identification has many applications, such as
authenticating speakers in telephone banking, controlling
access to confidential information, information services,
access to remote computers, and database access services.
Speaker identification still has many milestones to reach
before it becomes state of the art. Once it does, it could
replace existing verification systems; it can also be used
to add an extra layer of security to an existing system.
Fig. 1. Components for speaker identification.
Fig. 1 shows the block diagram of the whole speaker
identification system. The proposed system takes the
dataset, pre-processes the voice samples to reduce noise
and remove silence, generates spectrograms of the voices,
uses a CNN as a feature extractor, combines the CNN
features with MFCC features, applies a feature selection
technique to the combined set, and passes the selected
features to a DNN for classification. The fundamental
objective of this paper is to design and implement a
speaker recognition system using a neural network.
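The fusion and selection steps of this pipeline can be sketched as follows. The feature dimensions (a 128-dimensional CNN embedding, a 13-dimensional mean MFCC vector per utterance) and the variance-based ranking are illustrative assumptions standing in for the actual feature extractor and selection technique, which later sections specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features: in the real system these would come from the
# CNN's penultimate layer and from MFCC extraction, respectively.
cnn_feats = rng.normal(size=(240, 128))   # 60 speakers x 4 samples
mfcc_feats = rng.normal(size=(240, 13))   # mean MFCC vector per sample

# Step 1: fuse both feature sets into a single vector per sample
hybrid = np.concatenate([cnn_feats, mfcc_feats], axis=1)  # (240, 141)

# Step 2: keep the k highest-variance features (a simple stand-in for
# the feature selection technique applied in the proposed system)
k = 64
top_k = np.argsort(hybrid.var(axis=0))[::-1][:k]
selected = hybrid[:, top_k]  # input to the DNN classifier
```

The selected matrix, paired with one speaker label per row, is what the DNN classifier would train on.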
The remaining sections cover the main components of this
approach: the dataset and pre-processing (Section II), the
proposed approach (Section III), results (Section IV), and
the conclusion (Section V).
II. DATASET AND PRE-PROCESSING
The following subsections describe the dataset, its
characteristics, and the pre-processing of the voice
samples.