Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For reprint or republication permission, email the IEEE Copyrights Manager at pubs-permissions@ieee.org. All rights reserved. Copyright ©2020 by IEEE.

Speaker Identification Using a Hybrid CNN-MFCC Approach

Aweem Ashar
Department of Computer Science, Comsats University, Lahore, Pakistan
sp16-bse-118@cuilahore.edu.pk

Muhammad Shahid Bhatti
Department of Computer Science, Comsats University, Lahore, Pakistan
msbhatti@cuilahore.edu.pk

Usama Mushtaq
Department of Computer Science, Comsats University, Lahore, Pakistan
sp16-bse-093@cuilahore.edu.pk

Abstract—In this paper, a novel architecture combining a convolutional neural network (CNN) and mel-frequency cepstral coefficients (MFCC) is proposed to identify speakers in a noisy environment. The architecture is used in a text-independent setting. The most important requirement of any text-independent speaker identification system is its capability to learn features that are useful for classification. We use a hybrid feature extraction technique in which a CNN acts as a feature extractor and its output is combined with MFCC features into a single set. For classification, we use a deep neural network, which shows very promising results in classifying speakers. We built our own dataset of 60 speakers, with 4 voice samples per speaker. Our best hybrid model achieved an accuracy of 87.5%. To verify the effectiveness of this hybrid architecture, we use metrics such as accuracy and precision.

Keywords—Convolutional Neural Network, Mel Frequency Cepstral Coefficients, Feature Extraction, Text Independent, Speaker Identification, Deep Neural Network

I. INTRODUCTION

Voice is a fundamental part of everyday human life. The question we address in this paper is: who is speaking? Speaker identification is the process of automatically determining which individual is talking in a voice sample. It has gained a lot of attention in the research community, and it is a challenging task because every speaker differs in accent, speaking style, word frequency, and vocal tract characteristics. The presence of noise, background chatter, and music makes the task even more difficult [1], and conditions such as a faulty recording device also affect classification accuracy. In a closed-set, text-independent setting, the voice must come from an enrolled speaker, and identification does not depend on the words the speaker says. The main approaches include i-vectors [2], [3], hidden Markov models [4], Gaussian mixture models (GMM) with a universal background model (UBM) [5], [6], vector quantization [7], neural networks [8]-[11], and support vector machines [12]. These approaches use various types of datasets for the speaker identification task: some are recorded in a quiet laboratory environment, while others contain some noise or chatter. Recent advances use convolutional neural networks for speaker identification because they handle noisy datasets well and remove the need for manual feature engineering, since a CNN can perform both feature extraction and classification. Several feature extraction and classification techniques have been applied to speaker identification: one uses a CNN for both feature extraction and classification [13], while others use a CNN only as a feature extractor with a separate classifier [14]. MFCC features have also been used for speaker identification, but they are unreliable in noisy environments [15].
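As background for the MFCC features discussed above, the standard MFCC computation can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; the frame size, hop length, FFT size, and filter counts are assumed values typical for 16 kHz speech:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    # Split the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the filterbank axis; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_energy @ dct.T

# Example: one second of synthetic 440 Hz audio at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # (98, 13): one 13-coefficient vector per frame
```

The resulting matrix of per-frame coefficient vectors is the MFCC feature set that the hybrid approach concatenates with CNN-derived features.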
Speaker identification has many applications, such as authenticating speakers in telephone banking, access control for confidential information, information services, access to remote computers, and database access services. The task still has several milestones to reach before it is mature enough to replace existing verification systems; in the meantime, it can add an extra layer of security to an existing system.

Fig. 1. Components for speaker identification.

Fig. 1 shows the block diagram of the complete speaker identification system. The proposed system takes the dataset, pre-processes the voice samples (reducing noise and removing silence), generates spectrograms of the voices, uses a CNN as a feature extractor, combines the CNN features with MFCC features, applies a feature selection technique to the combined set, and passes the selected features to a DNN for classification. The fundamental objective of this paper is to design and implement a speaker recognition system using a neural network. The following sections cover the main components of this approach: the dataset and pre-processing (Sec. II), the proposed approach (Sec. III), results (Sec. IV), and the conclusion (Sec. V).

II. DATASET AND PRE-PROCESSING

The following subsections describe the dataset, its characteristics, and the pre-processing of the voice samples.
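The silence-removal step of the pipeline can be sketched with a simple frame-energy threshold. This is an assumed method shown only for illustration (the threshold, frame size, and hop are hypothetical parameters; the paper's actual pre-processing is described in the subsections that follow):

```python
import numpy as np

def remove_silence(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Drop frames whose energy falls below a dB threshold relative to the loudest frame."""
    # Split the signal into overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Per-frame energy in dB (small constant avoids log of zero)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # Keep only frames within threshold_db of the loudest frame
    keep = energy_db > energy_db.max() + threshold_db
    return frames[keep].reshape(-1)  # concatenate the voiced frames

# Example: a noise burst surrounded by silence
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(8000), rng.standard_normal(8000), np.zeros(8000)])
voiced = remove_silence(sig)
print(len(voiced) < len(sig))  # True: the silent frames were dropped
```

A real system would typically smooth the frame decisions (e.g., keep short silent gaps inside speech) before passing the result to spectrogram and MFCC extraction.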