Published By: Blue Eyes Intelligence Engineering & Sciences Publication. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-2, December 2019. Retrieval Number: B7647129219/2019©BEIESP. DOI: 10.35940/ijitee.B7647.129219

Abstract: State-of-the-art speaker recognition systems use speech recorded with an acoustic microphone to identify or verify a speaker. A multimodal speaker recognition system draws on input data recorded from several sources, such as an acoustic microphone, array microphone, throat microphone, bone microphone, and video recorder. In this paper we implement a multimodal speaker identification system that takes three modalities of speech as input, recorded with an air microphone, a throat microphone, and a bone microphone. We propose an alternate way of recording bone-conducted speech using a throat microphone, and we present the results of a speaker recognition system implemented using a CNN and spectrograms. The results support our claim that the throat microphone is a suitable microphone for recording bone-conducted speech. The accuracy of the speaker recognition system using speech recorded from the air microphone improves by about 10% when throat and bone speech are included along with the air-conducted speech.

Keywords: Throat Speech, Bone Speech, Speaker Identification, CNN, Multi-modal Speaker Recognition.

I. INTRODUCTION
Automatic speaker recognition (ASR) is the use of machines to identify or recognize a speaking person from speech information. ASR has been a research interest for many decades, and the continual change in the technologies it uses is what keeps the research challenging. The challenges lie in the feature extraction techniques, the speaker modeling, and the decision-making techniques.
The features depict the identity of the speaking person, and modeling the features yields a representation of the speaker; these models are then used to identify or recognize the speaker. The pipeline of the ASR system involves speech data collection, feature extraction, model training, model testing, and evaluation, as shown in Fig. 1. The performance of the ASR system depends on the techniques and technologies used in each step of the pipeline. The quality of the speech depends on the recording device and the ambience of the recording environment. The air microphone picks up sound vibrations in the air, whereas the throat microphone picks up sound vibrations near the vocal cords and the bone microphone picks up sound vibrations from bones such as the skull. The air microphone (AM) signals contain environmental background noise. The throat microphone (TM) and bone-conducted (BC) signals are recorded in contact with the skin or surface and are free of background noise.

Revised Manuscript Received on December 05, 2019.
* Correspondence Author Khadar Nawas K*, SCSE, Vellore Institute of Technology, Chennai, India.
* Correspondence Author A Nayeemulla Khan, SCSE, Vellore Institute of Technology, Chennai, India.

Fig. 1: ASR System Pipeline

Air Microphone (AM)
Condenser microphone speech is commonly used in speech processing studies. Such data are referred to as air-conduction speech: a condenser microphone captures vibrations traveling through the air and converts them to speech signals. AM speech is affected by background noise, which reduces its intelligibility, but it contains information across the full range from lower to higher frequencies.
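The abstract names spectrograms as the input to the CNN, and the air-conducted signal described above spans the full frequency range. As a minimal sketch of the feature extraction step, the following computes a log-magnitude spectrogram with NumPy. The 16 kHz sampling rate, 25 ms Hann window, 10 ms hop, the function name `log_spectrogram`, and the synthetic test tone are all illustrative assumptions, not settings reported in this paper.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, win_ms=25.0, hop_ms=10.0):
    """Return a log-magnitude spectrogram of shape (frames, freq_bins).

    All default parameters are assumed values chosen for illustration.
    """
    win = int(round(sample_rate * win_ms / 1000.0))  # samples per frame
    hop = int(round(sample_rate * hop_ms / 1000.0))  # samples between frame starts
    if len(signal) < win:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    # Slice the signal into overlapping frames and taper each with the window.
    frames = np.stack(
        [signal[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of every frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; the small constant avoids log(0).
    return 20.0 * np.log10(magnitude + 1e-10)

# A synthetic one-second 1 kHz tone stands in for a recorded utterance.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = log_spectrogram(tone, sample_rate=sr)
bin_hz = sr / (2 * (spec.shape[1] - 1))           # width of one frequency bin
peak_bin = int(np.argmax(spec.mean(axis=0)))      # strongest bin on average
peak_hz = peak_bin * bin_hz
print(spec.shape, peak_hz)
```

Each row of the returned array is one time frame and each column one frequency bin, so the array can be treated as a single-channel image. Spectrograms of AM, TM, and BC recordings computed this way could be compared directly to see which frequency bands each microphone preserves.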
Throat Microphone (TM)
The throat microphone uses a piezoelectric transducer, positioned near the larynx in contact with the skin of the throat, to sense vocal cord vibration. It collects the speech signals carried by the sound vibrations along with the larynx tone. Because of its skin contact, it is less prone to environmental noise than a conventional microphone, which senses differences in air pressure and therefore also captures environmental noise. Throat microphone speech is less intelligible because the skin and muscles in the larynx region filter out the higher frequencies, yet the signal still carries the speaker's characteristic features. The spectral features of some sound units differ from those of the same units in normal microphone speech. TM speech has a few distinctive spectral features compared with AM speech, and such spectral characteristics could be used to construct a speaker recognition system [1]. In TM and AM speech, the spectral characteristics of certain sounds appear to complement one another. The existence of such complementary speaker-specific spectral features in both signals results in increased efficiency of speaker recognition systems.

A CNN based Speaker Recognition System using an Alternate Bone Microphone
Khadar Nawas K, A Nayeemulla Khan

Bone Speech