IDENTIFICATION OF INDIAN LANGUAGES USING GHOST-VLAD POOLING

Krishna D N, Ankita Patil, M.S.P Raj, Sai Prasad H S, Prabhu Aashish Garapati
Youplus India, Bangalore
{krishna, raj, ankita, saiprasad, aashish}@youplus.com

ABSTRACT

In this work, we propose a new pooling strategy for language identification, with a focus on Indian languages. The idea is to obtain utterance level features from audio of any length for robust language recognition. We use the GhostVLAD approach to generate an utterance level feature vector for any variable length input audio by aggregating the local frame level features across time. The generated feature vector is shown to be highly language discriminative and helps in achieving state of the art results on the language identification task. We conduct our experiments on 635 hours of audio data for 7 Indian languages. Our method outperforms the previous state of the art x-vector [11] method by an absolute improvement of 1.88% in F1-score and achieves 98.43% F1-score on the held-out test data. We compare our system with various pooling approaches and show that GhostVLAD is the best pooling approach for this task. We also provide a visualization of the utterance level embeddings generated using Ghost-VLAD pooling and show that this method creates embeddings that are highly language discriminative.

Index Terms: Indian language identification, GhostVLAD, Pooling methods.

1. INTRODUCTION

The goal of language identification is to classify a given audio signal into a particular language class using a classification algorithm. Language identification has commonly been performed using i-vector systems [1]. A very well known approach for language identification, proposed by N. Dehak et al. [1], uses the GMM-UBM model to obtain utterance level features called i-vectors.
Recent advances in deep learning [15,16] have helped to improve language identification through many different neural network architectures, which can be trained efficiently on GPUs for large scale datasets. These neural networks can be configured in various ways to obtain better accuracy on the language identification task. Early work on using deep learning for language identification was published by Pavel Matejka et al. [2], who used stacked bottleneck features extracted from deep neural networks and showed that the bottleneck features learned by deep neural networks are better than simple MFCC or PLP features. Later, the work by I. Lopez-Moreno et al. [3] from Google showed how to use deep neural networks to directly map a sequence of MFCC frames to its language class, so that language identification can be applied at the frame level. Speech signals carry both spatial and temporal information, but simple DNNs are not able to capture temporal information. J. Gonzalez-Dominguez et al. [4] at Google developed an LSTM based language identification model which improves the accuracy over the DNN based models. Alicia et al. [5] used CNNs to improve upon i-vector [1] and other previously developed systems. Daniel Garcia-Romero et al. [6] combined an acoustic model trained for speech recognition with time-delay neural networks (TDNNs): they train the TDNN model by feeding it the stacked bottleneck features from the acoustic model to predict the language labels at the frame level. More recently, x-vectors [7] were proposed for the speaker identification task, were shown to outperform all the previous state of the art speaker identification algorithms, and were also used for language identification by David Snyder et al. [8].

In this paper, we explore multiple pooling strategies for the language identification task. Mainly, we propose a Ghost-VLAD based pooling method for language identification.
Inspired by the recent work by W. Xie et al. [9] and Y. Zhong et al. [10], we use Ghost-VLAD to improve the accuracy of language identification for Indian languages. We explore multiple pooling strategies, including NetVLAD pooling [11], average pooling, and statistics pooling (as proposed in x-vectors [7]), and show that Ghost-VLAD pooling is the best pooling strategy for language identification. Our model obtains the best accuracy of 98.24%, and it outperforms all the other previously proposed pooling methods. We conduct all our experiments on 635 hours of audio data for 7 Indian languages collected from the All India Radio news channel¹.

The paper is organized as follows. In Section 2, we explain the proposed pooling method for language identification. In Section 3, we explain our dataset. In Section 4, we describe the experiments, and in Section 5, we describe the results.

¹ http://www.newsonair.com/

arXiv:2002.01664v1 [cs.CL] 5 Feb 2020
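To make the role of pooling concrete, the following is a minimal sketch of the two simplest baselines named above, average pooling and statistics pooling. It is an illustration under stated assumptions, not the implementation used in our experiments: the function names, the random stand-in features, the feature dimension of 8, and the use of the population standard deviation are all hypothetical choices. The point it shows is that utterances with different numbers of frames are mapped to utterance level vectors of the same size.

```python
import random
from statistics import fmean, pstdev


def average_pooling(frames):
    """Mean of each feature dimension over time: (T, D) -> (D,)."""
    return [fmean(column) for column in zip(*frames)]


def statistics_pooling(frames):
    """Per-dimension mean and standard deviation, concatenated: (T, D) -> (2D,).

    Assumption: the population standard deviation is used here; a real
    system may use a different estimator.
    """
    columns = list(zip(*frames))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return means + stds


def fake_frame_features(num_frames, dim, seed):
    """Hypothetical stand-in for frame level features from a neural network."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


# Two utterances of different duration (different number of frames T).
short_utt = fake_frame_features(num_frames=120, dim=8, seed=0)
long_utt = fake_frame_features(num_frames=950, dim=8, seed=1)

for name, utt in [("short", short_utt), ("long", long_utt)]:
    print(
        name, len(utt), "frames ->",
        len(average_pooling(utt)), "dims (average),",
        len(statistics_pooling(utt)), "dims (statistics)",
    )
```

Both baselines treat every frame equally. NetVLAD and Ghost-VLAD pooling differ in that they aggregate frame level features relative to a set of learned cluster centers.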