Using Deep Neural Networks for Identification of Slavic Languages from Acoustic Signal

Lukas Mateju, Petr Cerva, Jindrich Zdansky, Radek Safarik

Faculty of Mechatronics, Informatics and Interdisciplinary Studies, Technical University of Liberec, Studentska 2, 461 17 Liberec, Czech Republic
{lukas.mateju, petr.cerva, jindrich.zdansky, radek.safarik}@tul.cz

Abstract

This paper investigates the use of deep neural networks (DNNs) for the task of spoken language identification. Various feed-forward fully connected, convolutional and recurrent DNN architectures are adopted and compared against a baseline i-vector based system. Moreover, DNNs are also utilized for extraction of bottleneck features from the input signal. The dataset used for experimental evaluation contains utterances belonging to languages that are all related to each other and sometimes hard to distinguish even for human listeners: it is compiled from recordings of the 11 most widespread Slavic languages. We have also released this Slavic dataset to the general public, because a similar collection is not publicly available from any other source. The best results were yielded by a bidirectional recurrent DNN with gated recurrent units fed by bottleneck features. In this case, the baseline error rate (ER) was reduced from 4.2% to 1.2% and Cavg from 2.3% to 0.6%.

Index Terms: language identification, Slavic languages, deep neural networks, convolutional neural networks, recurrent neural networks

1. Introduction

Spoken language identification (LID) is the task of correctly determining the language spoken in a speech utterance. In recent years, many scientific efforts have been dedicated to this task, and nowadays, LID modules form an integral part of many speech processing applications including, e.g., systems for multilingual speech recognition or spoken language translation. LID systems are also used for spoken document retrieval, emergency call-routing or in dialog systems.
Although the accuracy of all these systems is constantly improving, it is still not perfect. For example, one of the significant weaknesses of LID systems is distinguishing between closely related languages.

Most state-of-the-art LID systems utilize various advanced acoustic modeling techniques. One of the most popular relies on total variability factor analysis and is known as the i-vector framework [1, 2]. An i-vector is a fixed-length representation of an utterance that jointly captures information about the speaker, language, etc. (e.g., LDA may be applied afterwards to obtain discriminative features). To extract i-vectors, hand-crafted shifted delta cepstral (SDC) features derived from mel-frequency cepstral coefficients (MFCCs) [3] and phone log-likelihood ratios (PLLRs) [4] are most commonly used as inputs. I-vector extraction is usually followed by a classification stage employing multiclass logistic regression, cosine scoring or Gaussian models. The major drawback of the i-vector approach is its decreasing performance on shorter test utterances [5].

Over the past few years, deep neural networks have surged in popularity in LID systems thanks to their outstanding performance in many other speech processing applications (e.g., speech recognition [6]). Both direct and indirect approaches exist for utilizing deep learning for LID. In the indirect case, so-called bottleneck features (BTNs) are widely used in many systems [7, 8, 9] due to their superior performance. These features are usually extracted from a DNN first trained to discriminate between the individual physical states of a tied-state triphone model, and then used as inputs to an i-vector based system [10, 11].

In the direct case, various end-to-end systems based on different DNN architectures are trained to identify the language of the input utterance.
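As a concrete illustration of the SDC inputs mentioned above, the following is a minimal numpy sketch of one common convention, parameterized as N-d-P-k (typically 7-1-3-7): each output frame stacks k delta blocks of N coefficients, spaced P frames apart, with deltas computed over a +/-d window. The function name and the clamping of out-of-range frames to the signal boundary are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: stack k delta blocks spaced P frames apart.

    cepstra: (T, >=N) array of cepstral frames; the first N coefficients
    are used. Returns an array of shape (T, N * k). Indices that fall
    outside the signal are clamped to the valid range (one common
    edge-handling convention; others zero-pad instead).
    """
    T = cepstra.shape[0]
    c = cepstra[:, :N]
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            a = min(max(t + i * P + d, 0), T - 1)  # advanced frame index
            b = min(max(t + i * P - d, 0), T - 1)  # delayed frame index
            out[t, i * N:(i + 1) * N] = c[a] - c[b]
    return out
```

With the default 7-1-3-7 setting, each 13-dimensional MFCC frame is thus mapped to a 49-dimensional SDC vector, which is the usual front end for the i-vector baseline.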
In 2014, a feed-forward DNN yielded excellent results on short utterances (less than 3 seconds) [5]. Since then, other more advanced architectures, such as attention-based DNNs [12], convolutional neural networks (CNNs) [13, 14, 15], time delay neural networks (TDNNs) [16, 17] or sequence summarizing neural networks (SSNNs) [18], have also been successfully used. The most recent direct approaches take advantage of recurrent neural networks (RNNs) and their ability to model context. Gated recurrent unit (GRU) RNNs [19], long short-term memory (LSTM) RNNs [20, 21, 22, 23, 24] and bidirectional LSTM RNNs [25, 26] all yield state-of-the-art performance.

In this paper, various state-of-the-art LID methods are investigated. We first adopt feed-forward DNNs, then CNNs, and finally unidirectional as well as bidirectional RNNs with both of the aforementioned types of units. We also combine these direct methods with the indirect approach by feeding the networks with bottleneck features. To the best of our knowledge, results of some of these approaches, and their comparison on a single dataset, have not yet been published for LID.

The experimental evaluation is performed on a dataset consisting of the 11 most widespread Slavic languages. These were selected for two main reasons. The first is that most of these languages are related to each other, which makes our dataset more challenging. This is especially true for pairs of languages belonging to the same language branch. For example, it is difficult to distinguish between Croatian and Serbian (both from the South Slavic branch), even for native speakers. Secondly, only results for several Slavic languages (or pairs of them) have been published so far (e.g., [27]). For example, Polish and Russian formed one cluster of related languages within the last Language Recognition Evaluation (LRE) challenge in 2017 [28].
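For reference, the GRU units and the bidirectional pass used by the best-performing system can be sketched in a few lines of numpy. This is an illustrative toy implementation following the standard GRU gate equations of Cho et al. (update-gate sign conventions vary between libraries), not the trained architecture from the experiments:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step. params holds the weight matrices (Wz, Uz, Wr, Ur,
    Wh, Uh) and bias vectors (bz, br, bh)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return z * h + (1.0 - z) * h_tilde             # interpolate old/new

def bidirectional_gru(X, params_fwd, params_bwd, hidden):
    """Run a GRU over X of shape (T, D) in both time directions and
    concatenate the two final states into a single utterance vector."""
    T = X.shape[0]
    hf = np.zeros(hidden)
    hb = np.zeros(hidden)
    for t in range(T):
        hf = gru_step(X[t], hf, params_fwd)          # left to right
        hb = gru_step(X[T - 1 - t], hb, params_bwd)  # right to left
    return np.concatenate([hf, hb])
```

In an end-to-end LID system, a vector like this (or the per-frame states) would be passed to a softmax layer over the candidate languages; feeding X with bottleneck features instead of raw acoustic features corresponds to the combined direct/indirect setup studied here.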
In contrast, this work presents a detailed analysis of all evaluated Slavic languages using a confusion matrix. Finally, note that our dataset of Slavic languages is available for download by the general public at https://owncloud.cesnet.cz/index.php/s/gXHKFs9UDEqe34G.

Interspeech 2018, 2-6 September 2018, Hyderabad. DOI: 10.21437/Interspeech.2018-1165