Class-wise Centroid Distance Metric Learning for Acoustic Event Detection

Xugang Lu 1, Peng Shen 1, Sheng Li 1, Yu Tsao 2, Hisashi Kawai 1
1 National Institute of Information and Communications Technology, Japan
2 Research Center for Information Technology Innovation, Academia Sinica, Taiwan
xugang.lu@nict.go.jp

Abstract

Designing good feature extraction and classifier models is essential for obtaining high performance in acoustic event detection (AED) systems. Current state-of-the-art algorithms are based on deep neural network models that jointly learn the feature representation and classifier models. As a typical pipeline in these algorithms, several network layers with nonlinear transforms are stacked for feature extraction, and a classifier layer with a softmax transform is applied on top of these extracted features to obtain normalized probability outputs. This pipeline is directly connected to the final goal of class discrimination without explicitly considering how the features should be distributed for inter-class and intra-class samples. In this paper, we explicitly add a distance metric constraint to the feature extraction process with the goal of reducing intra-class sample distances and increasing inter-class sample distances. Rather than estimating the pair-wise distances of samples, the distances are efficiently calculated between samples and class cluster centroids. With this constraint, the learned features have a good property for improving the generalization of the classification models. AED experiments on an urban sound classification task were carried out to test the algorithm. Results showed that the proposed algorithm efficiently improved performance over current state-of-the-art deep learning algorithms.

Index Terms: acoustic event detection, distance metric learning, class centroids, convolutional neural network.
1. Introduction

Acoustic scene and event detection (AED) is important for audio content analysis and audio information retrieval [1, 2, 3, 4, 5, 6]. Most AED algorithms follow a typical pipeline of feature extraction and classifier modeling. Designing discriminative features is essential for obtaining good AED performance, especially for the generalization ability of the systems. Following successful applications of the deep learning (DL) framework in image and speech processing and recognition, the DL framework has also been applied to AED tasks. Its advantage is that discriminative features and classifiers can be learned automatically in a joint learning framework. Many models with various types of network architectures have been proposed in the DL framework. For example, the convolutional neural network (CNN) model can explore temporal- and/or frequency-shift invariant features for AED [7, 8, 9]. The recurrent neural network (RNN) model can extract long temporal-context information in feature representation for classification. With long short-term memory (LSTM) units [10] or gated recurrent units (GRU) [11], the RNN can be efficiently trained for AED. Models that combine the advantages of the CNN and RNN have also been proposed, e.g., the convolutional recurrent neural network (CRNN) model, where the CNN is used to explore frequency-shift invariant features while the RNN is used to model the temporal structure for classification [12, 13].

In the DL framework for AED, there are two basic steps in modeling: one is how to encode acoustic signals of various time durations into fixed-dimension feature vectors, and the other is how to design a classifier to model these encoded feature vectors for classification.
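The two steps above can be sketched as follows. This is a minimal toy illustration, not the network described in the paper: mean pooling stands in for the stacked nonlinear layers of a real encoder, and the classifier is a single hypothetical linear-plus-softmax layer with random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encode(frames):
    # Step 1: collapse a (T, D) frame sequence of arbitrary duration T
    # into one fixed-dimension vector (mean pooling as a placeholder
    # for a learned encoder).
    return frames.mean(axis=0)

# Two clips of different durations, 40-dim frame features (toy data).
clips = [rng.normal(size=(101, 40)), rng.normal(size=(57, 40))]

# Step 2: a hypothetical classifier layer for 10 event classes.
W = rng.normal(size=(40, 10))
b = np.zeros(10)

feats = np.stack([encode(c) for c in clips])   # (2, 40), fixed size
probs = softmax(feats @ W + b)                 # (2, 10), rows sum to 1
print(probs.shape)
```

Whatever the encoder architecture, the key property is that `feats` has a fixed dimension regardless of clip length, so one classifier can serve all inputs.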
Correspondingly, in most DL based algorithms, a feature processing module that stacks several neural network layers with nonlinear transforms is used for feature extraction, and a classifier module with a softmax transform is applied on top of these extracted features to obtain normalized probability outputs. The optimization goal is directly tied to the final classification accuracy on a training data set. Because the optimization targets only the training-set classification accuracy, there is no guarantee that the extracted features are discriminative for a test set. Since the features are only intermediate outputs of the optimization, there is no explicit constraint on how they should be distributed. In consequence, the learned models easily overfit to training data sets and generalize weakly to testing sets. Therefore, explicit constraints should be imposed on the feature extraction process in the DL framework. Intuitively, the distribution of discriminative features should have small intra-class variation and large inter-class variation. Based on this intuition, several algorithms have been designed to combine distance metric learning with feature extraction. For example, in a large category of machine learning, feature learning takes into account intra- and inter-class pair-wise distance measurements [14, 15, 16]. In the DL framework, nonlinear distance metric learning has been proposed for different applications [17, 18, 19, 20, 21]; these all take a similar approach to feature extraction based on the pair-wise Siamese network models originally proposed in [23, 24, 19]. As a further generalization of the pair-wise Siamese network idea, the triplet loss was proposed [22]. Most of these algorithms learn features by considering the feature distances or relations of intra-class (positive) and inter-class (negative) samples.
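For concreteness, the triplet loss mentioned above can be sketched in a few lines. This is a generic illustration of the loss in [22], not code from this paper; the margin value is an arbitrary choice for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on squared Euclidean distances: pull the positive
    # (same-class) sample toward the anchor, and push the negative
    # (other-class) sample at least `margin` farther away.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same class, already close: small d_pos
n = np.array([2.0, 0.0])   # other class, already far: large d_neg
print(triplet_loss(a, p, n))  # constraint satisfied -> 0.0
```

The computational burden criticized in the next paragraph comes from sampling: a data set of N samples admits O(N^3) such triplets, so mining informative pairs or triplets dominates training cost.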
To reduce the large computational complexity caused by the large number of pair-wise or triplet-wise sample combinations, the center loss based algorithm [25] was proposed for discriminative feature extraction. In this algorithm, the intra-class center loss is applied as a constraint in discriminative feature extraction. The center loss is defined on the distances between intra-class samples and their own class centroids, where each centroid is the average of all samples in a class. Inspired by this center loss based idea, in our algorithm for AED, we explicitly add a distance metric constraint in feature extraction with the goal of reducing intra-class sample distances and increasing inter-class sample distances. The distances are efficiently calculated between samples and class cluster centroids. With this constraint, the learned features have a good property for improving the generalization of the classification model. Our contributions

Copyright 2019 ISCA. INTERSPEECH 2019, September 15–19, 2019, Graz, Austria. http://dx.doi.org/10.21437/Interspeech.2019-2271
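The centroid-based distance terms can be illustrated as follows. This is a minimal sketch of the idea, under the assumption of a simple squared-Euclidean distance; it is not the paper's exact training objective, only the two quantities such an objective would trade off: for N samples and K classes it needs O(NK) distances rather than O(N^2) pair-wise ones.

```python
import numpy as np

def centroid_distance_terms(feats, labels, n_classes):
    # Class centroids: the mean feature vector of each class
    # (the "class cluster centroids" of the constraint).
    centroids = np.stack([feats[labels == k].mean(axis=0)
                          for k in range(n_classes)])
    # Squared distance from every sample to every centroid: (N, K).
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    idx = np.arange(len(feats))
    intra = d2[idx, labels].mean()        # samples to their own centroid
    mask = np.ones_like(d2, dtype=bool)
    mask[idx, labels] = False
    inter = d2[mask].mean()               # samples to the other centroids
    return intra, inter

feats = np.array([[0.0, 0.1], [0.0, -0.1],    # class 0 near the origin
                  [3.0, 0.1], [3.0, -0.1]])   # class 1 near (3, 0)
labels = np.array([0, 0, 1, 1])
intra, inter = centroid_distance_terms(feats, labels, 2)
# A training objective would drive `intra` down and `inter` up.
print(intra < inter)  # True for these well-separated toy features
```

In training, the centroids themselves would be updated as the features change (as in the center loss of [25]); the sketch computes them from a single batch for clarity.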