Unsupervised Acoustic Segmentation and Clustering using Siamese Network Embeddings

Saurabhchand Bhati†, Shekhar Nayak‡, K. Sri Rama Murty‡, Najim Dehak†
†Center for Language and Speech Processing, The Johns Hopkins University, USA
‡Department of Electrical Engineering, IIT Hyderabad, India
sbhati1@jhu.edu, ee13p1008@iith.ac.in, ksrm@iith.ac.in, ndehak3@jhu.edu

Abstract

Unsupervised discovery of acoustic units from the raw speech signal forms the core objective of zero-resource speech processing. It involves identifying the acoustic segment boundaries and consistently assigning unique labels to acoustically similar segments. In this work, the possible candidates for segment boundaries are identified in an unsupervised manner from the kernel Gram matrix computed from the Mel-frequency cepstral coefficients (MFCC). These segment boundary candidates are used to train a siamese network that is intended to learn embeddings that minimize intra-segment distances and maximize inter-segment distances. The siamese embeddings capture phonetic information from longer contexts of the speech signal and enhance inter-segment discriminability. These properties make the siamese embeddings better suited for acoustic segmentation and clustering than the raw MFCC features. The Gram matrix computed from the siamese embeddings provides unambiguous evidence for boundary locations. The initial candidate boundaries are refined using this evidence, and siamese embeddings are extracted for the new acoustic segments. A graph-growing approach is used to cluster the siamese embeddings, and a unique label is assigned to acoustically similar segments. The performance of the proposed method for acoustic segmentation and clustering is evaluated on the Zero Resource 2017 database.

Index Terms: zero-resource speech processing, representation learning, spoken term discovery, siamese network

1. Introduction

The advent of deep learning techniques, together with improved computational resources, has led to significant improvements in the performance of automatic speech recognition (ASR) systems [1]. Most state-of-the-art ASR systems require thousands of hours of manually transcribed speech data [2, 3], as well as a lexicon and pronunciation dictionary [4]. These recent technological advances cannot be applied to under-resourced languages, for example, regional languages, for which transcribed speech data are not readily available. Hence, there is a pressing need to explore alternate methods for developing speech interfaces for under-resourced languages.

Such approaches, commonly referred to as zero-resource speech processing [5, 6], have several applications, which include, but are not limited to, building speech interfaces in under-resourced languages and preserving endangered languages. At the core of zero-resource speech processing lies the unsupervised discovery of linguistic units from the raw speech waveform [7–16]. Linguistic unit discovery, in turn, involves segmenting the speech waveform into acoustically homogeneous regions and consistently assigning unique labels to segments with similar acoustic properties. Hence, the representation of the speech signal plays a vital role in the unsupervised discovery of linguistic units from the speech signal.

The acoustic speech waveform carries information about several sources, including linguistic units, speaker, emotion, etc. Traditional features, extracted from the magnitude spectral envelope of the speech signal, capture information about all these sources. For example, Mel-frequency cepstral coefficients (MFCC) have been widely used for speech recognition [17], language identification [18], and emotion recognition [19], as MFCC capture information about all of these sources.
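To make the Gram-matrix-based boundary detection from the abstract concrete, the following is a minimal sketch (not the paper's exact procedure): compute a cosine-kernel Gram matrix over frame-level features such as MFCCs, and flag frames where the similarity between the immediately preceding and following frames dips to a local minimum below a threshold. The function name, window width, and threshold are illustrative assumptions.

```python
import numpy as np

def candidate_boundaries(features, width=2, threshold=0.5):
    """Hypothetical sketch: flag frames whose average cosine similarity
    between the `width` frames before and after them (read off the Gram
    matrix) is a local minimum below `threshold`."""
    # Cosine-kernel Gram matrix over frame-level features (e.g. MFCCs).
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    gram = norm @ norm.T
    n = len(features)
    scores = np.ones(n)
    for t in range(width, n - width):
        # Mean similarity between frames just before and just after t.
        scores[t] = gram[t - width:t, t:t + width].mean()
    return [t for t in range(1, n - 1)
            if scores[t] < threshold
            and scores[t] <= scores[t - 1] and scores[t] <= scores[t + 1]]
```

On a toy sequence of ten frames of one spectral shape followed by ten of another, this flags the transition frame as the single boundary candidate.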
In supervised ASR, a powerful classifier, such as a deep neural network (DNN), learns a nonlinear map between the MFCC features and the manual transcriptions [1]. Hence, an ASR system can efficiently recognize the linguistic units from the MFCC features, even though they carry other information as well. In the zero-resource scenario, however, manual transcriptions are not available to guide the model in selecting only the relevant linguistic information. Hence, it is important to use improved features for zero-resource applications, as opposed to generic features like MFCC.

In this paper, we propose to extract features that highlight the speech-specific characteristics required for applications like unsupervised spoken term discovery. We propose full-coverage segmentation and clustering, where the entire data is segmented and labeled. A full-coverage system would allow the development of speech indexing and query-by-example [20] search systems in an unsupervised manner.

Features extracted from a Gaussian mixture model (GMM) or an autoencoder are expected to occupy orthogonal subspaces for different speech units, which helps in achieving better inter-phone discrimination [21, 22]. Both GMMs and autoencoders, however, do not model the time sequence in which the feature vectors evolve. As a consequence, the new representation may vary even within a phoneme segment, which may subsequently lead to ambiguity at the clustering stage. Here, we learn unit-level features that are consistent with the segment properties, as opposed to frame-level features.

Sequence-aware representation learning methods [23, 24] require initial segmentation and labeling information. Given preliminary segments and labels, the correspondence autoencoder (CAE) [23] and ABnet [24] learn feature representations that minimize distances among different instances of the same label. The features learned by the CAE and ABnet outperform the MFCC features on both the ABX and spoken term discovery (STD) tasks [23].
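The siamese objective of minimizing intra-segment distances while maximizing inter-segment distances can be written as a standard contrastive loss. The sketch below is one plausible instantiation under assumed names and a margin value, not the paper's exact loss; in practice the loss would be minimized with an automatic-differentiation framework rather than computed directly as here.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_segment, margin=1.0):
    """Sketch of a siamese objective: pull together embedding pairs drawn
    from the same segment, push apart pairs from different segments via a
    hinge with a margin. `same_segment` is 1.0 for positive pairs, 0.0 for
    negative pairs; names and margin are illustrative assumptions."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)                   # pairwise distances
    pos = same_segment * d ** 2                                 # intra-segment term
    neg = (1 - same_segment) * np.maximum(margin - d, 0) ** 2   # inter-segment term
    return float(np.mean(pos + neg))
```

A positive pair with identical embeddings contributes zero loss, as does a negative pair separated by more than the margin; only close negatives and distant positives are penalized.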
Both the CAE and the ABnet require segmentation and label information for extracting speaker-independent representations. Features learned after the clustering step combine the errors of both the segmentation and clustering steps. Usually, the segmentation step performs better than the clustering step. So, we move the feature learning step before clustering and learn features directly from the segmentation information and

Copyright 2019 ISCA. INTERSPEECH 2019, September 15–19, 2019, Graz, Austria. http://dx.doi.org/10.21437/Interspeech.2019-2981
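The graph-growing clustering mentioned in the abstract can be illustrated as follows. This is an assumed form of the idea, not the paper's exact algorithm: connect segment embeddings whose cosine similarity exceeds a threshold and label each connected component of the resulting graph as one acoustic unit; the function name and threshold are hypothetical.

```python
import numpy as np

def graph_cluster(embeddings, sim_threshold=0.9):
    """Sketch of graph-growing clustering: build a similarity graph over
    segment embeddings and grow clusters as connected components."""
    norm = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    adj = (norm @ norm.T) > sim_threshold    # edges between similar segments
    n = len(embeddings)
    labels = [-1] * n
    label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]                      # grow one component via DFS
        while stack:
            node = stack.pop()
            if labels[node] != -1:
                continue
            labels[node] = label
            stack.extend(j for j in range(n) if adj[node, j] and labels[j] == -1)
        label += 1
    return labels
```

Segments with nearly parallel embeddings end up sharing a label, while dissimilar ones receive distinct labels, which is the behavior the full-coverage labeling step requires.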