Channel-selection for distant-speech recognition on CHiME-5 dataset

Hannes Unterholzner 1, Lukas Pfeifenberger 1, Franz Pernkopf 1, Marco Matassoni 2, Alessio Brutti 2, Daniele Falavigna 2

1 Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria
2 Fondazione Bruno Kessler, Center for Information and Communication Technology, Trento, Italy

hunterholzner@fbk.eu

Abstract

The 5th CHiME Speech Separation and Recognition Challenge represents a realistic scenario for validating the variety of techniques required to properly handle conversational multi-party speech acquired with distant microphones. We address the problem of channel selection using a DNN-based channel classifier that predicts good channels according to the oracle results. In combination with ROVER as a final combination step, we improve the performance with respect to the baseline system.

1. Introduction

This paper discusses the scenarios associated with the multiple-array track of the challenge [1], considering all the channels available from the six Microsoft Kinect devices. We investigate the applicability of a channel-selection approach based on purely acoustic features (i.e., features that capture spatial information about the desired speech source) in order to identify a subset of candidate channels to combine after the decoding stage. Furthermore, we discuss the approach of acoustic model adaptation [2]. We adopted the three baselines for array synchronization, enhancement (BeamformIt [3]), and conventional ASR based on a time delay neural network (TDNN) using lattice-free maximum mutual information (LF-MMI) [4].

2. Components
Figure 1: The architecture of the proposed CHiME-5 automatic transcription system (blocks: BeamformIt, acoustic adaptation, channel selection, decoding, ROVER, Q-E): signal enhancement based on BeamformIt; DNN-based channel classifier; the multiple inputs are decoded using a DNN-based adapted AM that exploits a preliminary automatic transcription of the speech at hand; the hypotheses are finally combined to build the final output. Boxes colored in grey do not contribute in terms of improvement.

2.1. Channel selection

The results on the a posteriori best channel selection (Sec. 3.1) show that there is a margin for impressive gains if one is able to predict the best-performing channel for each utterance. We define the oracle channel as the best channel, i.e., the one providing the lowest word error rate (WER) for a given utterance. However, the oracle channel seems not to be related to the speaker position or to other spatial features. In this sense, it is extremely surprising that two very close channels often provide substantially different results. One attractive approach is to employ a neural network that receives signal-based features (e.g., filter bank features) as input and predicts the oracle channel. Since multiple oracle channels are available, this is a multi-label multi-class problem. We attack this problem using a DNN, trained on a subset of the training set using the binary cross-entropy loss and a sigmoid activation at the output of the last layer. Then, one can either select the best channels by taking the maximum score, or provide a channel ranking for the successive ROVER stage.

2.2. Acoustic model adaptation

It is known that adapting all the parameters of a DNN trained on a large corpus using a small adaptation set can lead to overfitting. The solution adopted here is based on the principle of transfer learning, where an already trained network is used to learn another task with additional examples; in this case we use weight transfer, i.e.,
the last layer of the DNN is trained with a higher learning rate.

2.3. Hypotheses combination

The combination of multiple ASR hypotheses usually leads to significant improvements compared to the output of each individual system. ROVER, the most popular ASR system combination approach, performs hypothesis fusion by first building a word confusion network (CN) from the 1-best hypotheses of the ASR systems entering the combination and then selecting the best word in each CN bin via majority voting [5].

The hypotheses combination process considers the first input candidate as a "skeleton" against which the other hypotheses are aligned in a greedy manner. For this reason, depending on the order in which the hypotheses are fed to the algorithm, the resulting combination can show large variations in quality. In the past we developed a system [6, 7] for optimally ranking the ASR hypotheses that feed ROVER. However, due to time constraints, this system has not been applied yet, and the ASR hypotheses produced for this challenge are ranked with the approximate method described in Section 2.1. The application of the optimal ranking approach to the CHiME-5 evaluation sets is left for future work.

CHiME 2018 Workshop on Speech Processing in Everyday Environments, 07 September 2018, Hyderabad, India. DOI: 10.21437/CHiME.2018-20
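The oracle-channel definition of Section 2.1 (the channel achieving the lowest per-utterance WER) can be sketched as follows. This is a minimal illustration, not the challenge scoring pipeline: `wer` is a plain Levenshtein word error rate, and the channel names and transcripts are invented for the example.

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def oracle_channels(reference, channel_hyps):
    """Return all channels tied at the lowest WER (several oracles are possible)."""
    wers = {ch: wer(reference, hyp) for ch, hyp in channel_hyps.items()}
    best = min(wers.values())
    return sorted(ch for ch, w in wers.items() if w == best)

# Illustrative example (not CHiME-5 data):
hyps = {
    "U01.CH1": "turn on the kitchen light",
    "U02.CH1": "turn on a kitchen light",
    "U03.CH1": "turn the kitchen light",
}
print(oracle_channels("turn on the kitchen light", hyps))  # ['U01.CH1']
```

Because several channels may tie at the lowest WER, the target for the classifier in Section 2.1 is naturally multi-label.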
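The multi-label formulation of Section 2.1 (sigmoid outputs, binary cross-entropy, one label per channel) can be sketched as below. This is only the scoring/ranking side under assumed shapes: the logits, channel names, and the "two oracle channels" target are illustrative, and the actual DNN front-end (filter bank features, hidden layers) is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(scores, targets, eps=1e-12):
    """Binary cross-entropy: each channel is an independent
    'is this an oracle channel?' label (multi-label setup)."""
    p = np.clip(scores, eps, 1.0 - eps)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

def rank_channels(scores, channel_names):
    """Channel ranking (best first), e.g. to order inputs for ROVER."""
    order = np.argsort(-scores)
    return [channel_names[i] for i in order]

# Toy forward pass for one utterance over 6 arrays (values are illustrative).
logits = np.array([0.2, 1.5, -0.3, 0.9, -1.1, 0.4])
scores = sigmoid(logits)                  # per-channel "goodness" in (0, 1)
targets = np.array([0, 1, 0, 1, 0, 0])    # two oracle channels for this utterance
names = ["U01", "U02", "U03", "U04", "U05", "U06"]
print(rank_channels(scores, names)[:2])   # ['U02', 'U04']
```

Taking the maximum score recovers single-channel selection; keeping the full ranking feeds the ROVER stage of Section 2.3.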
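The per-bin majority vote of Section 2.3 can be illustrated with a minimal sketch. Real ROVER first builds the confusion network by greedy dynamic-programming alignment against the skeleton hypothesis; here the hypotheses are assumed to be already aligned into equal-length bins (with `''` marking a NULL word), so only the voting step is shown.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote over a pre-aligned confusion network.

    aligned_hyps: equal-length word lists, one per ASR system;
    '' marks a NULL (no word in that bin for that system).
    """
    n_bins = len(aligned_hyps[0])
    out = []
    for b in range(n_bins):
        votes = Counter(h[b] for h in aligned_hyps)
        word, _ = votes.most_common(1)[0]
        if word:  # drop bins where NULL wins
            out.append(word)
    return " ".join(out)

# Three already-aligned 1-best hypotheses (illustrative):
hyps = [
    ["turn", "on", "the", "light"],   # skeleton
    ["turn", "on", "a",   "light"],
    ["turn", "",   "the", "light"],
]
print(rover_vote(hyps))  # turn on the light
```

The dependence on input order mentioned in the text enters through the alignment step (the skeleton fixes the bin structure), which is why ranking the hypotheses before combination matters.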