2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Residual, Mixer, and Attention: The Three-Way Combination for a Streaming Wake Word Detection Framework

Sattaya Singkul*, Theerat Sakdejayont*, and Tawunrat Chalothorn*
*Innovation Research and Development, Kasikorn Labs, Nonthaburi, Thailand
E-mail: {sattaya.s,theerat.s,tawunrat.c}@kbtg.tech

Abstract—Speech interactions and digital assistants rely on effective wake word detection models to identify predefined words. In this study, we address the need for an efficient wake word detection model and propose a "three-way residual separable convolution network" (3W-ResSC) inspired by human multi-perspective learning. Our 3W-ResSC model combines three sources of information: residual learning (ResNet), point-wise patterns, and depth-wise patterns from a CNN mixer (Mi). In addition, to consolidate and emphasize information, we propose an independent multi-view attention (iMVA) based on an attention mechanism (At). Building on ResNet, Mi, and At, we propose three combinations for our model: ResNet+Mi, ResNet+At, and ResNet+Mi+At. We investigate these combinations to identify the best-performing configuration and pattern. To evaluate our model, we conduct experiments on multiple datasets, including Gowajee, HeyFireFox, HeySnips, and GSCv2. The results demonstrate that our 3W-ResSC outperforms baseline models in terms of equal error rate (EER), Matthews correlation coefficient (MCC), and false rejection rate (FRR) in multi-class classification. Additionally, we introduce a wake word detection framework specifically designed for streaming processing. This framework leverages an independent speech window, resembling a buffer-like streaming production process.
Overall, our proposed 3W-ResSC model and wake word detection framework offer significant advances in wake word detection, showing improved performance and efficiency, which we discuss across various datasets and training sizes. These findings contribute to the development of more effective wake word detection and speech processing using ResNet, Mi, and At in deep learning.

I. INTRODUCTION

Speech interactions and digital assistants built into smartphones and voice command devices require a wake word detection system to anticipate the detection of predefined words [1]. For this, a lightweight model with a small memory footprint, low computational cost, and low latency is needed [2] to activate the device quickly as soon as the wake word appears in streaming audio.

In recent work, deeper models require high computational resources [3]–[5] for training and inference, owing to their number of parameters and model complexity; examples include Wav2KWS [6] and the Keyword Transformer [7], which are not appropriate to deploy on mobile devices with limited processing units and memory. Among smaller models, the recurrent neural network (RNN) is sequential in nature and performs well with sequential frame information; however, its dependence on previous frames prevents parallel chunk-streaming processing on GPUs. Therefore, most current systems apply a convolutional neural network (CNN) for frame-by-frame processing, which does not require previous frames. A CNN kernel is applied repeatedly by sliding through time or frequency and covers a small, fixed frame length. Each kernel can capture local patterns, and the receptive field can be expanded by stacking CNN layers: the expanded receptive field of the higher layers can "see" longer frame spans than the lower layers and thus captures more global patterns. This is why CNNs make better use of on-device resources than RNNs and deeper models.
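The receptive-field growth described above can be sketched with a small helper. Note that `receptive_field` is a hypothetical illustration, not part of the paper's model; it assumes stride-1 convolutions, as in frame-by-frame keyword-spotting front ends.

```python
# Sketch: how stacking CNN layers expands the receptive field over time frames.
# Assumes stride-1 convolutions; each layer with kernel size k and dilation d
# adds (k - 1) * d frames to the span the top layer can "see".

def receptive_field(layers):
    """layers: list of (kernel_size, dilation) tuples for stride-1 convs."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation  # each layer widens the span by (k-1)*d
    return rf

# A single 3-frame kernel sees only 3 frames...
print(receptive_field([(3, 1)]))                  # -> 3
# ...stacking three such layers lets the top layer see 7 frames,
print(receptive_field([(3, 1)] * 3))              # -> 7
# and dilations (1, 2, 4) widen that to 15 frames at the same kernel cost.
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # -> 15
```

This is why dilated stacks, as used later in Res8-style models, cover long frame spans cheaply: coverage grows with the dilation schedule while the per-layer computation stays fixed.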
However, when learning highly concentrated deep details, the CNN model falls short: deeply stacked CNN layers lose some information and may suffer from vanishing gradients [4], [8]. ResNet [9] solves this problem by adding skip connections around residual functions. The Res8 [8] model combines the dilated CNN pattern with ResNet to achieve high accuracy. However, Res8's performance depends on a sufficient amount of speech data, which may not be available when the dataset is small or the data are complex.

To address this, we propose a method to improve the efficiency of Res8 [8] for wake word modeling. Our "three-way residual separable convolution network (3W-ResSC)" is inspired by the concept of human multi-perspective learning [10]; that is, "the human brain flexibly adapts to support the information-processing needs of different perspectives and can mix to extend performance," as Jääskeläinen [10] described. Responding to this concept, we implemented a deep learning method with three kinds of information, analogous to multi-perspective learning. The first kind is the original input with residual learning (ResNet). For the second and third, point- and depth-wise patterns are mixed (Mi) using a 2D separable dilated CNN inspired by [11]. This process divides a single convolution with a dilation factor into two or more convolutions that produce the same output, enabling a more informative view for learning with minimal computational requirements. Also, to consolidate the human multi-perspective learning, our independent multi-view attention (iMVA) is proposed, based on an attention mechanism (At) inspired by the traditional MVA [12]. Our iMVA captures and learns three side-view components, namely channel, global, and local attention, specifically to emphasize and consolidate the learning information. There-