ON-TALK AND OFF-TALK DETECTION: A DISCRETE WAVELET TRANSFORM ANALYSIS OF ELECTROENCEPHALOGRAM

Fasih Haider 1, Hayakawa Akira 1, Saturnino Luz 2, Carl Vogel 3 and Nick Campbell 1

1 ADAPT Centre, 3 School of Computer Science and Statistics, Trinity College Dublin, Ireland.
2 Usher Institute of Population Health Sciences & Informatics, University of Edinburgh, UK

ABSTRACT

Spoken interaction with a machine gives rise to a behaviour that is uncommon in face-to-face human communication: Off-Talk, defined as speech utterances that are not directed to the immediate interlocutor, the machine, but to another person or even to oneself. It is our contention that a system able to detect Off-Talk utterances can interact with a human more efficiently by recognising that such utterances are not directed to the system and, hence, not replying to them. In this paper, we demonstrate the discrimination power of a wide range of Electroencephalogram (EEG) frequency bands using wavelet transform analysis and propose models for On-Talk and Off-Talk detection using audio and EEG signals, and their fusion. Our study shows that the EEG signal can identify the occurrence of Off-Talk utterances with promising accuracy, and that its fusion with audio features yields a slight further improvement.

Index Terms— multimodal interaction, dialogue system, brain-computer interface (BCI), electroencephalogram (EEG), on-off talk (speech) detection, multi-sensor fusion

1. INTRODUCTION

It has been observed that when people interact with computer systems, they not only talk to the computer system but also tend to talk to themselves and to other people, if present [1, 2, 3]. Oppermann et al. [1] coined the term "Off-Talk" to denote speech that is not addressed to the computer system, as opposed to "On-Talk", utterances that are directed to the system and therefore need to be understood by it.
Batliner et al. [2] open their paper with an example from Shakespeare's Hamlet, in which Hamlet seems to change his speaking style when moving from utterances addressed to his interlocutor to utterances that are spoken, but not directed towards his interlocutor. This shows that Off-Talk is not a new phenomenon, but part of human nature that Shakespeare expressed through his characters [2].

This research is supported by the "ADAPT 13/RC/2106" project (http://www.adaptcentre.ie/) in the SCL (Speech Communication Lab) and DLab (Design and Innovation Lab) at Trinity College Dublin, the University of Dublin, Ireland.

The definition of Off-Talk, as provided by Oppermann et al. [1, p. 1], encompasses every utterance that is not directed to the system, such as: (i) soliloquy/thinking aloud, (ii) swearing, (iii) reading displayed text aloud, (iv) conversation with other person(s) present, (v) telephone conversation (e.g., on a cellular phone) and (vi) extrinsic speech (e.g., video player, TV set, etc.). The objective of this paper is to model On-Talk and Off-Talk in terms of EEG and audio features.

Previous studies by Oppermann et al. [1] report that the loudness difference between On-Talk and Off-Talk can be used as a significant indicator of Off-Talk, and Hayakawa et al. [3] also suggest that prosodic features can aid On-Talk and Off-Talk detection. One of the contributions of the present study is the demonstration of the discrimination power of EEG frequency bands for On-Talk and Off-Talk detection.

The EEG signal and its different frequency bands have been employed in several applications, such as seizure detection, emotion recognition, and even speech recognition. Ocak [4] analyses the frequency bands between 0 Hz – 86.8 Hz using the wavelet transform, and reports that the higher bands between 43.4 Hz – 86.8 Hz provide the best accuracy for the detection of epileptic seizures.
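The multilevel band splitting underlying such wavelet analyses can be sketched as follows. This is an illustrative Haar discrete wavelet transform, not the pipeline used by the cited studies or by this paper; the sampling rate of 256 Hz is an assumption chosen only so that the resulting sub-band edges are easy to read.

```python
# Sketch: multilevel Haar DWT splitting an EEG trace into frequency
# sub-bands. At each level the analysed band halves, so the detail
# coefficients of level k cover roughly fs/2**(k+1) .. fs/2**k Hz.
# fs = 256 Hz below is a hypothetical sampling rate for illustration.
import math

def haar_dwt_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficient lists."""
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def decompose(signal, fs, levels):
    """Return a list of (band_low_hz, band_high_hz, detail_coeffs) per level."""
    bands = []
    approx = list(signal)
    high = fs / 2.0  # Nyquist frequency: top edge of the first detail band
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        bands.append((high / 2.0, high, detail))
        high /= 2.0
    return bands

# Example: one second of a synthetic 10 Hz sine sampled at 256 Hz.
fs = 256
trace = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
for lo, hi, coeffs in decompose(trace, fs, 4):
    print(f"band {lo:.1f}-{hi:.1f} Hz: {len(coeffs)} detail coefficients")
```

A sub-band's energy (the sum of its squared detail coefficients) can then serve as a per-band feature. In practice, libraries such as PyWavelets offer smoother wavelets (e.g., Daubechies families), which are generally preferred over Haar for EEG analysis.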
Adeli et al. [5] use a wavelet-chaos methodology to detect seizures from EEGs and EEG sub-bands, analysing EEG signals between 0 Hz – 60 Hz. Petrantonakis et al. [6] use the lower frequency bands between 8 Hz – 12 Hz and 13 Hz – 30 Hz for emotion recognition. The EEG signal has also been used for the recognition of unspoken words: Porbadnigk et al. [7] recorded 16 EEG channels with a 128-electrode cap montage and recognised five words with an average accuracy of 45.50%. The most prominent bands of the EEG signal lie in the lower frequencies (the Alpha band for attentional demands and the Beta band for emotional and cognitive processes) [8], but these bands may contain muscle-activity noise, which makes it difficult to measure purely neuronal activity during speech articulation, since articulation itself produces muscle activity. Muscle activity can introduce noise into EEG signals (e.g., the peak frequencies of masseter muscle movements lie in the 50 Hz – 60 Hz range, and those of frontalis muscle movements between 30 Hz – 40 Hz), and the noise band extends from 15 Hz to 100 Hz [9]. Kumar et al. [10] also report a noise range for frontalis muscles between 20 Hz –