Feasibility Analysis of Symbolic Representation for Single-Channel EEG-Based Sleep Stages

Zheng Chen 1, Pei Gao 1, Ming Huang 1,*, Naoaki Ono 1,2, MD Altaf-Ul-Amin 1, and Shigehiko Kanaya 1,2

Abstract: Sleep screening based on the construction of sleep stages is one of the major tools for the assessment of sleep quality and the early detection of sleep-related disorders. Because of inherent variability, such as inter-user anatomical variability and inter-system differences, representation learning of sleep stages that yields stable and reliable characteristics is important for downstream tasks in sleep science. In this paper, we investigate the feasibility of an EEG-based symbolic representation for sleep stages. Using the Latent Dirichlet Allocation topic model and comparing different feature extraction methods, this work demonstrates the feasibility of a multi-topic representation for sleep stages and physiological signals.

I. INTRODUCTION

Sleep is the cornerstone of health and well-being throughout life. Getting adequate sleep at night can help protect our mental health, physical health, and quality of life [1]. Sleep screening based on sleep stages is one of the major tools in the assessment of sleep-related disorders, such as sleep apnea syndrome, schizophrenia, depression, insomnia, narcolepsy, and other neural abnormalities. Under the gold standard of the American Academy of Sleep Medicine (AASM), sleep is divided into five stages: wake, rapid eye movement (REM), and non-REM, where non-REM is further divided into N1, N2, and N3 [2]. Meanwhile, stage scoring still relies on multi-lead electroencephalogram (EEG) recordings from overnight polysomnography (PSG), labeled manually by sleep experts [3]. The informative frequency oscillations of sleep EEG lie in the range from 0.5 Hz to 30-35 Hz. Wakefulness is characterized by alpha (8-12 Hz) and beta (16-30 Hz) rhythms.
In N1, alpha activity occupies more than 50% of the epoch, accompanied by theta waves (4-8 Hz). In N2, theta waves are also noticeable, and sleep spindles and K-complexes appear. N3 is deep sleep (or slow-wave sleep): an epoch is classified as N3 when delta activity (0-4 Hz) is present for more than 20% of the epoch [1]. A REM epoch is scored when saw-tooth waves (or theta waves) and saccadic eye movements are evident. Alpha waves are also predominant during the REM stage.

This research and development work was supported by a Grant-in-Aid for Young Scientists of the Japan Society for the Promotion of Science (JSPS) #20k19923.
Zheng Chen, Pei Gao, Ming Huang*, Naoaki Ono, MD Altaf-Ul-Amin, and Shigehiko Kanaya are with the Graduate School of Science and Technology, Nara Institute of Science and Technology, Takayamacho 8916-5, Ikoma, 6300192 Japan (e-mail: {chen.zheng.bn1, gao.pei.gi3, alex-mhuang, nono, amin-m, skanaya}@is.naist.jp).
Naoaki Ono and Shigehiko Kanaya are with the Data Science Center, Nara Institute of Science and Technology, Takayamacho 8916-5, Ikoma, 6300192 Japan.

Numerous sleep-related studies are based on the assessment of sleep stages using EEG recordings, for instance, the analysis of insomnia disorder [4], the modeling of transition mechanisms [5], and the development of automatic sleep scoring systems [6], [7], [8]. In particular, the results reported in the literature are promising when combined with machine learning (or, more recently, deep learning). The performance of machine learning methods depends heavily on the choice of data representation (or features) to which they are applied [9].
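As an illustration of one simple hand-crafted representation, the relative power in the frequency bands quoted above can be computed for a single epoch. The following is a minimal sketch and not the method used in this paper; the function name `relative_band_power`, the 100 Hz sampling rate, the 30-s epoch length, and the synthetic signal are all assumptions made for the example, and the delta band is started at 0.5 Hz to match the lower edge of the informative range.

```python
import numpy as np

# Band edges in Hz, taken from the text; the delta band starts at 0.5 Hz
# (an assumption, matching the lower edge of the informative range).
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta": (16.0, 30.0),
}


def relative_band_power(epoch, fs):
    """Return the fraction of 0.5-30 Hz spectral power falling in each band."""
    epoch = np.asarray(epoch, dtype=float)
    epoch = epoch - epoch.mean()                 # remove the DC offset
    window = np.hanning(epoch.size)              # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(epoch * window)) ** 2
    freqs = np.fft.rfftfreq(epoch.size, d=1.0 / fs)
    in_range = (freqs >= 0.5) & (freqs <= 30.0)
    total = spectrum[in_range].sum()
    return {
        name: float(spectrum[(freqs >= lo) & (freqs < hi)].sum() / total)
        for name, (lo, hi) in BANDS.items()
    }


# Synthetic 30-s epoch (hypothetical data): a 10 Hz sinusoid plus weak noise.
fs = 100  # assumed sampling rate in Hz
t = np.arange(30 * fs) / fs
rng = np.random.default_rng(0)
epoch = np.sin(2 * np.pi * 10.0 * t) + 0.1 * rng.standard_normal(t.size)

powers = relative_band_power(epoch, fs)
dominant_band = max(powers, key=powers.get)
```

Because the synthetic epoch is dominated by a 10 Hz component, `dominant_band` should come out as the alpha band. The four fractions need not sum to one, since the 12-16 Hz range is not assigned to any band in the text.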
Therefore, a large amount of the effort in deploying such studies goes into the design of preprocessing pipelines that yield stable and reliable characteristics, such as hand-crafted features [10], spectrograms [11], empirical mode decomposition [12], and feature-mapping neural networks [13]. Notably, the large-scale patterns of synchronized neuronal activity (or EEG) are ever changing and thus exhibit considerable variability over time [14]. This non-stationary nature of real EEG signals inevitably limits statistical data processing over time. In addition, the functional cooperative interaction of brain dynamics is heterogeneous across subjects, and even across recordings taken at different times from the same subject. As a consequence, exploring a dominant and reliable representation of EEG is central to understanding sleep construction and to devising optimal data-driven strategies for downstream tasks.

One representation that the data mining community has considered is the transformation of real-valued data into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from text processing and machine learning [15]. Moreover, such approaches have recently received attention in sleep stage analysis. Herrera et al. proposed a novel method for symbolic representation of the EEG and evaluated its potential as an information source for a sleep stage classifier [16]. To meet the criticism and reveal the latent sleep states, Koch et al. used symbolic aggregate approximation (SAX) to transform each sleep EEG epoch into a mixture of probabilities of latent sleep states and developed an automatic sleep classifier using the Latent Dirichlet Allocation (LDA) topic model [17]. Christensen et al. were inspired by the idea of Koch et al.
and used the same method to analyze the sleep EEG of people with insomnia disorder with a frequency-based sleep analysis procedure, which describes each epoch as a mixture of vigilance states [18]. However, the proposed SAX

2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Oct 31 - Nov 4, 2021, Virtual Conference. 978-1-7281-1178-0/21/$31.00 ©2021 IEEE. 5928