2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)

Representation Learning for Emotion Recognition from Smartphone Keyboard Interactions

Surjya Ghosh*‡, Shivam Goenka*, Niloy Ganguly*, Bivas Mitra*, Pradipta De*
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, INDIA 721302
Department of Computer Science, Georgia Southern University, USA
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Email: {surjya.ghosh, shivamgoenka}@iitkgp.ac.in, {niloy, bivas}@cse.iitkgp.ac.in, pde@georgiasouthern.edu

Abstract—Typing characteristics on smartphone keyboards vary across individuals and can reveal emotion, similar to speech prosody or facial expressions. Existing works on typing-based emotion recognition rely on feature engineering to build machine learning models, while recent speech- and facial-expression-based techniques have shown the efficacy of learning the features automatically. Therefore, in this work, we explore the effectiveness of such learning models in keyboard-interaction-based emotion detection. In this paper, we propose an end-to-end framework, which first uses a sequence-based encoding method to automatically learn a representation from the raw keyboard interaction pattern and subsequently uses this representation to train a multi-task learning based neural network (MTL-NN) to identify different emotions. We carry out a 3-week in-the-wild study involving 24 participants using a custom keyboard capable of tracing users' interaction patterns during text entry. We collect interaction details like touch speed, error rate, and pressure, along with self-reported emotions (happy, sad, stressed, relaxed), during the study. Our analysis of the collected dataset reveals that the representation learnt from the interaction pattern has an average correlation of 0.901 within the same emotion and 0.811 between different emotions.
As a result, the representation is effective in distinguishing different emotions with an average accuracy (AUCROC) of 84%.

Index Terms—Representation learning, Emotion detection, Keyboard interaction, Smartphone interaction

I. INTRODUCTION

Keyboard interactions on smartphones have been researched as an effective modality for emotion detection [1]–[9]. However, the underlying patterns are complex enough that building an accurate emotion prediction model from smartphone keyboard interactions requires extensive feature engineering. Some recent works on emotion detection from other modalities, such as facial expressions and speech characteristics, have shown that automatic feature extraction can be as effective as feature engineering [10], [11]. Hence, applying automatic feature extraction to smartphone keyboard interactions presents itself as a promising approach to building predictive models.

The existing literature indicates that many emotion detection approaches adopt advanced techniques such as representation learning and multi-task learning (MTL), motivated by the success of deep learning in different domains [10]–[13]. For example, Ghosh et al. [13] applied representation learning to automatically extract features from speech and glottal flow signals. Li et al. [10] proposed an attention-pooling based representation learning mechanism to determine emotion from speech utterances. They used an end-to-end deep convolutional neural network (CNN) on spectrograms extracted from speech utterances, thus removing the need for manual feature extraction. It has also been shown that the performance of emotion detection from acoustic signals improves when valence and arousal are modeled together using MTL [12]. In Emo2Vec [11], Xu et al. showed that word-level representations obtained using MTL yield superior performance for different emotion-related tasks (e.g.
emotion analysis, stress detection) from text data. While representation learning reduces the feature engineering effort [14], MTL often delivers superior performance by sharing training knowledge among related tasks [15], [16]. However, to the best of our knowledge, no prior work investigates the effectiveness of these learning models for emotion detection from keyboard interaction patterns on smartphones.

In this paper, we propose an end-to-end framework to determine human emotion based on keyboard interaction patterns, leveraging the aforesaid learning algorithms. It comprises two phases. In the first phase, we deploy a sequence-based encoder using Long Short-Term Memory (LSTM). It automatically learns a representation from the raw keyboard interaction pattern, thus reducing the feature engineering overhead. We collate all the keyboard interactions in a typing session. The interaction details within a session, like pressure, speed, duration, and key type (deletion, special character, alphanumeric, etc.), are fed as input to the framework to obtain the session-level representation. In the second phase, we deploy a multi-task learning (MTL) based deep neural network (DNN) model for emotion detection using the learnt representation. In MTL, learning multiple tasks together helps to share knowledge among similar tasks, thereby often yielding superior performance. In our context, emotion detection for each individual user is a separate task. As a result, MTL leverages the underlying similarity in keyboard interaction behavior across users to improve emotion detection performance.

We conduct a 3-week in-the-wild study involving 24 participants using an Android-based custom keyboard, capable of tracing users' keyboard interactions. Based on the text entry, a self-report probing mechanism collects four types of emotions (happy, sad, stressed, relaxed).
We utilize the collected keyboard interactions and self-report details for model construc-

978-1-7281-3888-6/19/$31.00 ©2019 IEEE
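The two-phase framework described above (an LSTM encoder that turns a session of keystrokes into a fixed-length representation, followed by per-user MTL heads over a shared layer) can be sketched as a minimal NumPy mock-up. All dimensions, the four-feature keystroke encoding, and the random weights below are illustrative assumptions for exposition, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMEncoder:
    """Phase 1: encode a variable-length typing session (one feature
    vector per keystroke) into the final hidden state."""
    def __init__(self, n_in, n_hid):
        self.n_hid = n_hid
        # Stacked weights for the four gates: input, forget, output, cell.
        self.W = rng.normal(0, 0.1, (4 * n_hid, n_in))
        self.U = rng.normal(0, 0.1, (4 * n_hid, n_hid))
        self.b = np.zeros(4 * n_hid)

    def encode(self, session):
        h = np.zeros(self.n_hid)
        c = np.zeros(self.n_hid)
        for x in session:                      # one LSTM step per keystroke
            z = self.W @ x + self.U @ h + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h                               # session-level representation

class MTLHeads:
    """Phase 2: a shared layer plus one softmax head per user, so each
    user's emotion classification is a separate task sharing the trunk."""
    def __init__(self, n_hid, n_shared, users, n_emotions=4):
        self.Ws = rng.normal(0, 0.1, (n_shared, n_hid))
        self.heads = {u: rng.normal(0, 0.1, (n_emotions, n_shared))
                      for u in users}

    def predict(self, rep, user):
        shared = np.tanh(self.Ws @ rep)        # knowledge shared across users
        logits = self.heads[user] @ shared
        e = np.exp(logits - logits.max())
        return e / e.sum()                     # P(happy, sad, stressed, relaxed)

# A toy session: 10 keystrokes x 4 features (speed, pressure, duration, key type).
session = rng.normal(size=(10, 4))
enc = LSTMEncoder(n_in=4, n_hid=16)
mtl = MTLHeads(n_hid=16, n_shared=8, users=["u1", "u2"])
rep = enc.encode(session)
probs = mtl.predict(rep, "u1")
print(rep.shape, probs.shape)
```

In the actual framework both phases would be trained jointly by backpropagation, with each user's labeled sessions supervising that user's head while updating the shared encoder and trunk parameters.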