1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2872862, IEEE Sensors Journal

Action Detection and Recognition in Continuous Action Streams by Deep Learning-Based Sensing Fusion

Neha Dawar, Student Member, IEEE, and Nasser Kehtarnavaz, Fellow, IEEE

Abstract—This paper presents a deep learning-based sensing fusion system to detect and recognize actions of interest from continuous action streams, which contain actions of interest occurring continuously and randomly among arbitrary actions of non-interest. The sensors used in the fusion system consist of a depth camera and a wearable inertial sensor. A convolutional neural network is utilized for the depth images obtained from the depth camera, and a combination of a convolutional neural network and a long short-term memory network is utilized for the inertial signals obtained from the inertial sensor. Each sensing modality first performs segmentation of all actions and then detection of actions of interest for a particular application. A decision-level fusion of the two sensing modalities is then carried out to recognize the detected actions of interest. The developed fusion system is examined for two applications: one involving transition movements for home healthcare monitoring and the other involving smart TV hand gestures. The results obtained show the effectiveness of the developed fusion system in dealing with realistic continuous action streams.

Index Terms—Deep learning-based continuous action detection and recognition, fusion of depth and inertial sensing, action detection and recognition in continuous action streams.

The authors are with the Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX 75080 USA (e-mail: neha.dawar@utdallas.edu, kehtar@utdallas.edu).

I. INTRODUCTION

Human action or gesture recognition has enabled natural interfacing between humans and computers and has already found its way into consumer electronics products. Many applications have benefited from human action or gesture recognition. For example, human action recognition has been increasingly used for activity monitoring of the elderly population in home environments to address the steady increase in healthcare costs [1]. Different sensing modalities, including RGB cameras, e.g. [2], [3], depth cameras, e.g. [4], [5], and inertial sensors, e.g. [6], [7], have mostly been utilized individually for human action or gesture recognition. As discussed in our previous works [8]–[10], action or gesture recognition can be made more robust by fusing decisions from two sensors of differing modalities as compared to using a single-modality sensor.

In the great majority of works reported in the literature on action or gesture recognition, actions or gestures of interest are already segmented from action streams. To operate a human-computer interaction system in a real-world setting, it is required that the actions of interest be detected from unseen continuous action streams in which they occur randomly and continuously amongst arbitrary actions of non-interest or no actions. This real-world setting is by far a more challenging scenario compared to the scenario where action streams are segmented manually such that each segment contains only one action of interest. Detection of actions of interest from continuous action streams requires first segmenting all possible actions, regardless of whether they are actions of interest or actions of non-interest, followed by identifying and classifying the actions of interest for a particular application.
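The two-stage process just described, segmenting all candidate actions from a continuous stream and then classifying each segment, can be illustrated with a minimal sketch. The threshold-based segmentation below is an illustrative assumption and not the segmentation method used in this paper; the function name, threshold, and minimum-length parameter are hypothetical.

```python
import numpy as np

def segment_stream(signal, threshold=0.5, min_len=3):
    """Stage 1 sketch: return (start, end) index pairs where |signal|
    stays above a threshold for at least min_len consecutive samples.
    Each returned segment is a candidate action to be classified in
    stage 2 as an action of interest or of non-interest."""
    active = np.abs(signal) > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

# Toy 1-D motion stream: quiet, a burst of motion, then quiet again
stream = np.concatenate([np.zeros(5), np.ones(6) * 0.9, np.zeros(5)])
segments = segment_stream(stream)  # one candidate segment, samples 5..10
```

In practice each detected segment would be passed to a classifier (the CNN or CNN+LSTM networks described below) rather than being labeled by a fixed rule.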
In our previous works [11]–[13], several fusion approaches were developed to detect and recognize smart TV gestures from continuous action streams by using skeleton joint positions obtained from a depth camera and inertial signals obtained from an inertial sensor. In [14], a data flow synchronization technique was developed to enable the real-time implementation of our fusion approaches. Most of the previously developed fusion systems use handcrafted features together with classifiers such as Hidden Markov Model (HMM), Collaborative Representation Classifier (CRC), and Maximum Entropy Markov Model (MEMM) [11]–[15]. With the growing popularity of deep learning neural networks due to their high performance in various recognition tasks, in particular Convolutional Neural Networks (CNNs) [16] and Long Short-Term Memory (LSTM) networks [17], a CNN+LSTM-based fusion system to automatically detect and recognize actions of interest from continuous action streams has been developed in this work.

The developed fusion system is used to detect actions of interest from continuous action streams for two applications: human body transition movement monitoring and smart TV hand gesture recognition. The actions of interest in the transition movement monitoring application involve transitions between the body states of sitting, standing, and lying down. Considering the importance of fall detection for the elderly and for patients [18], falls are also monitored and detected here in addition to the transition movements. The fusion system developed in this paper utilizes a depth camera and a wearable inertial sensor simultaneously to perform continuous action detection and recognition. Unlike video cameras, depth cameras do not provide identifying facial information, thus avoiding privacy concerns. A continuous action dataset is also made available in this paper for public use.
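The decision-level fusion mentioned above combines the per-class scores produced independently by the two modality pipelines. The sketch below shows one simple fusion rule, a weighted average of class probability scores; the weighting scheme and the hypothetical class scores are illustrative assumptions, not the specific fusion rule used in this paper.

```python
import numpy as np

def fuse_decisions(depth_scores, inertial_scores, w_depth=0.5):
    """Decision-level fusion sketch: weighted average of per-class
    probability scores from the depth pipeline (CNN) and the inertial
    pipeline (CNN+LSTM). The weight w_depth is a hypothetical tuning
    parameter. Returns the fused class label and the fused scores."""
    depth_scores = np.asarray(depth_scores, dtype=float)
    inertial_scores = np.asarray(inertial_scores, dtype=float)
    fused = w_depth * depth_scores + (1.0 - w_depth) * inertial_scores
    return int(np.argmax(fused)), fused

# Hypothetical scores over 4 action classes for one detected segment
depth = [0.10, 0.60, 0.20, 0.10]     # from the CNN on depth images
inertial = [0.05, 0.30, 0.55, 0.10]  # from the CNN+LSTM on inertial signals
label, fused = fuse_decisions(depth, inertial)
```

The benefit of such fusion is that a segment scored ambiguously by one modality (here, classes 1 and 2 compete) can be resolved by the evidence from the other modality.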
This dataset consists of synchronized depth images and inertial signals associated with body transition movements as well as falls, performed in a continuous and random manner among various actions of non-interest. In addition to this dataset, our continuous action dataset (named UTD-