A Multi-sensor Fusion Approach for Intention Detection

Rahul Kumar Singh 1, Rejin John Varghese 1, Jindong Liu 1, Zhiqiang Zhang 2, Benny Lo 1

1 Rahul K Singh, Rejin J Varghese, Jindong Liu and Benny Lo are with the Hamlyn Centre, Imperial College London, SW7 2AZ, UK {r.singh17, r.varghese15, benny.lo}@imperial.ac.uk
2 Zhiqiang Zhang is with the School of Electronic and Electrical Engineering and the School of Mechanical Engineering, University of Leeds, Leeds, LS2 9JT, United Kingdom {z.zhang3}@leeds.ac.uk

Abstract— For assistive devices to seamlessly and promptly assist users with activities of daily living (ADL), it is important to understand the user's intention. Current assistive systems are mostly driven by unimodal sensory input, which limits their accuracy and responsiveness. In this paper, we propose a context-aware sensor fusion framework for detecting intention in assistive robotic devices, which fuses information from a wearable video camera and wearable inertial measurement unit (IMU) sensors. A Naive Bayes classifier is used to predict the intent to move from the IMU data and the object classification results from the video data. The proposed approach achieves an accuracy of 85.2% in detecting movement intention.

I. INTRODUCTION

The process of translating intention into action is an intuitive, natural and seamless phenomenon for a healthy individual. However, people suffering from neuromuscular or cerebrovascular diseases, e.g. stroke, cerebral palsy, paraplegia or limb amputation [1], as well as people with neuromuscular weakening as seen in the elderly population, are often not able to translate intention into action. In order to effectively control exoskeletons [2], soft robotic gloves [3] and prosthetic hands [4], it is imperative to detect intention accurately and translate it into a control signal. The estimated intention could be used to generate a high-level abstract control signal (for reaching, grasping and manipulating an object), simplifying the control mechanism and enabling instant responses. The proposed system detects the user's intention using contextual information from a vision-based sensor (a monocular camera), which captures the user's field of view, and from inertial sensors worn on the upper and lower arm, which capture proprioceptive information.

Fig. 1. System architecture: (a) sensor placement on the user's body; (b) and (d) confidence score calculation from camera and IMU data; (c) fusion of the two confidence scores to predict the intended action.

II. METHODOLOGY

Vision plays a key role in motor control, providing direction, guidance and feedback for upper- and lower-limb movements. During any intended action, we generally try to bring the object of interest into our field of view; thus, hand-eye co-ordination plays a crucial role in ADL. We used YOLO (You Only Look Once) [5], a convolutional neural network (CNN) based architecture, for object recognition. It was trained on the COCO dataset containing 80 classes. The network predicts the object class probabilities, as well as the location of the object in the image, directly from the image in a single evaluation. Once objects are detected in an image frame, we convert this information into a likelihood for the object of interest based on temporal information.
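As an illustration of this step, the short sketch below shows how per-frame detector output could be turned into per-object presence indicators obj_k[i], which drive the temporal likelihood described next. It is a minimal sketch rather than the authors' implementation: the detect_objects wrapper, the tracked class subset and the confidence threshold are assumptions made for illustration only.

```python
# Minimal sketch (illustrative assumption, not the authors' implementation):
# converting per-frame detector output into the presence indicators obj_k[i]
# used by the temporal visual-intent score described next. `detect_objects`
# stands in for a YOLO inference call returning (class_name, confidence) pairs.

TRACKED_CLASSES = ["cup", "bottle", "apple", "cell phone"]  # subset of the 80 COCO classes
CONF_THRESHOLD = 0.5  # assumed confidence cut-off; the paper does not specify one


def presence_indicators(frame, detect_objects):
    """Return a dict mapping each tracked class k to obj_k[i] (1 if visible, else 0)."""
    detections = detect_objects(frame)  # e.g. [("cup", 0.87), ("person", 0.93)]
    visible = {name for name, conf in detections if conf >= CONF_THRESHOLD}
    return {name: int(name in visible) for name in TRACKED_CLASSES}
```

In this form, each new camera frame yields one binary indicator per object of interest, which the temporal score in Eq. 1 then integrates over time.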
The likelihood, or visual intent score, for the k-th object is denoted by P_k[i]; it increases with time if the object stays in the field of view (obj_k[i] = 1) and decays exponentially if it leaves the field of view (obj_k[i] = 0). α_cam and β_cam are the rise and decay rate constants of the visual score. obj_k[i] denotes the output of the CNN for the k-th object in the image frame at the i-th time stamp. P_k0 is the prior probability, or bias term, representing the likelihood score obtained from prior knowledge of the k-th object.

$$
P_k[i] =
\begin{cases}
(1 - e^{-x}) + P_{k0}\, e^{-x}, & \text{if } obj_k[i] = 1 \\
P_k[i-1]\, e^{-y}, & \text{if } obj_k[i] = 0
\end{cases}
\tag{1}
$$

where
x = t[i-1] + α_cam (t[i] - t[i-1]),
y = β_cam (t[i] - t[i-1]),
and P_k0 = P_k[i], t[i] = 0, if obj_k[i] = 0.

Movement intention transforms into action through the movement of the upper or lower limbs. In order to counter the 'Midas touch' [6] problem and capture the motor intention, two IMUs were placed on the user: one on the forearm (near the wrist joint) and the other on the upper arm (near the elbow joint). The 3-axis accelerometer and gyroscope data from both IMU sensors are used for classifying motion intentions. A 0.5 s signal window with a stride of 1 over all four data streams was used for feature calculation (mean and variance). The classification was separated into four parallel pipelines, each giving a class output (as shown in Fig. 1(d)). The features from these four data streams are then fed into four different Naive Bayes (NB) classifiers (namely NB_0[i], NB_1[i], NB_2[i] and NB_3[i]), each denoting the same intentional action. A voting function is used to obtain the intention output from the four NB classifiers. In the voting function (given by Eq. 2), each classifier (NB_k[i], where k = 0, 1, 2, 3) votes for a class j based on the output of