Hierarchical Models for Activity Recognition

Amarnag Subramanya, Dept. of Electrical Engineering, University of Washington, Seattle, WA 98195
Alvin Raj, Dept. of Computer Science, University of Washington, Seattle, WA 98195
Jeff Bilmes, Dept. of Electrical Engineering, University of Washington, Seattle, WA 98195
Dieter Fox, Dept. of Computer Science, University of Washington, Seattle, WA 98195

Abstract— In this paper we propose a hierarchical dynamic Bayesian network to jointly recognize the activity and environment of a person. The hierarchical nature of the model allows us to implicitly learn data-driven decompositions of complex activities into simpler sub-activities. Our experiments show that the hierarchical model is better able to explain the observed data, leading to better performance. We also show that joint estimation of both the activity and the environment of a person outperforms systems in which either is estimated alone. The proposed model yields about 10% absolute improvement in accuracy over existing systems.

I. INTRODUCTION

In the recent past, advances in wearable sensing and computing devices have made possible the fine-grained estimation of a person's activities over extended periods of time [1]. Interest in human activity recognition stems from a number of applications that rely on accurate inference of the activities a person is performing, ranging from context-aware computing [2] and support for cognitively impaired people [3] to long-term health and fitness monitoring and automatic after-action review of military missions. Bao and Intille [4] used multiple accelerometers placed on a person's body to estimate activities such as standing, walking, or running. Kern et al. [5], [6] and Lukowicz et al. [7] added a microphone to a similar set of accelerometers in order to extract additional context information.
One of the drawbacks of the systems in [5], [7] is that they utilize multiple sensors with measurements taken all over the body, which can lead to unwieldy systems with large battery packs. To overcome this, Lester et al. [1] developed a small low-power sensor board that is mounted at a single location on the body. Once a wearable sensor system is in place, the next logical step is to design algorithms that extract pertinent features from the sensor streams, and classifiers that use these features to infer the activities being performed. [1] also showed how to apply boosting in order to learn activity classifiers based on the sensor data. However, a common drawback of all previously proposed approaches is that they feed the sensor data or features into static classifiers [4], [2] or a bank of temporally independent HMMs [1]. Further, most previously proposed algorithms [1], [4] do not make a distinction between 'complex' and 'simple' activities. In practice, it can be advantageous to decompose complex activities into simpler activities that are easier to learn. Many 'complex' activities that we perform in our daily lives can be broken into smaller, simpler activities. For example, driving a car involves getting into the car, turning on the engine, driving, etc.; getting onto an elevator could comprise calling the elevator, waiting for it to arrive, and so on. In this paper we refer to these simpler activities as sub-activities. Intuitively, it should be easier for a model to learn the simpler sub-activities than the complex activities themselves. In practice, though, it is not entirely clear how a given activity should be split into its constituent sub-activities. Consider the car example above: while a person is in the process of turning on the engine, his motion state is stationary; on the other hand, since he is actually sitting inside the car, his motion state could also be classified as vehicle.
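The decomposition idea above can be sketched concretely. The following toy example (not the paper's model; all names and numbers are illustrative) treats the hidden state as a joint pair (activity, latent sub-activity) in a block-structured Markov chain: sub-activities mix freely within an activity, activity switches are rare, and the sub-activity labels themselves carry no predefined meaning, so their roles are effectively learned from data. The forward algorithm then recovers the activity posterior by marginalizing out the sub-activity:

```python
import numpy as np

# Hypothetical toy setup: two activities, each with two latent
# sub-activities, and discrete observation symbols.
ACTIVITIES = ["walk", "drive"]
N_SUB = 2                      # latent sub-activities per activity
N_OBS = 3                      # discrete observation symbols

rng = np.random.default_rng(0)

# Joint hidden state index: (activity a, sub-activity s) -> a * N_SUB + s
n_states = len(ACTIVITIES) * N_SUB

# Block-structured transition matrix: large mass inside each activity's
# sub-activity block, small mass for switching activities.
T = np.full((n_states, n_states), 0.01)
for a in range(len(ACTIVITIES)):
    blk = slice(a * N_SUB, (a + 1) * N_SUB)
    T[blk, blk] = 0.45
T /= T.sum(axis=1, keepdims=True)

# Each joint (activity, sub-activity) state has its own observation model.
E = rng.dirichlet(np.ones(N_OBS), size=n_states)
pi = np.full(n_states, 1.0 / n_states)  # uniform initial distribution

def forward_activity_posterior(obs):
    """Forward algorithm over the joint (activity, sub-activity) chain,
    then marginalize out the sub-activity to get P(activity | obs)."""
    alpha = pi * E[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
        alpha /= alpha.sum()
    # Sum sub-activity probabilities within each activity's block.
    return alpha.reshape(len(ACTIVITIES), N_SUB).sum(axis=1)

post = forward_activity_posterior([0, 1, 1, 2])
print(dict(zip(ACTIVITIES, post.round(3))))
```

In a trained model, EM would fit the per-sub-activity observation distributions, which is where the data-driven decomposition emerges.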
Thus, a statistically ideal approach is to let the model learn the best constituent sub-activities for a given activity from the data during training. In this paper, we propose a hierarchical dynamic Bayesian network that implicitly learns these sub-activities during training. Another novelty of our work is the joint estimation of both the motion state and the environment. In many situations, the type of activity we perform is constrained by our surroundings (environment). For example, a person inside a building is very unlikely to be driving a car. Similarly, a person is more likely to be going up or down stairs when indoors than when outdoors. In this paper, we propose a model that, in addition to estimating the motion state (activity) of a person, jointly estimates his environment, i.e., whether the person is indoors, outdoors, or in a vehicle. We also show that jointly estimating both the motion state and the environment outperforms systems that estimate them independently. In addition to the above, this paper describes the models used in the first NIST evaluations for the DARPA ASSIST project. While the models proposed here can be applied to any activity recognition task, we use automatic after-action review (AAR) of military missions to explain them. An AAR is essentially a summary of a military mission, created from memory by the mission leader; it reports on the various activities and incidents that took place during the mission. As the duration of the mission increases, it becomes difficult for the leader to remember all the incidents in sufficient detail. The proposed system is intended to help the leader create better summaries of the mission. In our previous work on the same problem [8], [9], we proposed algorithms to jointly infer the activity and location of a person; those systems make use of information from a GPS unit in addition to the sensor streams used in this paper.
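The benefit of joint estimation can be illustrated with a single-frame sketch (again, not the paper's model; activities, environments, likelihoods, and the compatibility table are all made-up values). Two independent classifiers each produce a posterior, but a joint model additionally weighs each (activity, environment) pair by its compatibility, e.g., driving is only plausible in a vehicle:

```python
import numpy as np

activities = ["stationary", "walking", "driving", "stairs"]
environments = ["indoors", "outdoors", "vehicle"]

# Per-frame posteriors from imaginary independent sensor classifiers.
lik_act = np.array([0.2, 0.3, 0.35, 0.15])
lik_env = np.array([0.7, 0.2, 0.1])

# Compatibility prior over (activity, environment) pairs: driving
# requires a vehicle, stairs are far more likely indoors, etc.
compat = np.array([
    # indoors outdoors vehicle
    [1.0,    1.0,     1.0],   # stationary
    [1.0,    1.0,     0.1],   # walking
    [0.0,    0.0,     1.0],   # driving
    [1.0,    0.1,     0.0],   # stairs
])

# Joint posterior over (activity, environment), then normalize.
joint = lik_act[:, None] * lik_env[None, :] * compat
joint /= joint.sum()

# Activity marginal after enforcing the compatibility constraint.
act_post = joint.sum(axis=1)
print("independent:", dict(zip(activities, lik_act)))
print("joint:      ", dict(zip(activities, act_post.round(3))))
```

Here the independent activity classifier favors "driving", but because the environment evidence points indoors, the joint model suppresses it and prefers "walking". The full model extends this coupling over time with transition dynamics rather than applying it frame by frame.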