SUBMITTED TO ELSEVIER SCIENCE

Unsupervised Analysis of Everyday Human Activities Using Suffix Trees

Raffay Hamid, Siddhartha Maddi, Aaron Bobick, Irfan Essa
College of Computing, Georgia Institute of Technology
Atlanta, GA 30332-0280 USA
{raffay, maddis, afb, irfan}@cc.gatech.edu

Abstract—Formalizing computational models for everyday human activities remains an open challenge. Many traditional approaches to this end assume prior knowledge about the structure of activities, based on which explicitly defined models are learned in a supervised manner. For a majority of everyday environments, however, the structure of in situ activities is not known a priori. In this paper we investigate knowledge representations and manipulation techniques that facilitate learning of human activities with minimal supervision. The key contribution of this work is the idea that global structural information of human activities can be encoded using a subset of their variable-length event subsequences, and that this encoding is sufficient for activity-class discovery and classification. In particular, we propose the use of a data structure called Suffix Trees as an activity representation to efficiently encode the structure of activities over multiple temporal scales. We prove that the feature space induced by Suffix Trees is representationally superior to some of the previously proposed approaches, and compare such approaches with Suffix Trees in terms of their discriminative power, noise sensitivity, and unsupervised activity-class discovery. Exploiting properties of Suffix Trees, we present a novel by-parts perspective on finding anomalous subsequences of activities, and propose a linear-time algorithm for their efficient detection. Furthermore, we present a mechanism to automatically parse unsegmented and interspliced activities in a stream of detected events.
Comparative results on various kitchen activities collected from multiple human subjects are presented to demonstrate the competence of our framework.

Index Terms—Perceptual Reasoning, Video Analysis, Computer Vision, Concept Learning.

1 INTRODUCTION

Consider a household kitchen where many different activities can take place, such as making omelets, washing dishes, or eating cereal. Each of these activities can be performed in many different ways. To build computational systems that can be useful in such everyday environments, it is not plausible to learn each and every one of the in situ activities in a completely supervised manner. We are therefore interested in systems that can learn to recognize human activities with minimal supervision. Such systems have a variety of potential applications, including helping monitor people's health as they age, fighting crime through improved surveillance, and building smarter robots.

One of the key challenges in building such perceptual systems is the gap that exists between the low-level sensory inputs, such as pixel values or microphone voltages, and the higher-level inferences, such as what dish is being prepared in a kitchen, or whether someone forgot to add salt to it. A natural way to bridge this gap is to have a set of intermediate characterizations that can appropriately channel the low-level perceptual information to the higher-level inference stage. The granularity at which these intermediate characterizations are defined presents a trade-off between how expressive the characterizations are and how robustly they can be detected from low-level sensory data. In the following, we define a set of such intermediate characterizations that we shall use throughout this paper.

Fig. 1. Illustration of an example event: a person shown washing some dishes in the sink of a kitchen. (Labels in the figure: Person, Sink, Fridge, Enter/Exit, Stove, Table, Washer, Shelf 1, Shelf 2, Shelf 3.)
1.1 Elements of Activity Dynamics

One way of looking at everyday environments is in terms of a set of perceptually detectable key-objects [1]. A key-object may be defined as:

Key-object: An object present in an environment that provides functionalities required for the execution of various activities of interest in that environment.

We assume that a list of key-objects for an environment is known a priori. Figure 1 illustrates the key-objects in a kitchen environment. Various operations on the key-objects can be used to define a set of perceptually detectable activity-descriptors. We call these descriptors Events, defined as:

Event: A particular interaction amongst a subset of key-objects over a finite duration of time.

Figure 1 shows an example event of a person washing utensils.
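
The two definitions above can be made concrete with a small data-structure sketch. The Python below is illustrative only and is not the representation used in this paper: the names KEY_OBJECTS, Event, and to_symbols are hypothetical, the key-object list is simply the set of labels shown in Figure 1 (treating Person as a key-object is an assumption), and identifying an event with the set of key-objects it involves is a simplification of "a particular interaction".

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Assumption: the key-object list is the set of labels shown in Figure 1.
KEY_OBJECTS: FrozenSet[str] = frozenset({
    "Person", "Sink", "Fridge", "Enter/Exit", "Stove",
    "Table", "Washer", "Shelf 1", "Shelf 2", "Shelf 3",
})


@dataclass(frozen=True)
class Event:
    """Hypothetical record: an interaction amongst a subset of key-objects
    over a finite duration of time."""
    participants: FrozenSet[str]  # subset of KEY_OBJECTS involved
    start: float                  # start time (units are an assumption: seconds)
    end: float                    # end time, strictly after start

    def __post_init__(self) -> None:
        if not self.participants:
            raise ValueError("an event must involve at least one key-object")
        unknown = self.participants - KEY_OBJECTS
        if unknown:
            raise ValueError(f"unknown key-objects: {sorted(unknown)}")
        finite = math.isfinite(self.start) and math.isfinite(self.end)
        if not (finite and self.start < self.end):
            raise ValueError("an event must span a finite, positive duration")


def to_symbols(events: List[Event]) -> List[int]:
    """Order events by start time and map each distinct participant set to
    an integer symbol, numbered in order of first appearance."""
    ids: Dict[FrozenSet[str], int] = {}
    symbols: List[int] = []
    for ev in sorted(events, key=lambda e: e.start):
        symbols.append(ids.setdefault(ev.participants, len(ids)))
    return symbols


# Example: fridge, then sink (as in Figure 1), then fridge again.
events = [
    Event(frozenset({"Person", "Fridge"}), 0.0, 2.0),
    Event(frozenset({"Person", "Sink"}), 3.0, 10.0),
    Event(frozenset({"Person", "Fridge"}), 11.0, 12.0),
]
print(to_symbols(events))  # [0, 1, 0]
```

Reducing each event to a discrete symbol is one plausible way to obtain the kind of event sequence over which variable-length subsequences can be considered; the actual encoding may differ.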