Action MACH: A Spatio-temporal Maximum Average Correlation Height Filter for Action Recognition

Mikel D. Rodriguez*, Javed Ahmed†, Mubarak Shah*
* Computer Vision Lab, University of Central Florida, Orlando, FL.
† Military College of Signals, Rawalpindi, Pakistan.

Abstract

In this paper we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. A common limitation of template-based methods is their inability to generate a single template from a collection of examples. Action MACH captures intra-class variability by synthesizing a single filter for a given action class. We generalize the traditional MACH filter to video (a 3D spatio-temporal volume) and to vector-valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector-valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Finally, we perform an extensive set of experiments and compare our method with some of the most recent approaches in the field using publicly available datasets, as well as two new annotated human action datasets which include actions performed in classic feature films and sports broadcast television.

Figure 1. Our framework is capable of recognizing a wide range of human actions under different conditions. Depicted on the left are a set of publicly available datasets which include dancing, sport activities, and typical human actions such as walking, jumping, and running. Depicted on the right column are examples of two action classes (kissing and slapping) from a series of feature films.

1. Introduction

Action recognition constitutes one of the most challenging problems in computer vision, yet effective solutions capable of recognizing motion patterns in uncontrolled environments could lend themselves to a host of important application domains, such as video indexing, surveillance, human-computer interface design, analysis of sports videos, and the development of intelligent environments.

Temporal template matching emerged as an early solution to the problem of action recognition, and a gamut of approaches falling under this general denomination has been proposed over the years. Early advocates of temporal matching, such as Polana and Nelson [17], developed methods for recognizing human motions by obtaining spatio-temporal templates of motion and periodicity features from a set of optical flow frames. These templates were then used to match test samples against reference motion templates of known activities. Essa and Pentland [9] generated spatio-temporal templates based on optical flow energy functions to recognize facial action units. Bobick et al. [5] computed Hu moments of motion-energy images and motion-history images to create action templates from a set of training examples, representing each action by the mean and covariance matrix of the moments; recognition was performed using the Mahalanobis distance between the moment description of the input and each of the known actions.

Efros et al. [8] proposed an approach to recognizing human actions at low resolutions which consisted of a motion descriptor based on smoothed and aggregated optical flow.
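To make the temporal-template idea surveyed above concrete, the following is a minimal sketch of a Bobick-style pipeline [5] under our own simplifying assumptions: a motion-history image is accumulated from frame differences, summarized by its seven Hu moments, and a test clip is assigned to the action class whose training moment statistics lie at the smallest Mahalanobis distance. Function and parameter names (build_mhi, hu_descriptor, tau, etc.) are illustrative, not taken from the original papers.

```python
import numpy as np
import cv2

def build_mhi(frames, tau=30.0, diff_thresh=25):
    """Accumulate a motion-history image from a list of grayscale uint8 frames.
    Recently moving pixels keep high values; older motion decays by 1 per frame."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        motion_mask = cv2.absdiff(curr, prev) > diff_thresh
        mhi = np.maximum(mhi - 1.0, 0.0)   # decay old motion
        mhi[motion_mask] = tau             # stamp new motion with the maximum value
    return mhi / tau                       # normalize to [0, 1]

def hu_descriptor(image):
    """Seven Hu moment invariants of a motion-history image, log-scaled for stability."""
    hu = cv2.HuMoments(cv2.moments(image)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def mahalanobis(x, mean, cov):
    # With few training examples the covariance may be near-singular;
    # a small ridge term keeps the inverse well behaved.
    diff = x - mean
    cov_reg = cov + 1e-6 * np.eye(cov.shape[0])
    return float(np.sqrt(diff @ np.linalg.inv(cov_reg) @ diff))

def classify(test_frames, class_stats):
    """class_stats maps an action name to (mean, covariance) of training Hu descriptors."""
    d = hu_descriptor(build_mhi(test_frames))
    return min(class_stats, key=lambda name: mahalanobis(d, *class_stats[name]))
```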
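As stated in the abstract, the cost that usually burdens template matching comes from sliding a spatio-temporal template over every location of a video volume; evaluating the filter response in the frequency domain avoids that. The sketch below illustrates only this generic FFT-based 3D correlation trick, not the Action MACH filter synthesis itself; the function name correlate3d_fft and the toy data are our own.

```python
import numpy as np

def correlate3d_fft(volume, template):
    """Cross-correlate a 3D spatio-temporal template with a video volume via the
    frequency domain: correlation = IFFT( FFT(volume) * conj(FFT(template)) ).
    Cost is O(N log N) in the number of voxels, versus O(N * M) for direct
    correlation with an M-voxel template."""
    shape = volume.shape                    # zero-pad the template to the volume size
    V = np.fft.fftn(volume, s=shape)
    T = np.fft.fftn(template, s=shape)
    return np.fft.ifftn(V * np.conj(T)).real  # peaks mark candidate action locations

# Toy usage: plant a small random template inside a larger volume and recover its offset.
rng = np.random.default_rng(0)
template = rng.standard_normal((8, 16, 16))      # (frames, rows, cols)
volume = 0.1 * rng.standard_normal((40, 64, 64))
volume[10:18, 20:36, 30:46] += template          # plant the template at offset (10, 20, 30)
peak = np.unravel_index(np.argmax(correlate3d_fft(volume, template)), volume.shape)
print(peak)                                       # expected near (10, 20, 30)
```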