On Recognizing Actions in Still Images via Multiple Features

Fadime Sener 1, Cagdas Bas 2, and Nazli Ikizler-Cinbis 2

1 Computer Engineering Department, Bilkent University, Ankara, Turkey
2 Computer Engineering Department, Hacettepe University, Ankara, Turkey

Abstract. We propose a multi-cue approach for recognizing human actions in still images, in which relevant object regions are discovered and utilized in a weakly supervised manner. Our approach does not require any explicitly trained object detector or part/attribute annotation. Instead, a multiple instance learning approach is applied over sets of object hypotheses in order to represent the objects relevant to the actions. We test our method on the extensive Stanford 40 Actions dataset [1] and achieve a significant performance gain over the state-of-the-art. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images, and that such an object representation is well suited for use in conjunction with other visual features.

1 Introduction

Recognizing actions in still images has recently gained attention in the vision community due to its broad applicability across domains. In news photographs, for example, understanding what people are doing is especially important from a retrieval point of view. Unlike videos, which convey actions through motion and appearance, still images convey the action information via the pose of the person and the surrounding object/scene context. Objects are especially important cues for identifying the type of the action. Previous studies verify this observation [2-4] and show that the identification of objects plays an important role in action recognition.
In this paper, we approach the problem of identifying related objects from a weakly supervised point of view and explore the use of Multiple Instance Learning (MIL) for finding candidate object regions, as well as its effect on recognition. Our approach does not use any explicit object detector or part/attribute annotation during training. Instead, multiple object hypotheses are generated via the objectness measure [5]. We then utilize a MIL classifier to learn the related object(s) amongst the noisy set of object region candidates. Besides the features extracted from candidate object regions, we evaluate various features that can be utilized for effective recognition of actions in still images. In our evaluation, we consider facial features in addition to features extracted within person regions, and also features that describe the global image.

A. Fusiello et al. (Eds.): ECCV 2012 Ws/Demos, Part III, LNCS 7585, pp. 263-272, 2012.
© Springer-Verlag Berlin Heidelberg 2012
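The weakly supervised step described above — treating each image as a bag of objectness-generated region hypotheses and letting a MIL classifier single out the region that explains the action label — can be sketched as follows. This is an illustrative toy implementation of the standard max-instance MIL assumption with a simple perceptron-style update, not the authors' actual classifier; the 2-D points stand in for real region descriptors, and all function names are hypothetical.

```python
# Toy sketch of multiple instance learning (MIL) over object hypotheses.
# Each image is a "bag" of instance feature vectors (one per candidate
# region); a bag is labeled positive if at least one instance is positive.
# Training alternates witness selection (the max-scoring instance per bag)
# with a perceptron-style update on that witness.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_mil(bags, labels, epochs=50, lr=0.1):
    """Fit a linear instance scorer (w, b) from bag-level labels in {+1, -1}."""
    dim = len(bags[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for bag, y in zip(bags, labels):
            scores = [dot(w, x) + b for x in bag]
            i = max(range(len(bag)), key=lambda k: scores[k])  # witness
            pred = 1 if scores[i] > 0 else -1
            if pred != y:  # update only on the witness instance
                w = [wi + lr * y * xi for wi, xi in zip(w, bag[i])]
                b += lr * y
    return w, b

def predict_bag(w, b, bag):
    """A bag is positive iff its best-scoring instance is positive."""
    return 1 if max(dot(w, x) + b for x in bag) > 0 else -1

# Positive bags hide one "relevant object" instance among noise;
# negative bags contain noise instances only.
bags = [
    [(-1.0, -1.0), (2.0, 2.0)],    # positive: noise + relevant region
    [(0.0, -2.0), (2.5, 1.5)],     # positive: noise + relevant region
    [(-2.0, -2.0), (-1.0, 0.0)],   # negative: noise only
    [(-1.5, -2.5), (0.0, -1.0)],   # negative: noise only
]
labels = [1, 1, -1, -1]
w, b = train_mil(bags, labels)
```

On this toy data the learned scorer separates the hidden "relevant" instances from the noise, so `predict_bag` labels all four training bags correctly even though supervision was given only at the bag level — mirroring how the paper's MIL step identifies action-relevant regions without region-level annotation.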