Action recognition in real-world videos

Waqas Sultani, Information Technology University, Pakistan
Qazi Ammar Arshad, Information Technology University, Pakistan
Chen Chen, University of North Carolina at Charlotte, USA

Related Concepts

Action recognition and localization
Action-based video summarization

Definition

The goal of human action recognition is to temporally or spatially localize the human action of interest in video sequences. Temporal localization (i.e., indicating the start and end frames of the action in a video) is referred to as frame-level detection. Spatial localization, which is more challenging, means identifying the pixels within each action frame that correspond to the action. This setting is usually referred to as pixel-level detection. In this chapter, we use the terms action, activity, and event interchangeably.

Background

The three main ingredients of action recognition research are visual features, machine learning methodology, and datasets. Recent years have witnessed a tremendous increase in research and development in all three areas. Several new visual features have been proposed, ranging from handcrafted local and global features to deeply learned visual features for action recognition and detection. A wide range of machine learning techniques has been applied to achieve robust action classification. Most action classification methods are built around the supervised approach [11,5,10]. Since obtaining video labels for the supervised approach is time-consuming and costly, several weakly supervised [3,16,36,23,40] and unsupervised approaches [29,38,15] have been proposed.

The availability of diverse and real-world representative datasets plays a crucial role in research and development in any field. Several large-scale, diverse, real-world representative datasets have been introduced in recent years.
These datasets include videos from sports, movies, and daily life, as well as person-environment interaction videos [9,24,1,21,18,6,7,30,13].

In what follows, we provide a brief review of some of the most important visual feature techniques, machine learning approaches for learning action classifiers, and some of the recent action datasets.

arXiv:2004.10774v1 [cs.CV] 22 Apr 2020

Visual Features for Action Recognition

To recognize and localize human action in videos, several visual features have recently been proposed. Good visual features are invariant to scale, rotation, affine transformation, brightness changes, occlusion, camera motion, and position. Overall, there are two types of features, i.e., handcrafted features and features learned through deep networks. In handcrafted features, there are further