Selecting Loss-function for unseen (Out-of-distribution) Action Recognition in Videos

Hassanat Lodhi, Department of Computer Science, COMSATS University Islamabad, Abbottabad, Pakistan, lodhihasanat@gmail.com
Aymen Qureshi, Department of Computer Science, COMSATS University Islamabad, Abbottabad, Pakistan, ayment.qureshi3@gmail.com
Areej Bashir, Department of Computer Science, COMSATS University Islamabad, Abbottabad, Pakistan, areejbashir@cuiatd.edu.p
Muhammad Jawad, Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan, mjawad@cuilahore.edu.pk
Muhammad U. S. Khan, Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan, ushahid@cuiatd.edu.pk

Abstract—Recognizing what is happening in a video is essential to many applications, such as video analysis, video surveillance, and human-computer interaction. However, out-of-distribution (OOD) samples often hamper the performance of action recognition systems. The purpose of this work is to develop a system that recognizes human actions in video sequences and also detects OOD samples. We propose a simple model that uses I3D features to identify the actions in a video. To detect unseen (OOD) samples, we analyzed various loss functions that lower the model's confidence more on unseen samples than on seen samples. We evaluated our system's performance on the UCF101 dataset. The results show that the proposed model recognizes seen actions with 91% accuracy and detects unseen actions with 80% accuracy. This research opens up new possibilities for building action recognition systems that operate effectively in real-world environments.

I. INTRODUCTION

For decades, research in computer vision has focused on object detection and action recognition, with improved performance achieved through deep learning models [1].
These deep learning models benefit a wide range of applications, such as VQA [2], visual reasoning [3], and automatic driving. Traditional learning-based object detection techniques assume a closed set of known classes; in reality, however, there exist an infinite number of unknown classes to which a model must generalize. Scheirer et al. [4] call this the “Open Set Classification Problem”. In practical applications with unknown classes, it is important for object detection algorithms to confidently identify unknown objects and to assign known objects to the correct class. Dhamija et al. [5] state that, even when trained on a large training set, many state-of-the-art object detectors have a high false positive rate. Therefore, detectors must avoid categorizing unknown objects into known classes. Compared to the closed-world setting of static learning, the Open-World Object Detection setting is more appropriate because the number, type, and layout of new classes in the world change over time. However, object detection alone may not suffice for applications that require a deep understanding of human behavior and activities, which is where action recognition comes in [6].

Action recognition, the identification of human behaviors and activities in video streams, is used in numerous applications such as video surveillance, human-machine interaction, and sports analysis. Recent advances in deep learning have paved the way for more accurate and robust action recognition models, such as the "Two-Stream" Convolutional Neural Network (CNN) [7]. This model uses CNNs to analyze the spatial and temporal features of video frames and combines them for the final prediction. Other widely used action recognition models include SlowFast networks and 3D CNNs [8], [9]. Action recognition models are effective at analyzing human behaviors and activities in video sequences, but they assume that the activities fall within the range of the training dataset.
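The closed-set assumption described above can be illustrated with a minimal sketch. A softmax classifier always returns one of its known classes, so one common way to flag unfamiliar inputs is to reject predictions whose maximum softmax probability falls below a threshold. The function names, the threshold of 0.5, and the toy logits below are assumptions made for illustration; this is a generic confidence-thresholding baseline, not necessarily the detector used in this paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, threshold=0.5):
    """Return class indices, with -1 marking low-confidence (possibly unseen) samples.

    `threshold` is a hypothetical value; in practice it would be tuned on held-out data.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    confidence = probs.max(axis=-1)   # maximum softmax probability per sample
    labels = probs.argmax(axis=-1)    # closed-set prediction per sample
    return np.where(confidence >= threshold, labels, -1), confidence

# Toy logits for two samples over three known classes (illustrative values only).
logits = np.array([[6.0, 0.5, 0.2],   # peaked scores: confident prediction
                   [1.1, 1.0, 0.9]])  # flat scores: low confidence
labels, confidence = predict_with_rejection(logits, threshold=0.5)
print(labels, confidence)
```

Without the threshold, the second sample would still be assigned to class 0, even though its scores are nearly uniform.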
However, in real-life circumstances, activity distributions can change, and the model may encounter activities that were not part of its training. This can lead to unreliable and inaccurate results with serious implications in real-world settings. Out-of-distribution (OOD) samples impede the performance of action recognition models, and OOD detection is necessary to address this issue. OOD detection is the model's ability to identify activities that lie outside the training distribution. It can be accomplished with different methods, such as uncertainty estimation, anomaly detection, and generative modeling.

This paper proposes a system that aims to bridge this gap. Instead of building a deep, complex model, we study the performance of different loss functions with simple models in recognizing both seen and unseen actions. We created an out-of-distribution detector with a simple dense-layer model using I3D features of videos as