Information Processing and Management 62 (2025) 104094
Available online 17 February 2025
0306-4573/© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
Hierarchical chat-based strategies with MLLMs for spatio-temporal action detection✩

Xuyang Zhou a, Ye Wang a,∗, Fei Tao b, Hong Yu a, Qun Liu a

a Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
b Shanghai Feathervibe Tech, Shanghai, China
ARTICLE INFO
Keywords:
Spatio-temporal action detection
Multimodal large language model
Prompt learning
Chain of thought
ABSTRACT
Spatio-temporal action detection (STAD) in football matches is challenging due to the subtle,
fast-paced actions involving multiple participants. Multimodal large language models (MLLMs)
often fail to capture these nuances with standard prompts, producing results lacking the detailed
descriptions needed to improve visual features. To address this issue, we propose a prompt
strategy called Hierarchical Chat-Based Strategies (HCBS). Specifically, this strategy enables
MLLMs to form a chain of thought (CoT), gradually generating content with increasingly
detailed information. We conduct extensive experiments on three datasets: 126 videos from MultiSports, 43 videos from J-HMDB, and 147 videos from UCF101-24, all restricted to their football sections. Compared with the baselines, our method improves performance by 30.3%, 26.1%, and 25.5% on these three datasets, respectively. Through the Hierarchy Verification experiment, we demonstrate that HCBS effectively guides MLLMs in generating hierarchical descriptions.
Additionally, using HCBS to guide MLLMs in content generation, we create a frame-level
description dataset with 120,511 frame descriptions across the three datasets. Our code and
dataset are available at the following link: https://github.com/TristanAlkaid/HCBS/.
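The hierarchical chat strategy summarized above can be pictured as a multi-turn prompting loop in which each level's answer is fed back as context for a finer-grained query. The sketch below illustrates this idea only; `LEVEL_PROMPTS`, `hierarchical_chat`, and the stubbed `ask` callback are hypothetical names and do not reflect the paper's actual prompts or interface.

```python
# Illustrative sketch of a hierarchical chat-based prompting loop.
# The MLLM call is replaced by a deterministic stub; all names here are
# hypothetical, not the paper's actual implementation.

LEVEL_PROMPTS = [
    "Describe the overall scene in this frame.",
    "List the players involved and their rough locations.",
    "Describe the fine-grained action each player performs.",
]

def hierarchical_chat(ask, frame_id):
    """Query the model level by level, feeding each answer back as context."""
    context = f"Frame {frame_id}."
    descriptions = []
    for prompt in LEVEL_PROMPTS:
        answer = ask(f"{context} {prompt}")
        descriptions.append(answer)
        context = f"{context} {answer}"  # accumulate detail for the next, finer level
    return descriptions

def stub_ask(message):
    # Stand-in for an MLLM: infer the current level from how many earlier
    # stub answers are already present in the accumulated context.
    level = message.count("[detail]") + 1
    return f"[detail] level-{level} description"

print(hierarchical_chat(stub_ask, 0))
# -> ['[detail] level-1 description', '[detail] level-2 description', '[detail] level-3 description']
```

The key design point the sketch captures is that each round conditions on all previous rounds, so the model's output moves from coarse scene context toward per-player action detail rather than being asked for everything in a single prompt.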
1. Introduction
Spatio-temporal action detection (STAD) is an essential and highly challenging task in video understanding (Ding et al., 2024;
Zhang et al., 2020). This task is academically valuable and plays an important role in various practical applications. For example,
in sports event analysis, STAD can be used to identify key game segments, enabling more precise analysis of critical moments and
improving decision-making in real time. In intelligent security, STAD can help automatically detect abnormal behaviors, leading
to faster and more accurate identification of potential threats and enhancing overall safety. Furthermore, STAD proves valuable in
assistive healthcare by analyzing patient behavior patterns, facilitating enhanced diagnostic accuracy, improving early detection of
health issues, and supporting rehabilitation efforts with personalized intervention strategies (Yan et al., 2023).
Early research in video understanding indicates that action-related video tasks involve processing time-series data, and the key aspect of such tasks is extracting the temporal features of actions. The complexity of this process is further increased by factors such as varying perspectives and substantial variation in background environments (Jiang et al., 2023; Rani & Kumar, 2020). Building on these initial classification tasks, object detection extends the requirements not only by predicting the target category but also by
✩ This work is partly supported by the National Natural Science Foundation of China (62136002, 62221005 and 62306056), and the Natural Science Foundation of Chongqing, China (cstc2022ycjh-bgzxm0004 and CSTB2023NSCQ-LZX0006).
∗ Corresponding author.
E-mail addresses: 2021212778@stu.cqupt.edu.cn (X. Zhou), wangye@cqupt.edu.cn (Y. Wang), taofei@feathervibe.com (F. Tao), yuhong@cqupt.edu.cn (H. Yu), liuqun@cqupt.edu.cn (Q. Liu).
https://doi.org/10.1016/j.ipm.2025.104094
Received 23 October 2024; Received in revised form 13 January 2025; Accepted 6 February 2025