Information Processing and Management 62 (2025) 104094
Available online 17 February 2025
0306-4573/© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
Hierarchical chat-based strategies with MLLMs for spatio-temporal action detection✩

Xuyang Zhou a, Ye Wang a,∗, Fei Tao b, Hong Yu a, Qun Liu a

a Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
b Shanghai Feathervibe Tech, Shanghai, China
ARTICLE INFO
Keywords:
Spatio-temporal action detection
Multimodal large language model
Prompt learning
Chain of thought
ABSTRACT
Spatio-temporal action detection (STAD) in football matches is challenging due to the subtle,
fast-paced actions involving multiple participants. Multimodal large language models (MLLMs)
often fail to capture these nuances with standard prompts, producing results lacking the detailed
descriptions needed to improve visual features. To address this issue, we propose a prompt
strategy called Hierarchical Chat-Based Strategies (HCBS). Specifically, this strategy enables
MLLMs to form a chain of thought (CoT), gradually generating content with increasingly
detailed information. We conduct extensive experiments on three datasets: 126 videos from MultiSports, 43 videos from J-HMDB, and 147 videos from UCF101-24, all restricted to their football sections. Compared with the baselines, our method improves performance by 30.3%, 26.1%, and 25.5% on these three datasets, respectively. Through the Hierarchy Verification experiment, we demonstrate that HCBS effectively guides MLLMs in generating hierarchical descriptions.
Additionally, using HCBS to guide MLLMs in content generation, we create a frame-level
description dataset with 120,511 frame descriptions across the three datasets. Our code and
dataset are available at the following link: https://github.com/TristanAlkaid/HCBS/.
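The hierarchical chat strategy summarized above can be pictured as a multi-turn prompting loop in which each level's answer is fed back as context for a finer-grained query. The sketch below illustrates this idea only; `LEVEL_PROMPTS`, `hierarchical_chat`, and the stubbed `ask` callback are hypothetical names and do not reflect the paper's actual prompts or interface.

```python
# Illustrative sketch of a hierarchical chat-based prompting loop.
# The MLLM call is replaced by a deterministic stub; all names here are
# hypothetical, not the paper's actual implementation.

LEVEL_PROMPTS = [
    "Describe the overall scene in this frame.",
    "List the players involved and their rough locations.",
    "Describe the fine-grained action each player performs.",
]

def hierarchical_chat(ask, frame_id):
    """Query the model level by level, feeding each answer back as context."""
    context = f"Frame {frame_id}."
    descriptions = []
    for prompt in LEVEL_PROMPTS:
        answer = ask(f"{context} {prompt}")
        descriptions.append(answer)
        context = f"{context} {answer}"  # accumulate detail for the next, finer level
    return descriptions

def stub_ask(message):
    # Stand-in for an MLLM: infer the current level from how many earlier
    # stub answers are already present in the accumulated context.
    level = message.count("[detail]") + 1
    return f"[detail] level-{level} description"

print(hierarchical_chat(stub_ask, 0))
# -> ['[detail] level-1 description', '[detail] level-2 description', '[detail] level-3 description']
```

The key design point the sketch captures is that each round conditions on all previous rounds, so the model's output moves from coarse scene context toward per-player action detail rather than being asked for everything in a single prompt.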
1. Introduction
Spatio-temporal action detection (STAD) is an essential and highly challenging task in video understanding (Ding et al., 2024;
Zhang et al., 2020). This task is academically valuable and plays an important role in various practical applications. For example,
in sports event analysis, STAD can be used to identify key game segments, enabling more precise analysis of critical moments and
improving decision-making in real time. In intelligent security, STAD can help automatically detect abnormal behaviors, leading
to faster and more accurate identification of potential threats and enhancing overall safety. Furthermore, STAD proves valuable in
assistive healthcare by analyzing patient behavior patterns, facilitating enhanced diagnostic accuracy, improving early detection of
health issues, and supporting rehabilitation efforts with personalized intervention strategies (Yan et al., 2023).
Early research in video understanding indicates that action-related video tasks involve processing time-series data, and the key aspect of such tasks is extracting the temporal features of actions. The complexity of this process is further increased by factors such as varying perspectives and substantial variation in background environments (Jiang et al., 2023; Rani & Kumar, 2020). Building on these initial classification tasks, object detection extends the requirements not only by predicting the target category but also by
✩ This work is partly supported by the National Natural Science Foundation of China (62136002, 62221005 and 62306056), and the Natural Science Foundation of Chongqing, China (cstc2022ycjh-bgzxm0004 and CSTB2023NSCQ-LZX0006).
∗ Corresponding author.
E-mail addresses: 2021212778@stu.cqupt.edu.cn (X. Zhou), wangye@cqupt.edu.cn (Y. Wang), taofei@feathervibe.com (F. Tao), yuhong@cqupt.edu.cn (H. Yu), liuqun@cqupt.edu.cn (Q. Liu).
https://doi.org/10.1016/j.ipm.2025.104094
Received 23 October 2024; Received in revised form 13 January 2025; Accepted 6 February 2025