Citation: Li, B.; Zhang, H.; He, P.; Wang, G.; Yue, K.; Neretin, E. Hierarchical Maneuver Decision Method Based on PG-Option for UAV Pursuit-Evasion Game. Drones 2023, 7, 449. https://doi.org/10.3390/drones7070449

Academic Editor: Diego Gonzalez-Aguilera

Received: 23 April 2023; Revised: 30 June 2023; Accepted: 4 July 2023; Published: 6 July 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
Hierarchical Maneuver Decision Method Based on PG-Option
for UAV Pursuit-Evasion Game
Bo Li 1, Haohui Zhang 1, Pingkuan He 1, Geng Wang 1,*, Kaiqiang Yue 1 and Evgeny Neretin 2
1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China; libo803@nwpu.edu.cn (B.L.); zhanghaohui@mail.nwpu.edu.cn (H.Z.); npuhpk@163.com (P.H.); ykq15929955434@163.com (K.Y.)
2 School of Robotic and Intelligent Systems, Moscow Aviation Institute, 125993 Moscow, Russia; e.s.neretin@mai.ru
* Correspondence: wanggeng@nwpu.edu.cn; Tel.: +86-133-8922-3600
Abstract: To address the autonomous decision-making problem in the unmanned aerial vehicle (UAV) pursuit-evasion game, this paper proposes a hierarchical maneuver decision method based on the PG-option. Firstly, comprehensively considering the possible situational relationships between the two sides, this paper designs four maneuver decision options: advantage game, quick escape, situation change and quick pursuit; each option is trained with Soft Actor-Critic (SAC) to obtain the corresponding meta-policy. In addition, to avoid a high-dimensional state space in the hierarchical model, this paper combines the policy gradient (PG) algorithm with the traditional option-based hierarchical reinforcement learning algorithm: the PG algorithm is used to train the policy selector as the top-level strategy. Finally, to solve the problem of frequent switching between meta-policies, this paper introduces delayed selection for the policy selector and uses expert experience to design the termination functions of the meta-policies, which improves the flexibility of policy switching. Simulation experiments show that the PG-option algorithm performs well in the UAV pursuit-evasion game and adapts to various environments by switching to the corresponding meta-policy according to the current situation.
Keywords: UAV pursuit-evasion game; hierarchical reinforcement learning; meta-policy; policy gradient
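The hierarchical control loop described in the abstract, in which a top-level policy selector picks one of four meta-policies that then acts until its termination function fires, can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the selector and termination logic here are random/fixed stand-ins for the PG-trained selector and the expert-designed termination functions, and all function names are hypothetical.

```python
import random

# Illustrative names for the four options from the paper.
META_POLICIES = ["advantage_game", "quick_escape", "situation_change", "quick_pursuit"]

def select_option(state):
    """Top-level policy selector (PG-trained in the paper); random stand-in here."""
    return random.randrange(len(META_POLICIES))

def meta_policy_action(option, state):
    """Low-level action from the chosen meta-policy (SAC-trained in the paper)."""
    return (option, state)  # placeholder action

def should_terminate(option, state, steps_in_option):
    """Termination function; the paper designs these using expert experience.
    Stand-in: terminate each option after a fixed number of steps."""
    return steps_in_option >= 5

def run_episode(horizon=20):
    """Run one episode, recording which meta-policy was active at each step."""
    state, trace = 0, []
    option, steps_in_option = select_option(state), 0
    for _ in range(horizon):
        _action = meta_policy_action(option, state)
        state += 1                      # stand-in environment transition
        steps_in_option += 1
        trace.append(META_POLICIES[option])
        if should_terminate(option, state, steps_in_option):
            # Delayed re-selection happens only when the termination
            # function fires, which limits meta-policy switching frequency.
            option, steps_in_option = select_option(state), 0
    return trace

print(len(run_episode()))
```

The key design point the sketch shows is that the selector is queried only at option boundaries, so the state space seen by the top level stays small and meta-policies are not switched at every simulation step.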
1. Introduction
Unmanned aerial vehicles (UAVs) [1–7] are used in many fields, such as intelligent confrontation [8], target rounding [9] and intelligent transportation [10], because they are unmanned, offer good concealment and avoid casualties. The UAV pursuit-evasion game [11] is a game between two UAVs with competing interests. In the process of UAV pursuit-evasion, the ability to make effective maneuvering decisions [12] to destroy or capture the other side is the key to victory. Among these abilities, real-time intelligent maneuvering decision-making is the core of solving the problem, and the maneuvering decision-making mechanism reflects the intelligence level of a UAV in the pursuit-evasion game. It is therefore necessary to design an effective maneuvering policy for the UAV pursuit-evasion game.
At present, decision algorithms for UAV pursuit-evasion mainly include differential game theory [13], the influence diagram method [14] and heuristic search algorithms [15]. F. Yu et al. [13] take into account the impact of environmental impediments in the pursuit-evasion game between UAVs and UGVs, qualitatively analyze the pursuit problem as a differential game and apply differential game theory to the pursuit-evasion game. Q. Pan et al. [14] propose a cooperative maneuver decision method for multiple unmanned aerial vehicles based on influence diagram theory: a state-predicted influence diagram model is used to analyze situational elements, and an unscented Kalman filter model is used for belief-state updating. Mikhail et al. [15] propose schemes to solve the pursuit-evasion problem using Apollonius