SIMULTANEOUS BEAM SELECTION AND USERS SCHEDULING EVALUATION IN A VIRTUAL
WORLD WITH REINFORCEMENT LEARNING
Ilan Correa¹, Ailton Oliveira¹, Bojian Du², Cleverson Nahum¹, Daisuke Kobuchi², Felipe Bastos¹, Hirofumi Ohzeki²,
João Borges¹, Mohit Mehta³, Pedro Batista⁴, Ryoma Kondo², Sundesh Gupta³, Vimal Bhatia³, and Aldebaro Klautau¹
¹Universidade Federal do Pará - LASSE (www.lasse.ufpa.br), Av. Perimetral S/N, Belém, Pará, Brazil,
²Team MLAB-RL, Morikawa Narusue Laboratory, The University of Tokyo, Japan,
³Team IITI-RL, Indian Institute of Technology Indore, India,
⁴Ericsson Research, 164 80 Stockholm, Sweden
NOTE: Corresponding author: Ilan Correa, ilan@ufpa.br
Abstract – The fifth generation of mobile networks evolved to serve applications with distinct requirements, which results
in high management complexity due to simultaneous real-time tasks. In the physical layer, the Base Station (BS) must choose
code words that allow proper data exchange with the served users, while in higher layers it must choose which users to serve
in a given transmission opportunity. There are approaches based on Machine Learning (ML) to solve these combined tasks.
However, due to the large number of possible inputs, a challenge is the availability of data to train the models. In some
cases, there may not even exist a predefined optimal answer to use as a “label” for supervised approaches. In this paper,
we evaluate solutions for the combined problems of beam selection and user scheduling with Reinforcement Learning (RL),
which does not need labels and therefore suits problems without a predefined answer. The algorithms were proposed for
Problem Statement 6 of the challenge organized by the International Telecommunication Union (ITU) in 2021, in which they
ranked as finalists. We compare the RL approaches with each other and with the baselines developed for the challenge in
terms of the cumulative reward received by the agents. The paper also shows how the actions taken by the trained agents
affect network operation by comparing the number of packets transmitted, which is highly related to the proper selection
of users and code words.
Keywords – Beam selection, reinforcement learning, user scheduling, virtual world
1. INTRODUCTION
The fifth generation (5G) of mobile wireless communications and beyond envisages, among other features,
higher data rates through the use of greater bandwidths. Due to the scarcity of available spectrum at the
sub-6 GHz frequencies that are currently most used, wider bandwidths are being reserved for mobile
communications at millimeter Wave (mmWave) bands, such as 28 GHz and 60 GHz [1]. A drawback of the
mmWave bands is the higher attenuation in comparison to sub-6 GHz frequencies. Thus, Multiple-Input
Multiple-Output (MIMO) techniques are among the core technologies of 5G development at mmWave bands,
since they provide better directionality of the electromagnetic wave, which helps to circumvent the high
path attenuation [2]. MIMO can also increase system capacity over the same available time-frequency
resources, significantly increasing the spectral efficiency [3].
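As a rough illustration of how a directional code word can recover part of the path attenuation, the sketch below builds a codebook for a uniform linear array and selects the code word with the largest beamforming gain for a given channel vector. The choice of a DFT codebook, the single-user narrowband channel, and all function names (`dft_codebook`, `beamforming_gain`, `select_beam`) are assumptions made for illustration only; they are not taken from the paper or from the challenge code.

```python
import cmath
import math


def dft_codebook(num_antennas):
    """Return num_antennas unit-norm DFT code words (assumed codebook, illustrative)."""
    scale = 1.0 / math.sqrt(num_antennas)
    return [
        [scale * cmath.exp(-2j * math.pi * k * n / num_antennas)
         for n in range(num_antennas)]
        for k in range(num_antennas)
    ]


def beamforming_gain(codeword, channel):
    """Compute |w^H h|^2 for code word w and channel vector h."""
    inner = sum(w.conjugate() * h for w, h in zip(codeword, channel))
    return abs(inner) ** 2


def select_beam(codebook, channel):
    """Exhaustively search the codebook; return (best index, best gain)."""
    gains = [beamforming_gain(w, channel) for w in codebook]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best, gains[best]


# Hypothetical example: 8 antennas and a channel with unit-modulus entries
# that is perfectly aligned with code word 3.
num_antennas = 8
codebook = dft_codebook(num_antennas)
channel = [math.sqrt(num_antennas) * c for c in codebook[3]]
best_index, best_gain = select_beam(codebook, channel)
print(best_index, 10 * math.log10(best_gain))
```

Under these assumptions the search returns index 3 with a gain of approximately 8, that is, about 9 dB, which is the ideal array gain for 8 antennas; the other code words are orthogonal to this channel and yield essentially zero gain. An exhaustive search like this one is only a reference point, since it needs the channel vector to be known.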
On top of the previously described MIMO-based Physical Layer (PHY), the Base Station (BS) must
efficiently perform the so-called Radio Resource Allocation (RRA) [4], or user scheduling, to serve
the users. The user devices can be classified into one or more use cases of 5G networks, namely:
enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communications (URLLC), and massive
Machine Type Communications (mMTC).
In other words, these networks must serve devices with very distinct requirements, such as the
Internet of Things (IoT), terrestrial vehicles, Unmanned Aerial Vehicles (UAVs), pedestrians, and
infrastructure. Furthermore, users’ mobility and interactions with the environment make the task
even harder to solve.
These devices may have data available from several sensors, which could eventually be used to
optimize the MIMO and the user scheduling operations [5, 6]. As an example, there is a trend toward
autonomous vehicles, which can take advantage of the increasing connectivity options, resulting in
applications such as Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), and
Vehicle-to-everything (V2X) communications. These vehicles can carry devices such as cameras, Light
detection and ranging (Lidar), Global Navigation Satellite System (GNSS) receivers, etc. The sensors
are used, for example, for detection of pedestrians and other vehicles, interpretation of signaling
on the streets, automatic and semi-automatic driving, and so on.
Given this possibly large amount of data and the increasing difficulty of the several tasks
performed in the network, ML techniques have been adopted in several works [7, 8], especially Deep
Neural Networks (DNNs). DNNs are, in general, optimized for an application with supervised
learning approaches, which may require a prohibitive
ITU Journal on Future and Evolving Technologies, Volume 3, Issue 2, September 2022
©International Telecommunication Union, 2022
Some rights reserved. This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/.
More information regarding the license and suggested citation, additional permissions and disclaimers is available at:
https://www.itu.int/en/journal/j-fet/Pages/default.aspx