SIMULTANEOUS BEAM SELECTION AND USERS SCHEDULING EVALUATION IN A VIRTUAL WORLD WITH REINFORCEMENT LEARNING

Ilan Correa¹, Ailton Oliveira¹, Bojian Du², Cleverson Nahum¹, Daisuke Kobuchi², Felipe Bastos¹, Hirofumi Ohzeki², João Borges¹, Mohit Mehta³, Pedro Batista⁴, Ryoma Kondo², Sundesh Gupta³, Vimal Bhatia³, and Aldebaro Klautau¹

¹ Universidade Federal do Pará - LASSE (www.lasse.ufpa.br), Av. Perimetral S/N, Belém, Pará, Brazil
² Team MLAB-RL, Morikawa Narusue Laboratory, The University of Tokyo, Japan
³ Team IITI-RL, Indian Institute of Technology Indore, India
⁴ Ericsson Research, 164 80 Stockholm, Sweden

NOTE: Corresponding author: Ilan Correa, ilan@ufpa.br

Abstract

The fifth generation of mobile networks evolved to serve applications with distinct requirements, which results in high management complexity due to simultaneous real-time tasks. In the physical layer, code words that allow proper data exchange between the Base Station (BS) and the served users must be chosen, while in higher layers the BS must choose which users to serve in a given transmission opportunity. There are approaches based on Machine Learning (ML) to solve these combined tasks. However, due to the large number of possible inputs, a challenge is the availability of data to train the models. In some cases, there may not even exist a predefined optimal answer to use as a “label” for supervised approaches. In this paper, we evaluate solutions for the combined problems of beam selection and user scheduling with Reinforcement Learning (RL), which does not need labels and therefore suits problems without a predefined answer. The algorithms were proposed for Problem Statement 6 of the challenge organized by the International Telecommunication Union (ITU) in 2021, in which they ranked as finalists.
We compare the approaches by the cumulative reward received by the agents and benchmark the different RL approaches against baselines developed for the challenge. The paper also shows how the actions taken by the trained agents affect network operation by comparing the number of packets transmitted, which is highly related to the proper selection of users and code words.

Keywords – Beam selection, reinforcement learning, user scheduling, virtual world

1. INTRODUCTION

The fifth generation (5G) of mobile wireless communications and beyond envisages, among other features, higher data rates through the use of greater bandwidths. Due to the scarcity of available spectrum at the sub-6 GHz frequencies that are currently the most used, wider bandwidths are being reserved for mobile communications at millimeter Wave (mmWave) bands, such as 28 GHz and 60 GHz [1]. A drawback of the mmWave bands is the higher attenuation in comparison to sub-6 GHz frequencies. Thus, Multiple-Input Multiple-Output (MIMO) techniques are among the core technologies of 5G development at mmWave bands, since they provide better directionality of the electromagnetic wave, which helps to circumvent the high path attenuation [2]. MIMO can also increase system capacity over the same available time-frequency resources, significantly increasing the spectral efficiency [3].

On top of the previously described MIMO-based Physical Layer (PHY), the Base Station (BS) must efficiently perform the so-called Radio Resource Allocation (RRA) [4], or user scheduling, to serve the users. The served devices can be classified into one or more use cases of the 5G networks, namely: enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communications (URLLC), and massive Machine Type Communications (mMTC).
In other words, these networks must serve devices with very distinct requirements, such as the Internet of Things (IoT), terrestrial vehicles, Unmanned Aerial Vehicles (UAVs), pedestrians, and infrastructure. Furthermore, users’ mobility and interactions with the environment make the task even harder to solve.

These devices may have data available from several sensors, which could eventually be used to optimize the MIMO and the user scheduling operations [5, 6]. As an example, there is a trend toward autonomous vehicles, which can take advantage of the increasing connectivity options, resulting in applications such as Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), and Vehicle-to-everything (V2X) communications. These vehicles can deploy devices such as cameras, Light detection and ranging (Lidar), and Global Navigation Satellite System (GNSS), among others. The sensors are related, for example, to detection of pedestrians and other vehicles, interpretation of signaling on the streets, automatic and semi-automatic driving, and so on.

Given this potentially large amount of data and the increasing difficulty of the several tasks performed in the network, ML techniques have been adopted in several works [7, 8], especially Deep Neural Networks (DNNs). DNNs are, in general, optimized for an application with supervised learning approaches, which may require a prohibitive

ITU Journal on Future and Evolving Technologies, Volume 3, Issue 2, September 2022

©International Telecommunication Union, 2022. Some rights reserved. This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/. More information regarding the license and suggested citation, additional permissions and disclaimers is available at: https://www.itu.int/en/journal/j-fet/Pages/default.aspx