Abstract— Even though reconfigurable intelligent surfaces (RISs) are adopted in various scenarios to enable a smart radio environment, their real-time operation remains challenging because it requires costly full-dimensional channel estimation combined with either an offline exhaustive search or online exhaustive beam training. Deep learning (DL) tools are favored to enable feasible solutions. In this work, we propose two adversarial bandit-based schemes with low training overhead and high energy efficiency that achieve outstanding performance gains compared to reference DL-based reflection beamforming methods. The resulting deep learning models are also discussed using state-of-the-art model quality prediction trends.

Index Terms— Reconfigurable intelligent surfaces, reflection beamforming prediction, deep learning, adversarial bandit, exponential-weight algorithm for exploration and exploitation, follow the perturbed leader.

I. INTRODUCTION

Reconfigurable intelligent surfaces (RISs) enable control of the wireless propagation environment by smartly controlling signal reflections via their large number of low-cost elements [1-2]. Unfortunately, due to the additional channel links between the RIS and its associated transmitter and intended receivers (Figure 10 in [1] and Figure 1 in [2]), the large gain comes at the expense of more channel estimation overhead [3-6]. In practice, obtaining this channel knowledge may require a large and possibly prohibitive training overhead, which is the main challenge for real-time RIS operation. As such, machine learning (ML) has been introduced and is increasingly used to enhance the implementation of various components within the 5G radio access network (RAN) [4].
In [5], the transmit beamforming (BF) matrix at the base station and the phase shift matrix at the RIS are jointly designed by leveraging the policy-based deep deterministic policy gradient (DDPG). However, the phase shifts are assumed to be continuous. The authors in [6] present a novel RIS hardware architecture along with two solutions, based on compressive sensing and deep reinforcement learning, with negligible training overhead. Nevertheless, the choice of the training scheme used to build the dataset is not analyzed in depth. A detailed survey that introduces the interplay between AI and RIS is found in [7]. In this letter, we propose an efficient reinforcement learning based scheme to improve upon the method proposed in [6]. The main contributions are as follows:

• We propose an adversarial bandit approach based on the exponential-weight algorithm for exploration and exploitation (EXP3). To show the merits of the proposed scheme, we conduct extensive simulations using the publicly available, accurate ray-tracing based DeepMIMO datasets [8], with the ‘O1’ scenario.
• To reduce the computational complexity, the follow the perturbed leader (FPL) scheme is discussed.
• To compare the quality of the state-action deep neural network models used with the reference method [6] and with the proposed ones (EXP3 and FPL), we leverage state-of-the-art techniques such as the power law (PL) exponents [9].

II. SYSTEM MODEL AND PROBLEM FORMULATION

To enable the practical implementation of RIS-aided communication systems, new path loss models [2] and open-source channel models [2], [8] have been developed. To reproduce the results and perform a fair comparison, we adopt the system and channel model in [6].

A. System model

The transmitter-receiver communication is aided by a RIS having M reconfigurable elements. For the sake of simplicity, we assume that both the transmitter and the receiver are equipped with a single antenna.
For generalization, one can adopt the signal model from [2]. An OFDM-based transmission with $K$ subcarriers is adopted. The direct channel per subcarrier $k$ between the transmitter and the receiver is denoted by $h_{\mathrm{TR},k} \in \mathbb{C}$, whereas the links via the RIS are represented by the $M \times 1$ complex-valued vectors $\mathbf{h}_{\mathrm{T},k}, \mathbf{h}_{\mathrm{R},k} \in \mathbb{C}^{M \times 1}$. By neglecting the direct path, the received signal can be written as $y_k = \mathbf{h}_{\mathrm{R},k}^{T} \boldsymbol{\Psi}_k \mathbf{h}_{\mathrm{T},k} s_k + n_k$, where $\boldsymbol{\Psi}_k \in \mathbb{C}^{M \times M}$ is the RIS interaction diagonal matrix, and $s_k$ and $n_k$ are the transmitted symbol per subcarrier $k$ and the receive noise with zero mean and variance $\sigma_n^2$. With $P_{\mathrm{T}}$ being the total transmit power, the following power per subcarrier

Adversarial Bandit Approach for Stand Alone RIS Operation

Messaoud Ahmed Ouameur, Dương Tuấn Anh Lê, Gwanggil Jeon, Felipe A.P. De Figueiredo and Daniel Massicotte

M. Ahmed Ouameur and D. Massicotte are with the Université du Québec à Trois-Rivières, Department of Electrical and Computer Eng., 3351 Boul. des Forges, Trois-Rivières, Qc, Canada, G9A 5H7. messaoud.ahmed.ouameur@uqtr.ca and daniel.massicotte@uqtr.ca
D.T.A. Lê is with the Faculty of Information Technology, VNU-HCM University of Science, Vietnam. 20c14001@student.hcmus.edu.vn
G. Jeon is with the Department of Embedded Systems Engineering, College of Information Technology, Incheon National University, Incheon, South Korea. gjeon@inu.ac.kr
F.A.P. De Figueiredo is with the National Institute of Telecommunications, Santa Rita do Sapucaí - Minas Gerais, Brazil. felipe.figueiredo@inatel.br
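As an illustration of the received-signal model of Section II-A (direct path neglected), the following minimal Python sketch evaluates the cascaded link for a single subcarrier. It relies on assumptions that are not stated in the text: the diagonal entries of the RIS interaction matrix are taken to be unit-modulus phase shifts, the channel vectors are random placeholders rather than DeepMIMO channels, and the helper names `received_signal` and `cophasing_phases` are hypothetical. The co-phasing rule is a textbook reference point for a known channel, not the scheme proposed in this letter.

```python
import cmath
import random


def received_signal(h_r, h_t, phases, s=1 + 0j, noise=0j):
    """Return y_k = h_R^T Psi h_T s_k + n_k for a single subcarrier.

    Assumption (not stated in the text): Psi = diag(exp(j * phase_m)),
    i.e. each RIS element applies a unit-modulus phase shift.
    """
    psi_diag = [cmath.exp(1j * p) for p in phases]
    # Psi is diagonal, so h_R^T Psi h_T collapses to an element-wise sum.
    gain = sum(hr * w * ht for hr, w, ht in zip(h_r, psi_diag, h_t))
    return gain * s + noise


def cophasing_phases(h_r, h_t):
    """Phases that align every cascaded path h_R[m] * h_T[m] (illustration only)."""
    return [-cmath.phase(hr * ht) for hr, ht in zip(h_r, h_t)]


if __name__ == "__main__":
    random.seed(0)
    M = 8  # number of RIS elements (hypothetical value)
    # Placeholder channels: i.i.d. complex Gaussian entries, not ray-traced data.
    h_t = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)]
    h_r = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)]

    random_phases = [random.uniform(0, 2 * cmath.pi) for _ in range(M)]
    y_rand = received_signal(h_r, h_t, random_phases)
    y_aligned = received_signal(h_r, h_t, cophasing_phases(h_r, h_t))
    print("|y| with random phases:   ", abs(y_rand))
    print("|y| with co-phased paths: ", abs(y_aligned))
```

Because the interaction matrix is diagonal, the noiseless gain is a sum of $M$ cascaded terms, and with perfect channel knowledge its magnitude is maximized when all terms share the same phase. Obtaining that knowledge is exactly the overhead discussed in the Introduction.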