This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE SYSTEMS JOURNAL

Decentralized Learning for Opportunistic Spectrum Access: Multiuser Restless Multiarmed Bandit Formulation

Himanshu Agrawal and Krishna Asawa

Abstract—In opportunistic spectrum access, each secondary user selects a channel from a pool of multiple channels based on its local observations. The challenge is to learn the best channel in terms of availability, as the channel availability statistics are unknown. To learn these unknown statistics, a novel decentralized multiuser learning technique, termed DSEE for channel selection in dynamic networks (DSEE-CSDN), is proposed. DSEE-CSDN allows secondary users to enter the network during different time slots; thus, the number of secondary users is not known beforehand. Moreover, the availability status of the independent channels is assumed to change according to a two-state restless Markov chain model, which, in practice, is more realistic than the independent and identically distributed channel state model. Thus, the problem is formulated as a stochastic multiuser restless multiarmed bandit. The proposed algorithm achieves system-wide order-optimal performance under self-play. Results indicate that DSEE-CSDN achieves a logarithmic order of regret. Furthermore, collisions and switching costs account for only about 5% and 2% of the total time slots, respectively. In addition, DSEE-CSDN can achieve probabilistic fairness in channel selection without any preagreement among users.

Index Terms—Opportunistic spectrum access (OSA), decentralized algorithms, cognitive radio (CR), reinforcement learning, restless multiarmed bandits (MABs).

I. INTRODUCTION

The demand for electromagnetic radio spectrum has increased exponentially in the last decade due to the introduction of new technologies, such as device-to-device communication, and access paradigms, such as long-term evolution (LTE) and LTE-advanced networks [1]. The spectrum is a natural and limited resource; thus, it must be utilized efficiently. It has been shown in [2] and [3] that the radio spectrum is massively underutilized with respect to time, frequency, and location. Cognitive radio (CR) addresses this underutilization by accessing bands that are momentarily available [4]. It senses the surrounding environment to collect information and reconfigures its parameters, such as transmission power, carrier frequency, and modulation technique. This solution is known as opportunistic spectrum access (OSA) [5].

Manuscript received February 5, 2019; revised July 22, 2019; accepted August 31, 2019. (Corresponding author: Himanshu Agrawal.)
The authors are with the Department of Computer Science and Engineering, Jaypee Institute of Information Technology, Noida 201304, India (e-mail: himanshu.agrawal@jiit.ac.in; krishna.asawa@jiit.ac.in).
Digital Object Identifier 10.1109/JSYST.2019.2943361

OSA using CR considers two types of users: first, the primary user (PU), or licensed user, which can transmit its data at any instant; and second, the secondary user (SU), or unlicensed user, which can start transmitting only when the channel is free (not occupied by a PU) [6]. SUs in the network sense a part of the licensed electromagnetic spectrum occupied by the PUs to identify available frequencies for transmission. Depending on the geographic location and activity patterns of the PUs, some parts of the spectrum are more likely to be available than others. The available frequency bands can offer a better quality of service (QoS) in terms of data rate, interference, and delay.
However, to identify such frequency bands, the SU has to learn the availability statistics of all the channels. The aim is to identify idle bands (due to inactive PUs) and use them for transmission without causing any harmful interference to the licensed users.

OSA can be formulated in two different scenarios: centralized and decentralized. In a centralized scenario [7], [8], a central controller is required to assign different channels to devices; however, it incurs high communication costs and is prone to single-node failure. In a decentralized scenario, there are two approaches. In the first approach, SUs exchange information with one another [9], [10]. In the second approach, the network consists of independent, selfish, and noncooperative users that operate temporarily, so there is no interuser information exchange [11], [12]. Users¹ are not even aware of the number of users in the network. The distributed users select channels based on their local observations and channel availability history. This phenomenon of distributed learning and multiple access can be modeled as a stochastic multiuser restless multiarmed bandit (MURMAB) problem.

The goal of an effective decentralized policy is to identify the top M channels as early as possible and to orthogonalize the M users on them perfectly, without any preagreement or information exchange. To satisfy these constraints, a learning-based channel selection policy (CSP) is proposed, termed deterministic sequencing of exploration and exploitation for channel selection in dynamic networks (DSEE-CSDN). It learns the number of users and the mean availability of the channels to identify the M-best channels. The proposed policy allows all users to share and access the best channels. The major contributions of this article are as follows.

¹ A user refers to an “SU” unless otherwise mentioned.

1937-9234 © 2019 IEEE.
Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.