This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE SYSTEMS JOURNAL
Decentralized Learning for Opportunistic Spectrum
Access: Multiuser Restless Multiarmed
Bandit Formulation
Himanshu Agrawal and Krishna Asawa
Abstract—In opportunistic spectrum access, each secondary
user selects a channel from a pool of multiple channels based on
their local observations. The challenge here is to learn the best
channel in terms of availability, as the channel availability statistics
are unknown. To learn these statistics, a novel decentralized
multiuser learning technique, termed DSEE for channel selection
in dynamic networks (DSEE-CSDN), is proposed. DSEE-CSDN allows
secondary users to enter the network during different time slots.
Thus, the number of secondary users is not known beforehand.
Moreover, the availability status of the independent channels is
assumed to change according to a two-state restless Markov chain
model, which, in practice, is more realistic than the independent
and identically distributed channel state model. Thus, the problem
is formulated as a stochastic multiuser restless multiarmed bandit.
The proposed algorithm achieves system-wide order-optimal
performance under self-play. Results indicate that DSEE-CSDN
achieves a logarithmic order of regret. Furthermore, collisions
and switching cost amount to only about 5% and 2% of the total
time slots, respectively. Also, DSEE-CSDN can achieve probabilistic
fairness in channel selection without any preagreement among users.
Index Terms—Opportunistic spectrum access (OSA),
decentralized algorithms, cognitive radio (CR), reinforcement
learning, restless multiarmed bandits (MABs).
I. INTRODUCTION
THE demand for electromagnetic radio spectrum has
increased exponentially in the last decade due to the introduction
of new technologies, such as device-to-device communication, and
access paradigms, such as long-term evolution (LTE) and
LTE-advanced networks [1]. The spectrum is a natural and limited
resource; thus, it must be utilized efficiently. It has been shown
in [2] and [3] that the radio spectrum
is massively underutilized with respect to time, frequency, and
location. Cognitive radio (CR) overcomes this underutilization
by accessing bands that are available at a given instant [4]. It
senses the surrounding environment to collect information and
reconfigures its parameters, such as transmission power, carrier
frequency, and modulation technique. This solution is known
as opportunistic spectrum access (OSA) [5].
Manuscript received February 5, 2019; revised July 22, 2019; accepted August
31, 2019. (Corresponding author: Himanshu Agrawal.)
The authors are with the Department of Computer Science and Engineer-
ing, Jaypee Institute of Information Technology, Noida 201304, India (e-mail:
himanshu.agrawal@jiit.ac.in; krishna.asawa@jiit.ac.in).
Digital Object Identifier 10.1109/JSYST.2019.2943361
The approach of OSA using CR considers two types of users:
first, the primary user (PU) or licensed user, who can transmit
data at any instant; and second, the secondary user (SU) or
unlicensed user, who can start transmitting only when the channel
is free (not occupied by a PU) [6]. SUs in the network sense a
part of the licensed electromagnetic spectrum occupied by the
PUs to identify available frequencies for transmission. Based on
geographic location and activity patterns of PUs, some parts of
the spectrum are more likely to be available than others. The
available frequency bands can offer a better quality of service
(QoS) in terms of data rate, interference, and delay. However,
to identify such frequency bands, the SU has to learn the
availability statistics of all the channels. The aim is to identify
idle bands (due to inactive PUs) and use them for transmission
without causing any harmful interference to the licensed users.
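As a rough illustration of what learning availability statistics involves, the sketch below simulates one channel whose idle/busy state follows a two-state Markov chain and estimates its availability as the fraction of sensed slots found idle. The transition probabilities `p01` and `p10`, the function names, and the slot count are assumptions chosen for illustration; they are not taken from this article, and the sketch senses the channel in every slot, which a real SU would not do.

```python
import random


def stationary_availability(p01, p10):
    """Long-run fraction of idle slots for a two-state Markov chain.

    p01: probability of moving from busy (0) to idle (1) in one slot.
    p10: probability of moving from idle (1) to busy (0) in one slot.
    Both values are hypothetical inputs for this sketch.
    """
    return p01 / (p01 + p10)


def simulate_channel(p01, p10, n_slots, rng):
    """Yield the channel state (1 = idle, 0 = busy) for each slot."""
    # Draw the initial state from the stationary distribution.
    state = 1 if rng.random() < stationary_availability(p01, p10) else 0
    for _ in range(n_slots):
        yield state
        if state == 1:
            state = 0 if rng.random() < p10 else 1
        else:
            state = 1 if rng.random() < p01 else 0


def estimate_availability(p01, p10, n_slots, seed=0):
    """Sample-mean estimate of availability from n_slots sensed slots."""
    rng = random.Random(seed)
    idle_slots = sum(simulate_channel(p01, p10, n_slots, rng))
    return idle_slots / n_slots
```

With enough sensed slots, the sample mean should approach `stationary_availability(p01, p10)`; because successive states are correlated, convergence is slower than it would be for independent and identically distributed observations.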
There are two different scenarios in which OSA can be formulated:
centralized and decentralized. In a centralized scenario [7],
[8], a central controller is required to assign different channels
to devices; however, it incurs high communication costs and
is a single point of failure. In a decentralized scenario, by
contrast, there are two different approaches. In the first
approach, there is information exchange among SUs [9], [10].
In the second approach, the network consists of independent,
selfish, and noncooperative users, which operate temporarily, so
there is no interuser information exchange [11], [12]. Users¹
are not even aware of the number of users in the network.
Different distributed users select channels based on their local
observations and channel availability history. This phenomenon
of distributed learning and multiple access can be modeled as
a stochastic multiuser restless multiarmed bandit (MURMAB)
problem. The goal of an effective decentralized policy is to
identify the top M channels as early as possible and to
orthogonalize the M users across them perfectly, without any
preagreement or information exchange.
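To make this goal concrete, the following sketch ranks channels by an estimated mean availability, keeps the top M, and checks whether a given set of channel choices is collision-free and confined to that top-M set. It is not the DSEE-CSDN policy, only an illustration of the target state; the availability values and all helper names are hypothetical.

```python
from collections import Counter


def top_m_channels(mean_availability, m):
    """Indices of the m channels with the highest estimated availability."""
    ranked = sorted(
        range(len(mean_availability)),
        key=lambda ch: mean_availability[ch],
        reverse=True,
    )
    return ranked[:m]


def count_collisions(choices):
    """Number of users that share a channel with at least one other user."""
    load = Counter(choices)
    return sum(n for n in load.values() if n > 1)


def is_orthogonal(choices, best):
    """True if all users hold distinct channels drawn from the top-M set."""
    return count_collisions(choices) == 0 and set(choices) <= set(best)


if __name__ == "__main__":
    estimates = [0.2, 0.9, 0.5, 0.7]    # assumed availability estimates
    best = top_m_channels(estimates, 2)  # channels 1 and 3
    print(best, is_orthogonal([3, 1], best), is_orthogonal([1, 1], best))
```

In a decentralized setting, each user would have to reach such an assignment using only its own observations, since neither the estimates nor the choices of other users are shared.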
To satisfy these constraints, a learning-based channel selection
policy (CSP) is proposed, termed deterministic sequencing of
exploration and exploitation for channel selection
in dynamic networks (DSEE-CSDN). It learns the number of
users and the mean availability of channels to identify the M-best
channels. The proposed policy will allow all users to share and
access the best channels. The major contributions of this article
are as follows.
¹A user refers to an “SU” unless otherwise mentioned.