Reinforcement Learning on Tor: Prioritizing Performance Compromises Anonymity
Kelei Zhang∗†, Amanul Islam∗, Sang-Yoon Chang∗
∗Computer Science, University of Colorado Colorado Springs, Colorado Springs, CO 80918
†Department of Informatics, Fort Hays State University, Hays, KS 67601
{jzhang5,aislam2,schang2}@uccs.edu
Abstract—Onion routing (Tor) protects IP and behavioral anonymity on the Internet by routing network packets through multiple randomly selected relay nodes between the source and the destination. It therefore adds transmission delay, and the latency cost is greater than when Tor is not used (i.e., when the packet is transmitted directly to the destination). Machine learning can be used to learn the Tor circuit delays, given the source and destination, and thereby identify the circuits with the smallest latency cost. In this paper, however, we show how using reinforcement learning in its by-default, deterministic fashion can deprive users of anonymity. More specifically, always choosing the performance-best circuit, corresponding to the pure exploitation strategy, provides no anonymity, with an entropy of zero. Such an approach eliminates the anonymity protection and performs worse than disabling Tor altogether. We therefore recommend a strategic and appropriate application of machine learning that not only advances performance but also takes anonymity into account, to guide future research on Tor and Internet anonymity.
Index Terms—Reinforcement Learning, Tor
I. INTRODUCTION
In the Tor network, user data is encrypted in multiple layers and routed through a randomly selected circuit (a series of relays). This circuit diversity makes it very difficult for adversaries to correlate traffic and compromise user privacy. Tor adds multiple relays between the source and the destination nodes to make it more difficult to track the source-destination communication and thereby protect anonymity. Adding relays, however, inherently incurs additional latency compared to not using such an anonymous routing protocol.
Reinforcement learning algorithms learn optimal actions by iteratively estimating the expected rewards of different state-action pairs. During this process, they balance exploration (trying new actions) and exploitation (selecting the action known to provide the best reward) in order to maximize cumulative reward. In our problem, the action is the Tor circuit selection, and the reward is the latency performance. To highlight the strategies' implications for anonymity, we call the pure exploration strategy Random Explore (always select randomly) and the pure exploitation strategy Deterministic Best (always select the circuit with the best observed performance).
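The two strategies can be sketched with a toy bandit-style simulation. The per-circuit latencies and the `sample_latency` helper below are illustrative assumptions for exposition, not measurements from our experiments:

```python
import random

# Hypothetical per-circuit base latencies in ms (illustrative values only).
BASE_LATENCY = {0: 120.0, 1: 300.0, 2: 450.0}

def sample_latency(circuit, rng):
    """Simulated latency observation for one request over a circuit."""
    return BASE_LATENCY[circuit] + rng.uniform(-20.0, 20.0)

def random_explore(circuits, rng):
    """Pure exploration (Random Explore): pick a circuit uniformly at random."""
    return rng.choice(circuits)

def deterministic_best(avg_latency):
    """Pure exploitation (Deterministic Best): pick the lowest-mean-delay circuit."""
    return min(avg_latency, key=avg_latency.get)

rng = random.Random(0)
circuits = list(BASE_LATENCY)

# Warm-up with Random Explore to estimate each circuit's latency.
samples = {c: [] for c in circuits}
for _ in range(30):
    c = random_explore(circuits, rng)
    samples[c].append(sample_latency(c, rng))
avg = {c: sum(v) / len(v) for c, v in samples.items() if v}

# Once the estimates settle, Deterministic Best returns the same circuit
# on every request -- this repetition is what removes the anonymity.
choices = {deterministic_best(avg) for _ in range(100)}
print(len(choices))  # 1: a single circuit is chosen every time
```

In this sketch the fixed choice is harmless only to performance; Section I's point is that the unchanging selection is exactly what an observer can exploit.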
In this study, we apply reinforcement learning to the Tor network and investigate the negative impact on anonymity caused by focusing heavily on exploitation. The Deterministic Best strategy provides zero anonymity, because the same circuit is chosen every time.
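The zero-anonymity claim can be made concrete with Shannon entropy over the circuit-selection distribution; the circuit count below is an assumed illustration, not a value from our evaluation:

```python
import math

def selection_entropy(probs):
    """Shannon entropy (in bits) of a circuit-selection probability distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

n = 8  # assumed number of candidate circuits, for illustration

# Random Explore selects uniformly, giving the maximum entropy log2(n).
uniform = [1.0 / n] * n
print(selection_entropy(uniform))        # 3.0 bits

# Deterministic Best puts all probability on one circuit: zero entropy,
# i.e., an observer knows exactly which circuit carries the traffic.
deterministic = [1.0] + [0.0] * (n - 1)
print(selection_entropy(deterministic))  # 0.0 bits
```

Any strategy between these extremes (e.g., epsilon-greedy) lands between 0 and log2(n) bits, which is the performance-anonymity trade-off this paper examines.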
II. RELATED WORK
Prior works have explored various strategies to enhance
performance while balancing the inherent trade-off between
speed and anonymity. For instance, Basyoni et al. investigated
integrating the QUIC protocol to shorten the Tor handshake
process and improve latency performance [1]; Jansen et al.
proposed an updated kernel for efficient socket management [2];
and Rochet et al. focused on securing users' location
information without compromising latency [3]. Zhang
and Chang proposed a source-driven circuit selection scheme
in which middle relays actively publish latency information,
enabling users to construct low-latency circuits [4]. Although
this approach offers significant gains in speed, it also raises
concerns about anonymity due to the more predictable nature
of relay selection. In contrast, this study relies on reinforce-
ment learning to identify the low-latency circuit based on past
experience, eliminating the need for Tor to publish any latency
information. In another study, Zhang and Chang explored
applying reinforcement learning to the Tor network, beginning
with Tor's probabilistic policy (the consensus weight for circuit
selection) as the initial policy and training reinforcement
learning to learn its own probabilistic policy for selecting
low-latency circuits [5]. That work provided the preliminary
ideas and proof of concept. In contrast, this paper studies how
choosing the best-performing circuit through reinforcement
learning deprives users of anonymity.
Building on these findings, this work attempts a novel
application of reinforcement learning for Tor circuit selection.
While reinforcement learning has been successfully applied to
advance network routing in centralized contexts such as
software-defined networks [6], vehicular networks [7], drone
networks [8], and even underwater wireless sensor networks [9],
its potential in the decentralized environment of Tor remains
largely unexplored. Tor users have the ability to control relay
selection and customize circuit construction; however, no prior
research has leveraged reinforcement learning techniques to
optimize circuit-construction decisions. By applying reinforcement
learning, our study seeks to be the first to assess the
feasibility of such an application.
III. METHODOLOGY
A. Apply Reinforcement Learning on Tor
We apply reinforcement learning for circuit selection while
adhering to Tor’s circuit construction rules, but without using
2025 Silicon Valley Cybersecurity Conference (SVCC) | 979-8-3315-3429-5/25/$31.00 ©2025 IEEE | DOI: 10.1109/SVCC65277.2025.11133620