Reinforcement Learning on Tor: Prioritizing Performance Compromises Anonymity

Kelei Zhang ∗†, Amanul Islam, Sang-Yoon Chang
Computer Science, University of Colorado Colorado Springs, Colorado Springs, CO 80918
Department of Informatics, Fort Hays State University, Hays, KS 67601
{jzhang5,aislam2,schang2}@uccs.edu

Abstract—The onion router (Tor) protects IP and behavioral anonymity on the Internet by routing network packets through multiple randomly selected relay nodes between the source and the destination. It therefore adds transmission delays, and the latency cost is greater than not using Tor (i.e., than transmitting the packet directly to the destination). Machine learning can be used to learn the Tor circuit delays, given the source and destination, and to identify the circuits with the smallest latency cost. In this paper, however, we show how using reinforcement learning in its by-default, deterministic fashion deprives users of anonymity. More specifically, choosing the performance-best circuit, corresponding to the pure exploitation strategy, provides no anonymity: its selection entropy is zero. Such an approach eliminates the anonymity protection and therefore performs worse than disabling Tor altogether. We therefore recommend a strategic and appropriate application of machine learning, which not only advances the performance but also takes the anonymity into account, for future research to advance Tor and Internet anonymity.

Index Terms—Reinforcement Learning, Tor

I. INTRODUCTION

In the Tor network, user data is encrypted in multiple layers and routed through a randomly selected circuit (a series of relays). This circuit diversity makes it very difficult for adversaries to correlate traffic and compromise user privacy. Tor adds multiple relays between the source and the destination nodes to make it more difficult to link the source and the destination, protecting anonymity.
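The anonymity loss discussed above can be quantified with Shannon entropy over the circuit-selection distribution. The following minimal sketch (the circuit count of eight is an illustrative assumption, not a figure from this paper) contrasts the two extremes: uniform random selection yields the maximal entropy of log2(n) bits, while always selecting the same circuit yields zero bits.

```python
import math

def selection_entropy(probs):
    """Shannon entropy (in bits) of a circuit-selection distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 8  # hypothetical number of candidate circuits

# Pure exploration: uniform choice over n circuits -> log2(n) bits
uniform = [1.0 / n] * n
print(selection_entropy(uniform))        # 3.0

# Pure exploitation: the same circuit is always chosen -> 0 bits
deterministic = [1.0] + [0.0] * (n - 1)
print(selection_entropy(deterministic))  # 0.0
```

Zero entropy means an observer who knows the learned policy can predict the chosen circuit with certainty, which is the sense in which the anonymity protection vanishes.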
Adding relays inherently incurs additional latency compared to not using such an anonymous routing protocol.

Reinforcement learning algorithms learn optimal actions by iteratively estimating the expected rewards of different state-action pairs. During this process, they balance exploration (trying new actions) against exploitation (selecting the action known to provide the best reward) in order to maximize cumulative rewards. In our problem, the action is the Tor circuit selection, and the reward is the latency performance. To highlight the strategies' implications on anonymity, we call the pure exploration strategy Random Explore (always select randomly) and the pure exploitation strategy Deterministic Best (always select the circuit with the best observed performance).

In this study, we apply reinforcement learning to the Tor network and investigate the negative impact on anonymity caused by focusing heavily on exploitation. The Deterministic Best strategy provides zero anonymity because the same circuit is chosen every time.

II. RELATED WORK

Prior works have explored various strategies to enhance performance while balancing the inherent trade-off between speed and anonymity. For instance, Basyoni et al. investigated integrating the QUIC protocol to shorten the Tor handshake process and improve latency performance [1]; Jenson et al. proposed an updated kernel for efficient socket management [2]; and Rochet et al. focused on securing users' location information without compromising latency [3]. Zhang and Chang proposed a source-driven circuit selection scheme in which middle relays actively publish latency information, enabling users to construct low-latency circuits [4]. Although this approach offers significant gains in speed, it also raises anonymity concerns due to the more predictable nature of relay selection.
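The Random Explore and Deterministic Best strategies described above are the two extremes of a standard epsilon-greedy bandit policy. A minimal sketch follows; the circuit names and latency values are hypothetical, and the reward is taken to be the negative of the observed mean latency, an assumption consistent with the latency-performance reward described in this paper.

```python
import random

def choose_circuit(avg_latency, epsilon):
    """Epsilon-greedy circuit selection over observed mean latencies.

    avg_latency: dict mapping circuit id -> mean observed latency (ms).
    epsilon=1.0 is pure exploration (Random Explore);
    epsilon=0.0 is pure exploitation (Deterministic Best).
    """
    if random.random() < epsilon:
        return random.choice(list(avg_latency))   # explore: any circuit
    return min(avg_latency, key=avg_latency.get)  # exploit: lowest latency

# Hypothetical measured latencies (ms) for three candidate circuits
latencies = {"circuit_A": 420.0, "circuit_B": 310.0, "circuit_C": 505.0}

print(choose_circuit(latencies, epsilon=0.0))  # always "circuit_B"
print(choose_circuit(latencies, epsilon=1.0))  # a uniformly random circuit
```

With epsilon=0.0 the same circuit is returned on every call, which is exactly the deterministic, zero-entropy behavior this paper warns against.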
In contrast, this study relies on reinforcement learning to identify low-latency circuits from past experience, eliminating the need for Tor to publish any latency information. In another study, Zhang and Chang explored applying reinforcement learning to the Tor network, starting from Tor's probabilistic policy (the consensus-weight-based circuit selection) and training a reinforcement learner to form its own probabilistic policy for selecting low-latency circuits [5]. That work provided the preliminary ideas and a proof of concept. In contrast, this paper studies how choosing the best-performance circuit through reinforcement learning deprives users of anonymity.

Building on these findings, this work attempts a novel application of reinforcement learning to Tor circuit selection. While reinforcement learning has been successfully applied to advance network routing in centralized contexts such as software-defined networks [6], vehicular networks [7], drone networks [8], and even underwater wireless sensor networks [9], its potential in the decentralized environment of Tor remains largely unexplored. Tor users have the ability to control relay selection and customize circuit construction; however, no prior research has leveraged reinforcement learning techniques to optimize circuit construction decisions. By applying reinforcement learning, our study seeks to be the first to assess the feasibility of such an application.

III. METHODOLOGY

A. Apply Reinforcement Learning on Tor

We apply reinforcement learning for circuit selection while adhering to Tor's circuit construction rules, but without using

2025 Silicon Valley Cybersecurity Conference (SVCC) | 979-8-3315-3429-5/25/$31.00 ©2025 IEEE | DOI: 10.1109/SVCC65277.2025.11133620