Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

Danush Khanna 1, Aditya Kumar Guru 1, Srivarshinee Sridhar 2, Zidan Ahmed 3, Rubhav Bahirwani 1, Meetu Malhotra 4, Vinija Jain 5, Aman Chadha 6, Kripabandhu Ghosh 7, Amitava Das 8

1 Manipal University Jaipur, 2 Vellore Institute of Technology, 3 NIT Silchar, 4 Harrisburg University of Science and Technology, 5 Meta AI, USA, 6 Amazon AI, USA, 7 IISER Kolkata, 8 BITS Pilani, Goa

Abstract

Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches, such as pruning, quantization, early exits, and speculative decoding, often require retraining or architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens whose representations have converged; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length; and (iv) Adaptive Matryoshka Quantization, which adapts numerical precision on the fly. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (≤ 0.2). To foster future research in this area, we make our implementation publicly available. [1]

Work done outside of role at Meta. Work done outside of role at Amazon.

[1] https://anonymous.4open.science/r/Quicksilver/

1 Inference-Time Speed: Why It Matters

LLMs now exceed human-level performance on many NLP tasks [OpenAI, 2023; Bubeck et al., 2023], yet inference, not training, has become the dominant bottleneck in deployment [Patterson and Gonzalez, 2021; Sanh et al., 2022]. Real-world usage patterns make inference responsible for over 90% of total energy and compute cost [Patterson et al., 2022; Desislavov et al., 2021], positioning inference-time optimization as a critical frontier.

User Interactivity. LLMs in real-time applications such as chatbots or translation tools demand sub-second token-level latency [Chen et al., 2023a; Levy et al., 2023]. Even slight delays degrade user experience [Shuster et al., 2022; Ni et al., 2022], while micro-optimizations can compound into dramatically improved responsiveness.

Scalability and Cost. Widespread LLM adoption stresses infrastructure. Faster inference boosts throughput without linearly scaling compute [Barham et al., 2022]. Strategies such as early exits [Schwartz et al., 2020; Elbayad et al., 2020a], adaptive computation [Graves, 2016], and speculative decoding [Leviathan et al., 2022] reduce cost but often require retraining or architectural coordination.

Environmental Impact. Inference, executed millions of times daily, is the primary contributor to