Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3288–3305, December 7-11, 2022. ©2022 Association for Computational Linguistics

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Se Jung Kwon 1, Jeonghoon Kim 1, Jeongin Bae 1,4, Kang Min Yoo 1,2,3, Jin-Hwa Kim 2,3, Baeseong Park 1, Byeongwook Kim 1, Jung-Woo Ha 2, Nako Sung 1 and Dongsoo Lee 1
1 NAVER CLOVA  2 NAVER AI Lab  3 SNU AIIS  4 KAIST

Abstract

There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10× compression ratio under 4-bit quantization and >1,000× reduction in the number of trainable parameters.

1 Introduction

Self-supervised learning facilitates the increased number of parameters used to construct pre-trained language models (PLMs) (e.g., Brown et al. (2020); Devlin et al. (2019)).
We expect model scaling of PLMs to continue, especially for Transformers (Vaswani et al., 2017), because their general capability follows a power law in parameter size, exhibiting "the high-level predictability and appearance of useful capabilities" (Ganguli et al., 2022). Therefore, Transformer-based PLMs have been studied with great enthusiasm for various applications, including natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Smith et al., 2022; Rae et al., 2021; Hoffmann et al., 2022a; Chowdhery et al., 2022; Kim et al., 2021a), automatic speech recognition (Baevski et al., 2020), and computer vision (He et al., 2022; Xie et al., 2022). Despite the impressive zero- or few-shot learning performance of PLMs, additional adaptation steps (e.g., fine-tuning on a target task) are required to further enhance performance on downstream tasks. Since each downstream task needs to load/store independent adaptation outcomes, adapting PLMs with limited trainable parameters is crucial for efficient deployment when we aim to deploy multiple instances of distinct tasks (Li et al., 2018). Thus, various parameter-efficient adaptation techniques have been proposed, such as adapter modules (Houlsby et al., 2019), low-rank adaptation (Hu et al., 2022), prefix-tuning (Li and Liang, 2021), prompt tuning (Liu et al., 2021a; Gao et al., 2020), and p-tuning (Liu et al., 2021b). Although trainable parameters can be significantly reduced by parameter-efficient adaptation schemes, we notice that the memory footprints for inference are not reduced compared to those of PLMs¹. To enable efficient deployment of multiple downstream tasks, we incorporate model compression into parameter-efficient adaptation. We argue that previous model compression techniques were not practical solutions in terms of parameter-efficiency for adaptations.

Corresponding author: sejung.kwon@navercorp.com. Work done while at NAVER CLOVA.
For example, Quantization-Aware Training (QAT) (Jacob et al., 2018; Esser et al., 2020) can perform full fine-tuning coupled with model compression; however, each task then needs dedicated memory storage as large as that of a compressed PLM. Our key observation to achieve a compression-aware parameter-efficient

¹ In practice, the adaptation is usually implemented by adding small additional parameters to PLMs.
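The binary-coding quantization that AlphaTuning builds on, as described in the abstract, can be illustrated with a minimal NumPy sketch of the standard greedy scheme: each weight matrix is factorized as a sum of {-1, +1} binary matrices times per-row scaling factors, and under AlphaTuning only the scaling factors would be updated during adaptation. This is an illustrative sketch, not the authors' implementation; the function names and the choice of greedy per-row factorization are assumptions.

```python
import numpy as np

def bcq_quantize(W, num_bits=4):
    """Greedy binary-coding quantization: W ≈ sum_i alpha_i * B_i,
    where each B_i is a {-1, +1} matrix (frozen after quantization)
    and each alpha_i is a per-row scaling factor (the trainable part)."""
    residual = W.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        B = np.sign(residual)
        B[B == 0] = 1.0                                # avoid zero entries in the code
        alpha = np.abs(residual).mean(axis=1, keepdims=True)  # optimal scale for fixed B
        alphas.append(alpha)
        codes.append(B)
        residual -= alpha * B                          # quantize the remaining error
    return alphas, codes

def bcq_reconstruct(alphas, codes):
    """Rebuild the approximate full-precision weight from its factors."""
    return sum(a * B for a, B in zip(alphas, codes))

# Toy demonstration: quantize a random weight matrix and measure the error.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))
alphas, codes = bcq_quantize(W, num_bits=4)
W_hat = bcq_reconstruct(alphas, codes)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Because the binary codes are shared and frozen, per-task storage would reduce to the small set of scaling factors, which is the source of the parameter-efficiency claimed in the abstract.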