AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models
Se Jung Kwon¹*, Jeonghoon Kim¹, Jeongin Bae¹,⁴†, Kang Min Yoo¹,²,³, Jin-Hwa Kim²,³,
Baeseong Park¹, Byeongwook Kim¹, Jung-Woo Ha², Nako Sung¹ and Dongsoo Lee¹

¹NAVER CLOVA  ²NAVER AI Lab  ³SNU AIIS  ⁴KAIST

*Corresponding author: sejung.kwon@navercorp.com
†Work done while at NAVER CLOVA
Abstract
There is growing interest in adapting large-scale language models with parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression have not been thoroughly explored. Model compression can reduce memory footprints, enable low-precision computations, and ultimately deliver cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task. Specifically, AlphaTuning employs binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10× compression ratio under 4-bit quantization and a >1,000× reduction in the number of trainable parameters.
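To make the factorization concrete, the following is a minimal PyTorch sketch of greedy binary-coding quantization with per-row scaling factors, followed by the AlphaTuning-style setup in which the binary codes are frozen and only the scaling factors receive gradients. The function names, the per-row grouping, and the greedy procedure are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of binary-coding quantization (BCQ) and alpha-only tuning,
# assuming per-row scaling factors and a greedy residual factorization.
import torch

def bcq_factorize(w: torch.Tensor, num_bits: int = 4):
    """Greedily factorize weights (out_features, in_features) into
    num_bits binary matrices in {-1, +1} and per-row scaling factors."""
    residual = w.clone()
    alphas, binaries = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)            # binary codes in {-1, 0, +1}
        b[b == 0] = 1.0                     # break ties toward +1
        a = residual.abs().mean(dim=1)      # one scaling factor per output row
        residual = residual - a.unsqueeze(1) * b
        alphas.append(a)
        binaries.append(b)
    return torch.stack(alphas), torch.stack(binaries)

def bcq_reconstruct(alphas, binaries):
    # w_hat = sum_i alpha_i (broadcast per row) * B_i
    return (alphas.unsqueeze(-1) * binaries).sum(dim=0)

# AlphaTuning-style adaptation: freeze binaries, train only scaling factors.
w = torch.randn(8, 16)
alphas, binaries = bcq_factorize(w, num_bits=4)
alphas = alphas.requires_grad_(True)        # trainable, task-specific
binaries = binaries.detach()                # frozen, shared across tasks
loss = bcq_reconstruct(alphas, binaries).sum()
loss.backward()                             # gradients flow to alphas only
```

Since the binary matrices dominate the storage and are shared, each downstream task only needs to keep its own copy of the (much smaller) scaling factors.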
1 Introduction
Self-supervised learning has enabled pre-trained language models (PLMs) to grow steadily in parameter count (e.g., Brown et al. (2020); Devlin et al. (2019)). We expect model scaling of PLMs to continue, especially for Transformers (Vaswani et al., 2017), because their general capability follows a power law in parameter size, exhibiting "the high-level predictability and appearance of useful capabilities" (Ganguli et al., 2022). Therefore, Transformer-based
PLMs have been studied with great enthusiasm for various applications, including natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Smith et al., 2022; Rae et al., 2021; Hoffmann et al., 2022a; Chowdhery et al., 2022; Kim et al., 2021a), automatic speech recognition (Baevski et al., 2020), and computer vision (He et al., 2022; Xie et al., 2022).
Despite the impressive zero- or few-shot learning performance of PLMs, additional adaptation steps (e.g., fine-tuning on a target task) are required to further enhance performance on downstream tasks. Since each downstream task needs to load and store its own adaptation outcome, adapting PLMs with a limited number of trainable parameters is crucial for efficient deployment when multiple distinct tasks are served (Li et al., 2018). Thus, various parameter-efficient adaptation techniques have been proposed, such as adapter modules (Houlsby et al., 2019), low-rank adaptation (Hu et al., 2022), prefix-tuning (Li and Liang, 2021), prompt tuning (Liu et al., 2021a; Gao et al., 2020), and p-tuning (Liu et al., 2021b).
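As a concrete illustration of such schemes, below is a minimal low-rank-adaptation-style sketch in PyTorch; the class name, initialization, and fixed rank are illustrative assumptions rather than the exact formulation of any cited method.

```python
# A minimal sketch of low-rank parameter-efficient adaptation:
# the pre-trained weight is frozen and only a rank-r update is trained.
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 4):
        super().__init__()
        self.frozen = nn.Linear(in_features, out_features, bias=False)
        self.frozen.weight.requires_grad_(False)  # pre-trained weights stay fixed
        # Low-rank factors: A initialized small, B at zero so the update starts at 0.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Only the low-rank update (B @ A) x receives gradients.
        return self.frozen(x) + x @ self.lora_a.T @ self.lora_b.T
```

Only the two low-rank factors are trained per task, so per-task storage scales with the rank rather than with the full weight matrix; this is the parameter efficiency the schemes above share.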
Although parameter-efficient adaptation schemes can significantly reduce the number of trainable parameters, we note that the memory footprint at inference time is not reduced compared to that of the PLM itself¹. To enable efficient deployment of multiple downstream tasks, we combine model compression with parameter-efficient adaptation. We argue that previous model compression techniques were not practical solutions in terms of parameter efficiency for adaptation. For example, Quantization-Aware Training (QAT) (Jacob et al., 2018; Esser et al., 2020) can perform full fine-tuning coupled with model compression; however, each task then requires dedicated storage as large as a compressed PLM. Our key observation to achieve a compression-aware parameter-efficient
¹In practice, the adaptation is usually implemented by adding small additional parameters to PLMs.