[Arxiv 2024] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
Updated: 2026-04-04 07:47:07
Contents
- Introduction
- Method
- Experiments
- References
Introduction
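- The authors propose PrefixQuant, which builds on QuaRot. By keeping outlier tokens lossless during weight-activation (WA) quantization and adding EfficientQAT fine-tuning, it achieves fairly good results under W4A4 static quantization. However, like CushionCache, PrefixQuant keeps all outlier tokens lossless but never discusses how adding the prefix itself affects model accuracy.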
Method
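- The authors find that, for static quantization, the activation distribution of outlier tokens differs markedly from that of other tokens, so without special handling of outlier tokens the calibrated quantization parameters hurt the quantization accuracy of non-outlier tokens. For example, the down_proj inputs of outlier tokens contain massive outliers, while their KV cache is unusually flat. If outlier tokens are kept lossless and calibration is run only on the remaining tokens, the quantization range becomes smaller and quantization accuracy improves.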
- Definition of Outlier Token. Outlier tokens are located from the input activations of down_proj, with the threshold $\eta=64$.
- Number of Outlier Tokens. The number of outlier tokens for each model is measured on the calibration set as $o=\lceil\max(\mathbf O)\rceil$, where $\mathbf O\in\mathbb R^b$ holds the outlier-token counts measured in all transformer blocks.
- Which Tokens to Prefix? The top-$o$ high-frequency outlier tokens + [BOS]
- Block-wise Fine-tuning. EfficientQAT is used to fine-tune the scales and weights.
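The three selection steps in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the note does not preserve the exact outlier criterion, so the sketch assumes a token is an outlier when its maximum absolute down_proj input exceeds $\eta$ times the median over tokens. The function names, the token ids, and the position of [BOS] in the prefix are all hypothetical.

```python
import math
from collections import Counter

import numpy as np

ETA = 64.0  # threshold ratio quoted in the text


def outlier_token_mask(token_max, eta=ETA):
    """Flag outlier tokens. ASSUMED criterion: max |activation| / median > eta."""
    token_max = np.asarray(token_max, dtype=np.float64)
    return token_max / np.median(token_max) > eta


def num_prefix_tokens(per_block_counts):
    """o = ceil(max(O)), with one outlier-token count per transformer block."""
    return math.ceil(max(per_block_counts))


def choose_prefix(outlier_token_ids, o, bos_id):
    """Take the o most frequent outlier token ids, then add [BOS] (order assumed)."""
    top = [tok for tok, _ in Counter(outlier_token_ids).most_common(o)]
    return top + [bos_id]


# Hypothetical per-token maxima of the down_proj input for one sequence.
token_max = [1.0, 0.9, 1.1, 500.0, 1.2, 0.8, 1.0, 300.0]
mask = outlier_token_mask(token_max)  # flags positions 3 and 7

# Hypothetical outlier-token counts, one entry per transformer block.
o = num_prefix_tokens([0.0, 1.6, 2.3, 1.0])

# Hypothetical ids of tokens that were flagged as outliers during calibration.
prefix = choose_prefix([13, 13, 29889, 13, 29889, 278], o=2, bos_id=1)
```

Taking the maximum over blocks means the prefix is long enough for the block with the most outlier tokens, which is why a single prefix can serve the whole model.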
Experiments
- Settings. Weights use per-channel symmetric quantization; the KV cache uses per-head symmetric static quantization for 4-bit and per-tensor symmetric static quantization for 8-bit; activations use per-tensor static quantization. The calibration set is 8 Pile samples with a sequence length of 1024, and the initial scales are found by grid search. The fine-tuning set is 512 samples from Pile with a context length of 1024.
- Comparison Results.
- Results on weight-only quantization.
- Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in these speed measurements (it saves memory footprint through more computation overhead and only achieves speedup with large batch sizes)
- Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens.
(3) Content of Prefixed Tokens.
- Quantization Time.
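The Settings entry mentions per-tensor static activation quantization with a grid-searched initial scale. The toy sketch below shows why calibrating without outlier tokens gives a smaller scale and lower error on ordinary tokens. It is a minimal NumPy illustration on synthetic data, not the paper's code: the function names, the search grid, and the outlier magnitude of 1000 are assumptions chosen for the example.

```python
import numpy as np


def quantize_symmetric(x, scale, bits=4):
    """Fake-quantize x with a fixed (static) per-tensor symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


def grid_search_scale(calib, bits=4, steps=100):
    """Pick the scale with the lowest MSE on calib among fractions of max|x| / qmax."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(calib).max() / qmax
    best_scale, best_err = base, np.inf
    for ratio in np.linspace(0.01, 1.0, steps):
        scale = base * ratio
        err = np.mean((quantize_symmetric(calib, scale, bits) - calib) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale


rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(256, 64))  # ordinary tokens
outliers = rng.normal(0.0, 1.0, size=(2, 64))  # two outlier tokens
outliers[:, 0] = 1000.0                        # hypothetical massive outlier

# Calibrate once with the outlier tokens included and once without them.
scale_with = grid_search_scale(np.concatenate([normal, outliers]))
scale_without = grid_search_scale(normal)

# Compare the error on ordinary tokens only under each static scale.
err_with = np.mean((quantize_symmetric(normal, scale_with) - normal) ** 2)
err_without = np.mean((quantize_symmetric(normal, scale_without) - normal) ** 2)
print(f"scale with outliers:    {scale_with:.3f}, error {err_with:.4f}")
print(f"scale without outliers: {scale_without:.3f}, error {err_without:.4f}")
```

In this toy setup the scale calibrated with outliers is so large that most ordinary activations round to zero, which is the failure mode that keeping outlier tokens out of calibration avoids.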
References
- Chen, Mengzhao, et al. “PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.” arXiv preprint arXiv:2410.05265 (2024).
- code: https://github.com/chenmnz/prefixquant
Published: 2025-08-13