LoRA: Low-Rank Adaptation of Large Language Models (Brief Notes)

Previously we looked at Adapters and Prompting, both lightweight training methods, i.e. so-called lightweight fine-tuning. Today we look at another lightweight approach to training large language models:

LoRA: Low-Rank Adaptation of Large Language Models

First, the problem with fine-tuning large-scale language models:

An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.

Problems with Existing Approaches

To address the cost of fine-tuning large language models, several approaches already exist, such as partial fine-tuning, adapters, and prompting. These methods have the following issues:

  • Adapters introduce extra inference latency (because they add layers)
  • Prefix-Tuning is hard to train
  • Model quality falls short of full fine-tuning

Adapters Introduce Inference Latency

Adding layers to the model obviously increases inference time:

While one can reduce the overall latency by pruning layers or exploiting multi-task settings, there is no direct ways to bypass the extra compute in adapter layers.

[Figure: Adapter extra inference latency]

The figure above shows that for online serving with batch size 1 and short sequence lengths, the relative increase in inference latency is more pronounced. Personally, though, I think the difference in absolute latency is small. :-)

Prefix-Tuning Is Hard to Train

Compared with Prefix-Tuning, which is difficult to optimize, LoRA is easier to train:

We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper.

Model Quality Falls Short of Full Fine-Tuning

Reserving part of the sequence for adaptation leaves fewer tokens available for the downstream task, which hurts model quality to some extent:

More fundamentally, reserving a part of the sequence length for adaptation necessarily reduces the sequence length available to process a downstream task, which we suspect makes tuning the prompt less performant compared to other methods.

LoRA

First, the motivation behind LoRA:

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension.

We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.

Although these models have a huge number of parameters, they effectively reside on a low intrinsic dimension. The hypothesis is that the weight change during adaptation likewise has a low intrinsic rank, which leads to the proposed Low-Rank Adaptation (LoRA).

LoRA Model

The idea behind LoRA is simple: add a bypass branch alongside the original pretrained language model (PLM) that performs a down-projection followed by an up-projection, modeling the assumed intrinsic rank. During training, the PLM's parameters are frozen and only the down-projection matrix A and the up-projection matrix B are trained. The model's input and output dimensions stay unchanged, and at the output BA is added to the PLM's weights. A is initialized from a random Gaussian and B with zeros, so the bypass is still a zero matrix at the start of training.

Concretely, let the pretrained weight matrix be \(W_0 \in \mathbb{R}^{d \times k}\); its update can be expressed as:

\[W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}\]

where the rank \(r \ll \min(d, k)\).
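As a minimal sketch (my own PyTorch illustration, not the official loralib code; the class name, the default r, and the alpha/r scaling are assumptions), the bypass can be written as a wrapper around a frozen nn.Linear:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank bypass BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W0 (and bias)
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # A (r x k): random Gaussian init; B (d x r): zero init,
        # so BA = 0 and training starts exactly at W0.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (BA) x ; input/output dimensions are unchanged
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```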

This idea is somewhat reminiscent of a residual connection, and the bypass update is used to approximate the full fine-tuning process. Moreover, full fine-tuning can be seen as a special case of LoRA (when the LoRA rank r is set to the rank of the pretrained weight matrices):

This means that when applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank r to the rank of the pre-trained weight matrices.

In other words, as we increase the number of trainable parameters, training LoRA roughly converges to training the original model, while adapter-based methods converges to an MLP and prefix-based methods to a model that cannot take long input sequences.

LoRA also introduces almost no additional inference latency: the bypass can simply be merged into the weights by computing \(W = W_0 + BA\).
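With the sketch above, this merge is a single in-place addition on the pretrained weight (names carried over from the illustrative LoRALinear):

```python
# Fold the bypass into W0 before deployment; afterwards the layer behaves like a
# plain nn.Linear and adds no extra compute at inference time.
with torch.no_grad():
    lora_layer.base.weight += lora_layer.scaling * (lora_layer.B @ lora_layer.A)
```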

Combining LoRA with a Transformer is also simple: the bypass is added only to the attention weight matrices (the Q/K/V projections), leaving the MLP modules untouched:

We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency.
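A sketch of that restriction, continuing the illustrative LoRALinear above (model and attribute names such as transformer_blocks, attn.q_proj, and attn.v_proj are assumptions and differ between codebases):

```python
# Freeze the entire pretrained model, MLP modules included.
for p in model.parameters():
    p.requires_grad_(False)

# Add the low-rank bypass only to attention projections; the paper's experiments
# mostly adapt the query and value projections.
for block in model.transformer_blocks:
    block.attn.q_proj = LoRALinear(block.attn.q_proj, r=8)
    block.attn.v_proj = LoRALinear(block.attn.v_proj, r=8)

# Only the A and B matrices inside each LoRALinear remain trainable.
```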

To sum up: building on the intrinsically low-rank nature of large models, LoRA adds low-rank bypass matrices to approximate full fine-tuning, making it a simple and effective way to achieve lightweight fine-tuning.