15 Matching Annotations
  1. Oct 2025
    1. There is agreement that LoRA underperforms in settings that resemble pre-training (see LoRA Learns Less and Forgets Less, Biderman et al., 2024), namely those with very large datasets that exceed the storage limits of LoRA parameters. But for dataset sizes that are typical in post-training, LoRA has sufficient capacity to store the essential information. However, this fact makes no guarantees regarding sample efficiency and compute efficiency. The question is: can LoRA match the performance of full fine-tuning, and if so, under which conditions?

      Feels like this can be explained by the Linear Direction Hypothesis: each direction encodes a "latent" or some form of "concept". Thus, when we use LoRA it's like doing PCA and identifying the most relevant directions for this dataset and encoding them in the model, making the model more "sensitive" to them. When the dataset size is larger than the rank, we can't quite encode all the necessary information.

    2. [Figure] LoRA training curves for various ranks on Tulu3, plotted at each training step.

      high-rank LoRA and FullFT have identical curves. Medium/low rank fall off when they hit capacity limits. This is the visual proof of the main claim.

    3. We investigated the effects of applying LoRA to different layers in the network. The original paper by Hu et al. recommended applying LoRA only to the attention matrices, and many subsequent papers followed suit, though a recent trend has been to apply it to all layers. (Similar to our results, the QLoRA paper also found that attention-only LoRA performed worse than MLP or MLP+attention LoRA, though they found that MLP+attention > MLP > attention, whereas we found the first two to be roughly equal.) Indeed, we achieved far better results when applying LoRA to all layers, in particular the MLP (including MoE) layers. In fact, applying LoRA to the attention matrices shows no additional benefits beyond applying it to the MLPs only.

      contradicts common practice of attention-only LoRA!
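      A minimal sketch of how the all-layers setup might look with the Hugging Face peft library (the checkpoint name and ranks are placeholders; the target-module names assume a Llama-style model):

      ```python
      # Hedged sketch: module names assume a Llama-style architecture; adjust for your model.
      from peft import LoraConfig, get_peft_model
      from transformers import AutoModelForCausalLM

      model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder checkpoint

      # Attention-only LoRA (the original Hu et al. recipe) -- shown for contrast; underperforms per the article.
      attn_only = LoraConfig(r=64, lora_alpha=64,
                             target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

      # LoRA on all weight matrices, including the MLP layers -- what the article recommends.
      all_layers = LoraConfig(r=64, lora_alpha=64,
                              target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                              "gate_proj", "up_proj", "down_proj"])

      model = get_peft_model(model, all_layers)
      model.print_trainable_parameters()
      ```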

    4. We find that the optimal learning rate for FullFT is lower by a factor of 10 than for high-rank LoRAs. (See Biderman et al. (2024), Figure S1, for an experiment with sampling evals, which finds a similar 10x ratio.) We'll return to this in our discussion of LoRA hyperparameters later on. The optimal LR seems to be similar for all the LoRA runs across different ranks; we give a theoretical explanation for this finding below. However, there does seem to be some rank dependence, with lower optimal LR for rank=1 than for higher-rank LoRAs. The optimal LR changes by a factor of less than 2 between rank=4 and rank=512.

      Key practical finding: multiply the FullFT learning rate by 10 when switching to LoRA, at least until a better method is identified...
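      As a rough heuristic read off the article's numbers (my extrapolation, not a formula from the post):

      ```python
      # Hedged heuristic: LoRA optimal LR ~= 10x the tuned FullFT LR, roughly rank-independent for rank >= 4.
      fullft_lr = 2e-5           # assumed: a learning rate already tuned for full fine-tuning
      lora_lr = 10 * fullft_lr   # ~2e-4 for long runs; the article suggests ~15x for very short runs (<~100 steps)
      ```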

    5. A key finding from our experiments is that LoRA fully matches the learning performance of FullFT when running policy gradient algorithms for reinforcement learning, even with ranks as low as 1. For these experiments, we used a basic policy gradient algorithm with an importance sampling correction: $\text{objective} = \sum_t \frac{p_{\text{learner}}}{p_{\text{sampler}}} \mathrm{Adv}_t$ (see Your Efficient RL Framework Secretly Brings You Off-Policy RL Training). We used a GRPO-like centering scheme (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Shao et al., 2024) where we sample multiple completions per problem and subtract the mean reward per group.

      Rank 1 is enough for LoRA when doing RL! Huge computational benefit!
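      A toy sketch of that objective (hypothetical tensor names; per-prompt mean centering as in GRPO; not the authors' code):

      ```python
      import torch

      def policy_gradient_loss(logp_learner, logp_sampler, rewards, group_ids):
          """Importance-sampled policy gradient with GRPO-like mean centering.

          logp_learner, logp_sampler: [num_completions, seq_len] per-token log-probs
          rewards: [num_completions] scalar reward per completion
          group_ids: [num_completions] which prompt each completion was sampled for
          """
          # Advantage: reward minus the mean reward of completions for the same prompt.
          adv = torch.empty_like(rewards)
          for g in group_ids.unique():
              mask = group_ids == g
              adv[mask] = rewards[mask] - rewards[mask].mean()

          # Per-token importance weight p_learner / p_sampler (sampler log-probs are constants).
          ratio = torch.exp(logp_learner - logp_sampler.detach())

          # objective = sum_t (p_learner / p_sampler) * Adv_t; negate to get a loss to minimize.
          return -(ratio * adv[:, None]).mean()
      ```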

    6. To speed up experiments, we restricted the samples to a length of 8192 tokens for training and evaluation. This sample length allows for backtracking and reasoning but limits the performance, relative to longer chain-of-thought

      Not just matching metrics: behaviors match too, further strengthening the claim that LoRA is as robust as FullFT.

    7. 1. For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
       2. For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity and dataset size.
       3. In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.
       4. Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
       5. LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.

      For 2: the fact that training efficiency gets worse seems to indicate that we keep refining the combination of components we encode to better fit the dataset, but still hit the fundamental limitation of not having enough directions. For 3: this is not well understood or explained; there is only the throwaway line about the product-of-matrices parametrization, with no further explanation.

    8. In supervised learning, we measured log loss rather than employing sampling-based evals, with the same goal of generality in mind. Log loss measurement gives clean results and scaling laws over ranges of training steps and training parameters.

      Rather than focusing on specific samples, this paper looks for generalizable patterns

    9. Learning rates in short and long runs: The typical initialization of LoRA creates an implicit schedule of change in the effective learning rate. This leads to differences between short and long training runs, and some differences in the shape of learning curves compared to FullFT. At the start of training, B is initialized to zero. While B is very small, changes in A have negligible effects on the adapter BA which is added to the original network weights. As B grows larger, updates to A start to have a bigger impact on the network outputs, with the effective learning rate increasing over the course of training as B approaches A in scale. We found that by the end of the full training runs on the Tulu3 and OpenThoughts datasets, the B matrices ended up with larger spectral norms than the A matrices. This implies that the optimal LR should be set higher for shorter training runs. Preliminary evidence suggests an optimal multiplier around 15x over the FullFT LR for short runs (based on anecdotal evidence, the higher multiplier is effective under ~100 steps or so), converging to the aforementioned 10x multiplier for longer runs.

      The LR may be a product of the initialization more so than the inherent properties of the two methods.
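      A minimal sketch of the initialization being described (my own illustration, not the authors' code):

      ```python
      import torch
      import torch.nn as nn

      class LoRAAdapter(nn.Module):
          """Effective weight = W_frozen + (alpha / r) * B @ A, with the standard init."""
          def __init__(self, d_out, d_in, r=8, alpha=32):
              super().__init__()
              self.A = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)  # small random init
              self.B = nn.Parameter(torch.zeros(d_out, r))               # zero init: adapter starts as a no-op
              self.scale = alpha / r

          def delta(self):
              # While B is near zero, updates to A barely move B @ A, so the effective
              # learning rate on the adapter grows as B picks up magnitude during training.
              return self.scale * (self.B @ self.A)
      ```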

    10. 1. LoRA is applied to all layers of the network, especially the MLP/MoE layers which house most of the parameters.
        2. LoRA works well when not capacity constrained, i.e., the number of trainable parameters exceeds the amount of information to be learned, which can be estimated in terms of dataset size.

        When (1) is satisfied, we get similar learning dynamics to FullFT at the very start of training. Then, as per (2), LoRA continues to look like FullFT until we start reaching capacity limits.

      Remember the "bits" argument from before: roughly O(1) bits per episode for RL, O(tokens) for SFT.

    11. Past work (Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws; Allen-Zhu and Li, 2024) has shown that neural networks can store 2 bits per parameter.

      Rough rule of thumb for capacity. Helps estimate whether a dataset will fit in a given rank.
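      A back-of-the-envelope sketch of adapter capacity under that rule (all dimensions below are placeholder assumptions):

      ```python
      # Hypothetical Llama-like dimensions, for illustration only.
      d_model, n_layers, rank = 4096, 32, 64

      # Rough count: 7 adapted matrices per layer (4 attention + 3 MLP), each contributing
      # about rank * (d_in + d_out) LoRA parameters; we crudely treat every matrix as d_model x d_model.
      lora_params = n_layers * 7 * rank * (2 * d_model)

      capacity_bits = 2 * lora_params  # ~2 bits per parameter (Allen-Zhu & Li, 2024)
      print(f"{lora_params / 1e6:.0f}M LoRA params, ~{capacity_bits / 1e6:.0f} Mbit of storable information")
      ```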

    12. One classic observation is that when minimizing log-loss, the total log-loss measured during the first epoch of training provides a measurement of the dataset’s description length. That is, an upper bound for the number of bits required to memorize the dataset. LLM datasets usually have a loss of around 1 bit (0.69 nats) per token, depending on dataset and model size.

      This is the way to estimate information content of your dataset. ~1 bit/token for typical LLM data
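      Continuing the sketch: estimate the dataset's description length and compare it to the adapter capacity above (the ~1 bit/token figure is the article's rough number, not a measurement):

      ```python
      import math

      tokens = 500_000_000             # assumed dataset size, in tokens
      loss_nats_per_token = 0.69       # ~1 bit/token, typical for LLM data per the article
      dataset_bits = tokens * loss_nats_per_token / math.log(2)

      capacity_bits = 2 * 117_000_000  # ~2 bits/param for the ~117M-parameter adapter sketched above
      print(f"dataset ~{dataset_bits / 1e6:.0f} Mbit vs adapter ~{capacity_bits / 1e6:.0f} Mbit")
      print("likely fits in LoRA capacity" if dataset_bits < capacity_bits else "likely exceeds LoRA capacity")
      ```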

    13. Open questions: There are several questions related to our results that we would love to see investigated in the future:
        1. Sharpening our predictions of LoRA performance and the precise conditions under which it matches full fine-tuning. We have roughly characterized the regime of equal performance and can estimate the required capacity in terms of tokens or episodes, but we can’t yet make accurate forecasts.
        2. Our theoretical understanding of LoRA learning rates and training dynamics is limited. A fuller theory that explains the ratio between LoRA and FullFT learning rates would be valuable.
        3. How do LoRA variants such as PiSSA (PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models; Meng, Wang & Zhang, 2024) perform when measured according to the methodology in this article?
        4. There are various options for applying LoRA to MoE layers. LoRA users would benefit from an investigation into how well they perform, and how compatible each approach is with methods like tensor parallelism and expert parallelism that are important for large MoE models.

      The capacity predictions are still rough; would want a way to estimate the rank needed given the dataset and model sizes. Would also want a better understanding of the LR beyond "10x the FullFT one" + how do you even find the FullFT LR in the first place?

    14. The total number of multiply-adds is $2N^2 + 6NR$. With $R \ll N$, this is slightly more than $\frac{2}{3}$ of $3N^2$. If we plotted LoRA performance over FLOPs (this analysis omits FLOPs used for attention, which could be significant in long-context settings) instead of training steps, it would show a clear advantage over FullFT.

      Beyond the memory and adapter benefits, there are also computational benefits in training.
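      A quick sanity check of that arithmetic (N and R below are placeholder values):

      ```python
      N, R = 4096, 64                        # placeholder weight-matrix dimension and LoRA rank
      fullft_madds = 3 * N ** 2              # dense N x N matrix: forward, activation-grad, and weight-grad passes
      lora_madds = 2 * N ** 2 + 6 * N * R    # frozen weights skip the weight-grad pass; adapters add ~6NR
      print(f"LoRA / FullFT multiply-adds per token: {lora_madds / fullft_madds:.3f}")  # slightly above 2/3 for R << N
      ```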

    15. It seems wasteful to use a terabit of weights to represent updates from a gigabit or megabit of training data. This intuition has motivated parameter efficient fine-tuning (PEFT), which adjusts a large network by updating a much smaller set of parameters.

      This is the core of the paper: there is a mismatch between dataset size and model size whenever we do post-training.