There is agreement that LoRA underperforms in settings that resemble pre-training (LoRA Learns Less and Forgets Less, Biderman et al., 2024), namely those with very large datasets that exceed the storage capacity of the LoRA parameters. For dataset sizes typical of post-training, however, LoRA has sufficient capacity to store the essential information. That alone makes no guarantees about sample efficiency or compute efficiency. The question is: can LoRA match the performance of full fine-tuning, and if so, under which conditions?
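To make the "storage limits" point concrete, here is a minimal sketch comparing trainable parameter counts under LoRA versus full fine-tuning. The layer shapes are illustrative (roughly 7B-class attention/MLP projections) and the rank is an assumption, not a number from the paper:

```python
# Rough capacity comparison: trainable parameters under LoRA vs. full fine-tuning.
# Shapes are illustrative (7B-class model); rank=16 is an assumed choice.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

def full_params(d_in: int, d_out: int) -> int:
    return d_in * d_out

d_model, d_ff, rank, layers = 4096, 11008, 16, 32

# Adapt the four attention projections and three MLP projections per block.
per_block_lora = (
    4 * lora_params(d_model, d_model, rank)
    + 2 * lora_params(d_model, d_ff, rank)
    + 1 * lora_params(d_ff, d_model, rank)
)
per_block_full = 4 * full_params(d_model, d_model) + 3 * full_params(d_model, d_ff)

print(f"LoRA trainable params: {layers * per_block_lora / 1e6:.1f}M")
print(f"Full fine-tune params: {layers * per_block_full / 1e9:.1f}B")
```

At these (assumed) shapes LoRA trains on the order of tens of millions of parameters versus billions for full fine-tuning, which is why a sufficiently large dataset can outgrow what the adapters can absorb, while typical post-training datasets cannot.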
This feels like it can be explained by the linear representation hypothesis: each direction encodes a latent, some form of "concept". Using LoRA is then a bit like doing PCA, identifying the most relevant directions for this dataset and encoding them in the model, making the model more "sensitive" to them. When the dataset carries more information than the rank can hold, we can't quite encode all the necessary directions.
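A toy illustration of that intuition (not from the paper): treat a full fine-tuning update delta_W as a bundle of directions and ask how much of it the best rank-r approximation retains, which is an upper bound on what a rank-r adapter like LoRA's B @ A could represent. The synthetic delta_W and its decaying spectrum are assumptions made purely for illustration:

```python
# Sketch: how much of a (synthetic) full fine-tuning update survives at rank r?
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 16

# Synthetic update whose energy decays across directions (top few dominate).
U, _ = np.linalg.qr(rng.normal(size=(d_out, d_out)))
V, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))
spectrum = 1.0 / np.arange(1, d_in + 1)      # assumed decaying singular values
delta_W = (U * spectrum) @ V.T

# Best rank-r approximation via truncated SVD: the most any rank-r update,
# LoRA's included, could possibly capture of this particular delta_W.
u, s, vt = np.linalg.svd(delta_W, full_matrices=False)
delta_W_r = (u[:, :r] * s[:r]) @ vt[:r]

captured = 1 - np.linalg.norm(delta_W - delta_W_r) ** 2 / np.linalg.norm(delta_W) ** 2
print(f"Fraction of update energy captured at rank {r}: {captured:.2%}")
```

If the update's energy really is concentrated in a few directions, a small rank captures nearly all of it; if the dataset demands many independent directions, the truncation loses information, which mirrors the capacity argument above.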