Researchers from Meta FAIR, Cornell, and CMU achieved 91.8% accuracy on GSM8K math problems using just 13 trainable parameters — totaling 26 bytes — on Qwen2.5-7B. Their TinyLoRA method builds on LoRA-XS but replaces trainable matrices with a shared low-dimensional vector projected through fixed random tensors, allowing parameter counts to scale down from millions to single digits through aggressive weight sharing.
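The core trick can be sketched in a few lines of NumPy. This is a toy reconstruction under stated assumptions (a single 64-dim weight matrix, rank r=2, a Gaussian random projection P, illustrative sizes throughout), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, k = 64, 2, 13   # hidden size, SVD rank, trainable param count (illustrative)

W = rng.normal(size=(d, d))         # frozen pretrained weight
U, s, Vt = np.linalg.svd(W)         # pre-computed SVD, as in LoRA-XS
U_r, V_r = U[:, :r], Vt[:r, :]      # top-r singular subspace, frozen

v = rng.normal(size=k) * 0.01       # the ONLY trainable parameters, shared across layers
P = rng.normal(size=(r * r, k))     # fixed random projection tensor, never trained

R = (P @ v).reshape(r, r)           # r x r update core induced by v
W_adapted = W + U_r @ R @ V_r       # adapted weight; only v would receive gradients
```

Because P is fixed and v is shared, adding more layers adds zero trainable parameters; the whole adapter stays at k numbers.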

The breakthrough isn't just the parameter efficiency: it's that reinforcement learning fundamentally changes the game at extreme scales. Where supervised fine-tuning forces the model to imitate noisy human demonstrations token by token, RL with Group Relative Policy Optimization (GRPO) delivers a cleaner signal, a binary reward for final-answer correctness that is indifferent to style. The researchers found RL needs 100-1,000x fewer trainable parameters than SFT to reach equivalent performance, suggesting much of conventional fine-tuning capacity goes to learning formatting rather than reasoning.
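GRPO's group-relative signal is easy to illustrate: sample several answers per problem, reward each 0 or 1 for correctness, and normalize rewards within the group. This is a sketch of the standard GRPO advantage computation only; the clipped policy loss and KL term are omitted:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sample's reward normalized
    against its own group's mean and std (no learned value network)."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0:                  # all answers equally right or wrong: no signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Binary correctness rewards for 4 sampled answers to one math problem
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# correct answers get positive advantage, incorrect negative
```

With binary rewards the signal is purely about which answers were right, which is exactly why style and formatting drop out of the gradient.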

This work directly challenges the assumption that meaningful model adaptation requires millions of parameters. Standard LoRA on Llama3-8B needs roughly 3 million parameters at minimum, which makes TinyLoRA's sub-100-parameter updates a genuine paradigm shift. The best-performing configuration uses SVD rank r=2 with parameter sharing across layers, though the method's reliance on a pre-computed SVD of each weight matrix adds setup cost.
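The gap is easy to put in numbers. Here is a rough back-of-envelope for a Llama3-8B-like model (hidden size 4096, 32 layers, adapting only the four attention projections, and ignoring grouped-query attention, which shrinks the k/v matrices):

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds two matrices per adapted weight: A (d_in x r) and B (r x d_out)
    return rank * (d_in + d_out)

# Rank-1 LoRA on q/k/v/o projections of a 4096-wide, 32-layer model
per_layer = 4 * lora_params(4096, 4096, rank=1)
total = 32 * per_layer
print(total)  # → 1048576 (about 1M parameters even at rank 1)
```

Even this floor sits five orders of magnitude above 13 parameters, which is why the comparison is a difference in kind rather than degree.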

For developers, this could dramatically reduce fine-tuning costs and enable personalization at unprecedented scales. If a single parameter can meaningfully adapt model behavior, we're looking at potential deployment scenarios where thousands of task-specific variants cost almost nothing to store and serve. The catch: you need RL infrastructure, not just supervised training pipelines.
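The storage side of that claim is simple arithmetic. Assuming bf16 parameters (2 bytes each), the 13-parameter configuration, and a hypothetical fleet of 100,000 per-user variants:

```python
bf16_bytes = 2
variant_bytes = 13 * bf16_bytes          # 26 bytes per fine-tuned variant
fleet_bytes = 100_000 * variant_bytes
print(fleet_bytes)  # → 2600000, i.e. 2.6 MB for 100k personalized variants
```

For comparison, 100,000 conventional ~3M-parameter LoRA adapters at the same precision would need roughly 600 GB, so the serving-cost argument holds up even before considering load times.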