Why? Gaussian noise in A breaks symmetry, while initializing B to zeros makes the initial update ΔW = B·A exactly 0. This prevents disruptive early updates and ensures the model behaves exactly like the base model at the start of training.
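The initialization can be sketched in a few lines of plain Python. This is only an illustration: the helper names `matmul` and `init_lora`, the toy dimensions, and the standard deviation of 0.02 are assumptions, not values taken from the text above.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def init_lora(d_out, d_in, r, std=0.02, seed=0):
    """Return (A, B): A is r x d_in Gaussian noise, B is d_out x r zeros.

    std=0.02 is an assumed value chosen for illustration only.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B

A, B = init_lora(d_out=8, d_in=8, r=2)
delta_W = matmul(B, A)  # d_out x d_in update contributed by the adapter

# Every entry is zero because B is all zeros, so W0 + delta_W equals W0.
is_zero = all(v == 0.0 for row in delta_W for v in row)
print(is_zero)  # True
```

Because B is zero, the product is zero no matter what noise A holds, so the first forward pass matches the base model.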
2. Training (Backward Pass)
Original weights (frozen): no update from ∂L/∂W0
Adapter weights (trainable)
With ΔW = B·A:
∂L/∂A = Bᵀ·(∂L/∂ΔW)
∂L/∂B = (∂L/∂ΔW)·Aᵀ
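A quick numerical check can make the adapter gradients concrete. The sketch below assumes ΔW = B·A with B of shape d×r and A of shape r×k, and uses a toy linear loss so that ∂L/∂ΔW is a known matrix G. With those shapes, the gradient for A is Bᵀ·G (r×k) and the gradient for B is G·Aᵀ (d×r). The loss, the dimensions, and all helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def loss(A, B, G):
    """Toy loss L = sum(G * (B @ A)), so dL/d(delta_W) is exactly G."""
    dW = matmul(B, A)
    return sum(g * w for g_row, w_row in zip(G, dW) for g, w in zip(g_row, w_row))

rng = random.Random(0)
d, k, r = 4, 5, 2  # assumed toy sizes
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # r x k
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # d x r
G = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # d x k, plays dL/d(delta_W)

grad_A = matmul(transpose(B), G)  # r x k, same shape as A
grad_B = matmul(G, transpose(A))  # d x r, same shape as B

def numeric_grad(M, i, j, eps=1e-6):
    """Central finite difference of the loss with respect to M[i][j]."""
    old = M[i][j]
    M[i][j] = old + eps
    up = loss(A, B, G)
    M[i][j] = old - eps
    down = loss(A, B, G)
    M[i][j] = old  # restore the original entry
    return (up - down) / (2 * eps)

err_A = abs(numeric_grad(A, 1, 3) - grad_A[1][3])
err_B = abs(numeric_grad(B, 2, 0) - grad_B[2][0])
print(err_A < 1e-6, err_B < 1e-6)
```

Each gradient has the same shape as the matrix it updates, which is a handy sanity check on the order of the products.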
Optimizer states (Adam momentum/variance) are only tracked for A and B, vastly reducing VRAM usage.
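To get a feel for the savings, the arithmetic below counts Adam state values for one layer. It assumes a 1024×1024 weight matrix with rank r = 16, and that Adam stores two values (momentum and variance) per trainable parameter. The function name is hypothetical.

```python
def adam_state_values(num_trainable_params):
    """Adam keeps two running statistics per trainable parameter."""
    return 2 * num_trainable_params

d_out, d_in, r = 1024, 1024, 16  # assumed layer size and rank

full_params = d_out * d_in            # training W0 directly
lora_params = r * d_in + d_out * r    # training only A (r x d_in) and B (d_out x r)

full_state = adam_state_values(full_params)
lora_state = adam_state_values(lora_params)
ratio = full_state // lora_state

print(full_state, lora_state, ratio)  # 2097152 65536 32
```

Under these assumptions the optimizer state for this layer shrinks by a factor of 32; the exact ratio depends on the layer dimensions and the chosen rank.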
3. Adapter Output & Inference Merging
Saved Adapter (Disk)
adapter_config.json {r, alpha}
lora_A.weight [16, 1024]
lora_B.weight [1024, 16]
Size: ~0.06 MB (fp16)
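The size figure follows directly from the two tensor shapes. The short calculation below assumes 2 bytes per fp16 value and ignores the config file and any file-format overhead.

```python
# Tensor shapes as listed for the saved adapter.
shapes = {"lora_A.weight": (16, 1024), "lora_B.weight": (1024, 16)}
bytes_per_value = 2  # fp16

num_values = sum(rows * cols for rows, cols in shapes.values())
size_bytes = num_values * bytes_per_value
size_mib = size_bytes / (1024 * 1024)

print(num_values, size_bytes, size_mib)  # 32768 65536 0.0625
```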
Weight Merging (Inference)
W_new = W0 + (α/r)·B·A
• Adapter matrices are multiplied together and scaled.
• Result is permanently added to base weights.
• Result: Zero latency overhead during production inference.
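The merge step can be sketched as follows. The example folds the scaled product (α/r)·B·A into W0 and checks that a single matrix multiply with the merged weights gives the same output as running the base path and the adapter path separately. The sizes, the value of α, and the helper names are assumptions for illustration.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]

rng = random.Random(0)
d_out, d_in, r, alpha = 6, 5, 2, 4  # assumed toy values

W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]  # d_out x r
x = [[rng.gauss(0, 1)] for _ in range(d_in)]                     # column vector

# Unmerged: base path plus scaled adapter path (two extra matmuls).
base = matmul(W0, x)
adapter = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base, adapter)]

# Merged: one matmul with the combined weights.
W_new = merge(W0, A, B, alpha, r)
y_merged = matmul(W_new, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_merged, y_unmerged))
print(max_diff < 1e-9)
```

After merging, W_new has the same shape as W0, so inference runs the same single matrix multiply as the base model.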