The RL algorithm behind DeepSeek-R1, first introduced in the DeepSeekMath paper. Instead of training a separate value network as PPO does, GRPO (Group Relative Policy Optimization) samples a group of G completions for each prompt and uses the group's mean reward as the baseline. Each completion's advantage is simply (reward − group mean) / group std. With no critic to train, GRPO carries roughly half the trainable parameters of PPO at comparable quality, and it is the trick that made R1's reasoning traces possible.
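The advantage described above fits in a few lines. This is a minimal sketch, not DeepSeek's code: the function name, the `eps` guard, and the choice of population standard deviation (NumPy's default) are assumptions made for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's G rewards to zero mean and (roughly) unit std."""
    r = np.asarray(rewards, dtype=np.float64)
    # eps (an assumption, not in the text) avoids division by zero
    # when every completion in the group receives the same reward.
    return (r - r.mean()) / (r.std() + eps)

# Four completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. If every completion in a group scores the same, all advantages are zero and that prompt contributes no policy gradient.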
L_GRPO = −(1/G) · Σᵢ min( ratioᵢ · Aᵢ , clip(ratioᵢ, 1−ε, 1+ε) · Aᵢ ) + β · D_KL( πθ ‖ πref )

where:
  ratioᵢ = πθ(oᵢ|s) / πθ_old(oᵢ|s)    # importance weight
  Aᵢ     = (rᵢ − μ) / σ               # group-relative advantage ← the trick
  μ      = mean(r₁ … r_G)
  σ      = std(r₁ … r_G)
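To make the formula concrete, here is a sketch of the loss for a single group, written at the sequence level exactly as the formula reads. Practical implementations typically work per token and estimate the KL term from samples. Here the KL is passed in as a precomputed scalar, and the function name, argument names, and the default values of `clip_eps` and `beta` are illustrative assumptions.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, beta=0.04):
    """Sequence-level GRPO loss for one group of G completions.

    logp_new, logp_old: log-probabilities of each completion under the
        current policy and the policy that sampled it, shape (G,).
    advantages: group-relative advantages A_i, shape (G,).
    kl_to_ref: precomputed scalar estimate of D_KL(pi_theta || pi_ref).
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)

    ratio = np.exp(logp_new - logp_old)  # importance weight
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = np.minimum(unclipped, clipped).mean()  # (1/G) * sum_i min(...)

    # Minimizing the loss maximizes the surrogate and penalizes drift from pi_ref.
    return -surrogate + beta * kl_to_ref

# A completion with positive advantage whose probability doubled:
# the ratio of 2.0 is clipped to 1 + clip_eps before it is rewarded.
print(grpo_loss([np.log(2.0)], [0.0], [1.0], kl_to_ref=0.0))
```

The `min` makes the clipping one-sided: gains from pushing the ratio past 1 ± ε are capped, but a completion with negative advantage whose probability went up is still penalized at the full, unclipped ratio.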
The critic is a luxury you can't afford. A PPO value network is roughly the size of the policy (often initialized as a copy of the LLM). For a 70B-parameter policy, that means ~140B trainable parameters resident on the GPUs during training, before counting the frozen reference model. GRPO drops the critic entirely.
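A back-of-the-envelope version of that arithmetic, as a rough sketch. It counts weights only and assumes 2 bytes per parameter (bf16/fp16 storage), which is an assumption rather than a figure from the text. Gradients, optimizer state, activations, and the reference model all add to the real footprint.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Weights-only footprint in GB; bytes_per_param=2 assumes bf16/fp16 storage."""
    # (params_billions * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billions * bytes_per_param

policy_b = 70                    # 70B-parameter policy, as in the text
ppo_trainable_b = policy_b * 2   # policy + same-size critic
grpo_trainable_b = policy_b      # policy only

print("PPO :", ppo_trainable_b, "B params,", weight_memory_gb(ppo_trainable_b), "GB of weights")
print("GRPO:", grpo_trainable_b, "B params,", weight_memory_gb(grpo_trainable_b), "GB of weights")
```

Under these assumptions, dropping the critic halves the trainable weights from 280 GB to 140 GB, and the saving grows once gradients and optimizer state for the critic are counted too.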
Group baseline = sample-mean Monte Carlo estimate. The whole purpose of the critic is to estimate E[r | s], the expected reward from a state (here, the prompt). If you sample G completions and average their rewards, you get an unbiased estimate of that expectation without a learned value function. The variance is higher and shrinks only as 1/G, but so what? The samples are free: you're already paying for G completions per prompt to do the policy update.
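A quick simulation illustrates both halves of that claim. It assumes a hypothetical prompt with a binary reward and a made-up success probability of 0.3; the group size and number of repetitions are also arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.3     # hypothetical true E[r | s] for one prompt
G = 16              # completions per group (illustrative)
n_groups = 20_000   # repeat the experiment to study the estimator itself

# Each row is one group of G binary rewards for the same prompt.
rewards = rng.binomial(1, p_correct, size=(n_groups, G)).astype(np.float64)
baselines = rewards.mean(axis=1)  # the group-mean baseline, once per group

print("mean of baselines:    ", baselines.mean())
print("variance of baselines:", baselines.var())
print("theory p(1-p)/G:      ", p_correct * (1 - p_correct) / G)
```

Averaged over many groups, the baseline lands on the true expected reward (unbiased), while its spread for any single group matches the p(1−p)/G variance of a sample mean. A larger G buys a less noisy baseline.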
Why this enabled R1. DeepSeek showed that with GRPO plus a simple rule-based reward (e.g., "is this final math answer correct?"), a base LLM can be RL-trained without any SFT initialization and still develop emergent reasoning chains. The "aha moment" example in the DeepSeek-R1 paper shows the R1-Zero model spontaneously learning to backtrack and reflect under pure RL, with no human reasoning demonstrations. The lower training cost from GRPO helped make the experiment economically viable.
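A rule-based reward of the "is this final math answer correct?" kind can be very small. The sketch below is hypothetical, not DeepSeek's actual reward code. It assumes the model is asked to put its final answer in a `\boxed{...}` expression and that exact string match is good enough; the function name is also made up.

```python
import re

def final_answer_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the last boxed answer matches gold_answer exactly, else 0.0."""
    # Assumed convention: the final answer appears as \boxed{...} with no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer, no reward
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

trace = r"First I get \boxed{41}. Wait, let me recheck. The answer is \boxed{42}."
print(final_answer_reward(trace, "42"))
```

Because only the last boxed answer is scored, the reward says nothing about how the model reasons. Backtracking and self-correction are rewarded only insofar as they lead to a correct final answer.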
Tradeoffs. GRPO needs G completions per prompt (G=64 in the DeepSeekMath setup), which costs G× the sampling compute of a single rollout during training. PPO needs one completion per prompt plus a value-network update. So GRPO trades critic parameters for inference compute. On modern accelerators, whose FLOPs sit under-utilized at small batch sizes, this trade tips heavily in GRPO's favor.