
GRPO — Group Relative Policy Optimization

The RL algorithm behind DeepSeek-R1, first introduced in the DeepSeekMath paper. Instead of training a separate value network as PPO does, GRPO samples a group of G completions for each prompt and uses the group's mean reward as the baseline. Each completion's advantage is (reward − group mean) / group std. With the critic gone, GRPO trains roughly half as many parameters as PPO at comparable quality, and that lower cost is what made R1's large-scale RL for reasoning practical.

PPO vs GRPO — what changed

PPO (InstructGPT, ChatGPT): ~2× trainable params (policy + critic)
  • Policy πθ — the LLM
  • Reference πref — frozen pre-RLHF copy (KL anchor)
  • Reward model rφ — scores completions
  • Value/critic Vψ — predicts expected reward (~same size as policy)
  • Advantage estimated via GAE on critic outputs
GRPO (DeepSeek-R1): no critic
  • Policy πθ — the LLM
  • Reference πref — KL anchor (same as PPO)
  • Reward model rφ — or rule-based reward
  • No value/critic. Sample G completions, baseline = group mean
  • Advantage = (rᵢ − μ) / σ within the group
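
The group-relative advantage in the last bullet can be sketched in a few lines of plain Python. The function name `group_advantages`, the `eps` guard against a zero standard deviation (all rewards in the group equal), and the use of the population rather than the sample standard deviation are assumptions made for illustration, not details from this page.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / (std + eps) for one prompt's group of rewards.

    eps is a hypothetical safeguard: if every completion gets the same
    reward, std is 0 and all advantages come out as 0 instead of dividing by zero.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored by a rule-based 0/1 reward.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every completion is correct carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, and the advantages within a group sum to roughly zero.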

GRPO loss

L_GRPO = -1/G · Σᵢ [ min( ratioᵢ · Aᵢ ,  clip(ratioᵢ, 1−ε, 1+ε) · Aᵢ ) ]
                                    + β · D_KL( πθ ‖ πref )

where:
  ratioᵢ = πθ(oᵢ|s) / πθ_old(oᵢ|s)        # importance weight
  Aᵢ     = (rᵢ − μ) / σ                       # group-relative advantage  ← the trick
  μ      = mean(r₁ … r_G)
  σ      = std(r₁ … r_G)
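
As a rough sketch, the loss above can be written in plain Python at the sequence level, exactly as the formula states it. Practical implementations usually work per token and on tensors; this version only illustrates the clipping logic. The names `grpo_loss`, `eps_clip`, `beta`, and `kl` are hypothetical, the default values are illustrative, and the KL term is assumed to be estimated elsewhere and passed in as a number.

```python
import math

def grpo_loss(logp_new, logp_old, advantages, eps_clip=0.2, beta=0.04, kl=0.0):
    """Sequence-level GRPO loss for one group of G completions.

    logp_new[i] = log pi_theta(o_i | s), logp_old[i] = log pi_theta_old(o_i | s).
    kl is a precomputed estimate of D_KL(pi_theta || pi_ref) (assumption).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                   # importance weight
        clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
        total += min(ratio * adv, clipped * adv)            # pessimistic surrogate
    return -total / len(advantages) + beta * kl

# Positive advantage, ratio above 1 + eps: the clipped term caps the gain.
capped = grpo_loss([0.5], [0.0], [1.0])

# Negative advantage, same ratio: min() keeps the unclipped, larger penalty.
uncapped = grpo_loss([0.5], [0.0], [-1.0])
```

The asymmetry is the point of the `min`: the policy gets no extra credit for pushing the ratio past the clip range on a good completion, but is still fully penalized for raising the probability of a bad one.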

Why this matters

The critic is a luxury you can't afford. A PPO value network is roughly the size of the policy (often initialized as a copy of the LLM). For a 70B-param policy, that means ~140B trainable params, each carrying gradients and optimizer state, before counting the frozen reference and reward models. GRPO drops the critic entirely.
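
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam at 16 bytes per trainable parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights and the two Adam moments). That accounting is an assumption, not a figure from this page, and it ignores activations and the frozen reference and reward models.

```python
# Assumption: bf16 weights (2) + bf16 grads (2) + fp32 master weights and Adam moments (12).
BYTES_PER_TRAINABLE_PARAM = 16

def training_state_gb(trainable_params_billions):
    """Memory for weights, gradients, and optimizer state, in GB (1 GB = 1e9 bytes)."""
    return trainable_params_billions * 1e9 * BYTES_PER_TRAINABLE_PARAM / 1e9

policy_b = 70                                  # 70B-param policy, as in the text
ppo_gb = training_state_gb(policy_b * 2)       # policy + same-size critic
grpo_gb = training_state_gb(policy_b)          # policy only

print(f"PPO trainable state:  {ppo_gb:.0f} GB")
print(f"GRPO trainable state: {grpo_gb:.0f} GB")
```

Under these assumptions, dropping the critic halves the training state that has to be sharded across the cluster.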

Group baseline = sample-mean Monte Carlo estimate. The critic's whole purpose is to estimate E[r | s], the expected reward for a state (here, the prompt). If you sample G completions from the current policy and average their rewards, you get an unbiased estimate of that quantity without a learned value function. The variance is higher, but the estimate costs nothing extra: you are already paying for G completions per prompt to do the policy update.
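
A small simulation can make the bias and variance claim concrete. It assumes a toy setting in which each completion is correct with a fixed probability `p_correct` and earns a 0/1 reward. The helper `group_mean_baseline`, the probability 0.3, and the group sizes are all hypothetical choices for illustration.

```python
import random
import statistics

def group_mean_baseline(p_correct, G, rng):
    """Sample G Bernoulli rewards for one prompt and return their mean."""
    rewards = [1.0 if rng.random() < p_correct else 0.0 for _ in range(G)]
    return sum(rewards) / G

rng = random.Random(0)   # fixed seed so the run is repeatable
p = 0.3                  # true E[r | s] in this toy setting

# Repeat the estimate many times for a small and a large group.
small = [group_mean_baseline(p, 8, rng) for _ in range(2000)]
large = [group_mean_baseline(p, 64, rng) for _ in range(2000)]

print("G=8 :", statistics.fmean(small), statistics.pstdev(small))
print("G=64:", statistics.fmean(large), statistics.pstdev(large))
```

Both averages should land near the true value of 0.3, while the spread of the estimate shrinks as G grows, roughly as 1/√G.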

Why this enabled R1. DeepSeek showed with R1-Zero that GRPO plus a simple rule-based reward (e.g., "is this final math answer correct?") can RL-train a base LLM with no SFT initialization and still produce long reasoning chains. The "aha moment" reported in the DeepSeek-R1 paper shows R1-Zero spontaneously learning to backtrack and reflect, from pure RL with no human reasoning demonstrations. The lower training cost of GRPO helped make the experiment economically viable.

Tradeoffs. GRPO needs G completions per prompt (G=64 in the DeepSeekMath paper that introduced it), which costs G× more generation per prompt during training. PPO typically samples 1 completion per prompt and also updates the critic. So GRPO trades critic parameters for inference compute. Because autoregressive decoding at small batch sizes leaves accelerator FLOPs under-utilized, batching the G completions is relatively cheap, and the trade often tips in GRPO's favor.