Home › NLP & Transformers › GRPO

GRPO — Group Relative Policy Optimization

The RL algorithm behind DeepSeek-R1. Instead of training a separate value network like PPO does, GRPO samples a group of G completions for each prompt and uses the group's mean reward as the baseline. Each completion's advantage is just (reward − group mean) / group std. Half the parameters of PPO at training time, comparable quality, and the trick that made R1's reasoning trace possible.

Configuration

Group size G 8

Number of completions sampled per prompt

Reward profile

KL coefficient β 0.04

KL penalty against reference policy (anchors training)

Clip range ε 0.20

Per-PPO trust region clip on the importance ratio

Group statistics

Mean μ

—

Std σ

—

Best reward

—

Worst reward

—

Step through

Press Step to walk through one GRPO update.

One GRPO update

Idle

PPO vs GRPO — what changed

PPO (InstructGPT, ChatGPT)

~2× params

• Policy π_θ — the LLM
• Reference π_ref — frozen pre-RLHF copy (KL anchor)
• Reward model r_φ — scores completions
• Value/critic V_ψ — predicts expected reward (~same size as policy)
• Advantage estimated via GAE on critic outputs

GRPO (DeepSeek-R1)

no critic

• Policy π_θ — the LLM
• Reference π_ref — KL anchor (same as PPO)
• Reward model r_φ — or rule-based reward
• No value/critic. Sample G completions, baseline = group mean
• Advantage = (r_i − μ) / σ within the group

GRPO loss

L_GRPO = -1/G · Σᵢ [ min( ratioᵢ · Aᵢ ,  clip(ratioᵢ, 1−ε, 1+ε) · Aᵢ ) ]
                                    + β · D_KL( π_θ ‖ π_ref )

where:
  ratioᵢ = π_θ(oᵢ|s) / π_{θ_old}(oᵢ|s)        # importance weight
  Aᵢ     = (rᵢ − μ) / σ                       # group-relative advantage  ← the trick
  μ      = mean(r₁ … r_G)
  σ      = std(r₁ … r_G)

▶ Why this matters

The critic is a luxury you can't afford. A PPO value network is roughly the size of the policy (often a copy of the LLM). For a 70B-param policy, you need ~140B params resident on the GPUs during training. GRPO drops the critic entirely.

Group baseline = sample-mean Monte Carlo estimate. The whole purpose of the critic is to estimate E[r | s] — the expected reward for a state. If you sample G completions and compute their mean, you have an unbiased estimate without a learned value function. The variance is higher, but so what? You're already paying for G completions per prompt to do the policy update.

Why this enabled R1. DeepSeek showed that with GRPO + a simple rule-based reward (e.g., "is this final math answer correct?"), an LLM can be RL-trained without any SFT initialization and develop emergent reasoning chains. The "aha moment" graphs in the R1-Zero paper show the model spontaneously learning to backtrack and reflect — pure RL with no human reasoning demonstrations. The lower training cost from GRPO is what made the experiment economically viable.

Tradeoffs. GRPO needs G completions per prompt — typically G=64 for DeepSeek — which costs G× more inference at training. PPO does 1 completion per prompt + value head update. So GRPO trades critic parameters for inference compute. For modern accelerators with under-utilized FLOPs at small batch sizes, this trade tips heavily in GRPO's favor.