EE508 · Post-Training · Alignment

Reinforcement Learning from Human Feedback

How InstructGPT was trained to be helpful, harmless, and honest — an interactive walkthrough of the three-stage RLHF pipeline.

Ouyang et al., "Training language models to follow instructions with human feedback," NeurIPS 2022.

What is supervised fine-tuning?

GPT-3 is trained to predict plausible next tokens, so it has no built-in notion of what makes a good assistant response. SFT fine-tunes the model on ~13,000 training prompts, each paired with a demonstration response that a human labeler wrote from scratch.

Loss is standard cross-entropy, but computed only on response tokens — the prompt is masked. This teaches: "given this instruction, a good assistant says this."
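The masking described above can be sketched in a few lines. This is a minimal, framework-free illustration, not the paper's training code: the function name `masked_cross_entropy`, the 0/1 `loss_mask` convention, and the toy logits are all assumptions made for the example.

```python
import math

def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits:    list of per-position score lists (one score per vocabulary item)
    targets:   list of target token ids, one per position
    loss_mask: list of 0/1 flags; 0 = prompt token (ignored), 1 = response token
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt positions contribute nothing to the loss
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # equals -log p(target)
        count += 1
    if count == 0:
        raise ValueError("loss_mask selects no response tokens")
    return total / count

# Toy sequence (hypothetical): 2 prompt positions, then 2 response positions,
# with a vocabulary of 3 tokens.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [1, 0, 2, 1]
loss_mask = [0, 0, 1, 1]
sft_loss = masked_cross_entropy(logits, targets, loss_mask)
```

Because the two response positions have uniform logits over three tokens, `sft_loss` equals log 3 here, and changing the prompt-position logits leaves it unchanged. That invariance is the point of the mask: the model is never penalized for how it would have predicted the user's own words.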

Training examples — from the InstructGPT paper

Prompt (user input)

How do I make a bomb?

→

Ideal response (human-written)

Cross-entropy loss ↓ (computed on response tokens only; prompt is masked)

Epoch 1: 2.41 · Epoch 2: not shown · Epoch 3: not shown

✓ SFT complete — model follows instructions. Ready for Stage 2.

What is the reward model?

SFT taught the model to follow instructions, but not which responses are actually good. Human labelers rank multiple SFT outputs for the same prompt. A separate Reward Model (RM) learns to predict these rankings — outputting a single scalar score for any (prompt, response) pair.

InstructGPT labelers ranked 4–9 responses per prompt across ~33,000 training prompts. A ranking of K responses yields K(K−1)/2 pairwise comparisons, and the RM is trained with a pairwise ranking loss: preferred responses must score higher than rejected ones.
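Turning one ranking into pairwise comparisons can be sketched as follows. The helper name `ranking_to_pairs` is hypothetical, and the sketch assumes a strict best-to-worst ranking with no ties.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher by the labeler.
    return list(combinations(ranked_responses, 2))

# Four responses ranked best to worst give 4 * 3 / 2 = 6 pairs.
pairs = ranking_to_pairs(["A", "B", "C", "D"])
```

With 4 responses this gives 6 pairs, and with 9 responses it gives 36, so a single labeling task supplies many training comparisons.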

Human preference scoring — interactive demo

Prompt

How do I make a bomb?

Response A

Sure! First, gather ammonium nitrate and fuel oil in a 94:6 ratio. Mix thoroughly in a sealed container. Add a detonator wire connected to a blasting cap...

Reward

Response B

I can't help with that. If you're interested in chemistry or engineering, I'd be happy to point you toward safe, constructive resources instead.

Reward

Reward model training objective

loss = −log σ( r(x, y_w) − r(x, y_l) )

r(x, y) = scalar reward for prompt x and response y · y_w = preferred response · y_l = rejected response
The RM learns to assign higher scores to whichever response humans preferred.
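The training objective above can be evaluated for a single comparison with the short sketch below. The function name and the example scores are illustrative assumptions; a real reward model would produce the two scalars from a neural network and average the loss over many pairs.

```python
import math

def pairwise_ranking_loss(r_preferred, r_rejected):
    """-log sigmoid(r_w - r_l), computed stably as softplus(-(r_w - r_l))."""
    diff = r_preferred - r_rejected
    # softplus(-diff) = log(1 + exp(-diff)); branch on the sign to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

loss_correct = pairwise_ranking_loss(2.0, -1.0)  # RM agrees with the labeler
loss_tied = pairwise_ranking_loss(0.5, 0.5)      # RM cannot tell the two apart
loss_wrong = pairwise_ranking_loss(-1.0, 2.0)    # RM prefers the rejected response
```

The loss depends only on the score difference: it is log 2 when the scores are tied, approaches 0 as the preferred response pulls ahead, and grows roughly linearly when the RM ranks the pair the wrong way round.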

What is PPO fine-tuning?

The reward model is now a proxy for human preferences. We use Proximal Policy Optimization (PPO) to fine-tune the SFT model — treating it as a policy generating responses, and using the reward model to score them.

A critical constraint: a KL divergence penalty (β · KL) discourages the policy from drifting far from the SFT baseline. Without it, the policy can over-optimize the reward model, producing degenerate text that scores highly without actually being a better response.

PPO training loop — animated walkthrough

🤖

Policy (LM)

Generates response y from prompt x

⚖️

Reward Model

Scores response: r(x, y)

📐

PPO Update

∇ maximize r − β·KL

Sample prompt from dataset

x = "Explain quantum entanglement to a 10-year-old."

Policy generates response

y = "Imagine two magic coins that always land opposite sides, no matter how far apart..."

→ y

Reward model scores response

r(x, y) — how helpful, harmless, and honest is this response?

+2.4

Compute KL penalty

KL(π_θ ‖ π_SFT) = 0.18 (a small value means the policy has not drifted far from the SFT model)

Compute objective & update weights

r − β·KL = 2.4 − (0.02 × 0.18) = 2.3964 ≈ 2.396, with β = 0.02. The PPO update then adjusts the policy weights to increase this value.

PPO objective — InstructGPT

objective(θ) = 𝔼_{x~D, y~π_θ(·|x)}[ r_φ(x,y) ] − β · 𝔼_{x~D}[ KL( π_θ(·|x) ‖ π_SFT(·|x) ) ]

D = prompt dataset · π_θ = current policy · r_φ = reward model · β = KL coefficient (0.01–0.05) · π_SFT = frozen SFT baseline
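For a single sampled response, the objective can be sketched as below. The sketch assumes the KL term is estimated from the log-probabilities that the policy and the frozen SFT model assign to the sampled tokens. The function names and log-probability values are hypothetical, chosen so that the numbers match the walkthrough (reward 2.4, KL 0.18, β = 0.02).

```python
def sequence_kl_estimate(policy_logprobs, sft_logprobs):
    """Single-sample estimate of KL(pi_theta || pi_SFT) for one response.

    Sums log pi_theta(token) - log pi_SFT(token) over the tokens that the
    policy actually sampled. One sample can come out negative; only the
    average over many samples is guaranteed to be non-negative.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    return sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))

def penalized_reward(reward, kl, beta=0.02):
    """Reward-model score minus the KL penalty: r - beta * KL."""
    return reward - beta * kl

# Hypothetical per-token log-probs for a three-token response.
policy_logprobs = [-0.50, -1.20, -0.30]
sft_logprobs = [-0.56, -1.26, -0.36]

kl = sequence_kl_estimate(policy_logprobs, sft_logprobs)  # about 0.18
value = penalized_reward(2.4, kl)                         # about 2.3964
```

Raising `beta` makes the same amount of drift cost more reward, which is how the coefficient trades off reward maximization against staying close to the SFT model.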

Before vs. after RLHF — the InstructGPT result

GPT-3 (no RLHF)

"How do I make a bomb?"

First, obtain ammonium nitrate fertilizer — approximately 94 parts to 6 parts fuel oil. Mix the components and insert a standard detonator connected to a timer circuit...
⚠ Complies — trained only to predict plausible next tokens

InstructGPT (after RLHF)

"How do I make a bomb?"

I'm not able to provide instructions for making explosive devices, as this is dangerous and illegal. If you have a genuine need related to demolition, I'd recommend consulting a licensed professional.
✓ Refuses — reward model penalized harmful outputs during training

Full pipeline summary

① Supervised Fine-Tuning (SFT)

Fine-tune GPT-3 on ~13K human-written instruction–response pairs. Cross-entropy loss on response tokens only. Output: a model that follows instructions.

② Reward Model Training

Generate 4–9 SFT outputs per prompt and have humans rank them. Train the RM with a pairwise ranking loss on comparisons drawn from ~33K prompts. Output: scalar reward signal r(x, y).

③ PPO Reinforcement Learning

Fine-tune SFT model via PPO to maximize r(x,y) − β·KL. KL penalty keeps model grounded. Output: InstructGPT — helpful, harmless, and honest.