Lecture 08: RLHF — SFT, Reward Model, PPO

Core topics: the InstructGPT three-stage pipeline, supervised fine-tuning, reward modeling, PPO alignment

1. The InstructGPT Three-Stage Pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                    InstructGPT Training Pipeline                        │
├─────────────────┬──────────────────────┬────────────────────────────────┤
│   Stage 1: SFT  │  Stage 2: Reward Model│  Stage 3: PPO (RLHF)         │
│                 │                      │                                │
│  Pretrained LLM │  SFT Model (frozen)  │  RM (frozen) + SFT (ref)      │
│       │         │       │              │       │          │             │
│       ▼         │       ▼              │       ▼          ▼             │
│  Human demos    │  Human preferences   │  RL optimization               │
│  (prompt,resp)  │  (x, y_w ≻ y_l)     │  max r(x,y) - β·KL            │
│       │         │       │              │       │                        │
│       ▼         │       ▼              │       ▼                        │
│  SFT Model      │  Reward Model r_θ    │  PPO Policy π_φ               │
└─────────────────┴──────────────────────┴────────────────────────────────┘

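Data size:       ~13K demos                    ~33K comparisons          ~31K prompts
Labeling cost:   high (write full responses)   medium (rank responses)   none (sampled automatically)

Core ideas:

Pretraining teaches "how to speak" (language ability)
SFT teaches "format" (the instruction-following format)
The RM captures "human preferences" (the preference signal)
PPO optimizes for those preferences under a KL constraint (alignment with human intent)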

2. Stage 1: Supervised Fine-Tuning (SFT)

2.1 SFT Loss: computed on response tokens only

SFT loss: $$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log p_\theta(x_t \mid x_{<t})$$

Key details:

Labels of prompt tokens are set to -100 (the default ignore_index of PyTorch CrossEntropyLoss, so they are skipped automatically)
Gradients come only from the response part, so the model learns "how to answer, given the prompt"
This is still next-token prediction; only the data changes to (instruction, response) pairs

$$\text{Full sequence: } \underbrace{[\text{BOS}] \text{ instruction tokens}}_{\text{label} = -100} \; \underbrace{\text{response tokens } [\text{EOS}]}_{\text{loss computed}}$$
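
The masking rule can be checked on a toy sequence. The sketch below is a minimal pure-Python illustration, not the training code: the token ids and per-token probabilities are made up, and `build_labels` / `masked_nll` are hypothetical helper names. It averages over the unmasked positions, which mirrors the default `mean` reduction of CrossEntropyLoss rather than the plain sum in the formula.

```python
import math

IGNORE_INDEX = -100  # same sentinel value that CrossEntropyLoss ignores by default


def build_labels(prompt_ids, response_ids):
    """Mask prompt positions so that only response tokens carry a label."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not masked.

    token_log_probs[t] stands for log p(x_t | x_<t) of the token at position t.
    """
    kept = [lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return -sum(kept) / len(kept)


prompt_ids = [1, 501, 502, 503]   # hypothetical ids: [BOS] + instruction tokens
response_ids = [900, 901, 2]      # hypothetical ids: response tokens + [EOS]
labels = build_labels(prompt_ids, response_ids)

# Hypothetical model probabilities for the 7 positions of the sequence.
log_probs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5, 0.25, 0.8)]
loss = masked_nll(log_probs, labels)
print(labels)
print(round(loss, 4))
```

Only the last three probabilities (0.5, 0.25, 0.8) enter the loss; changing the first four values leaves it unchanged.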

2.2 Prompt Template (Alpaca format)

Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

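Template design principles:

Explicit delimiters (### Instruction / ### Response)
A consistent format teaches the model when to start generating
At inference time, cut the prompt right after "### Response:\n" and let the model continue from there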
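
As a small sketch of these principles, the helpers below render the template shown above. The function names are hypothetical, and the `</s>` end-of-sequence string is an assumption (in practice, use `tokenizer.eos_token`). The inference prompt stops right after the response marker, and the training text is that same prompt plus the reference answer.

```python
ALPACA_HEADER = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)


def build_alpaca_prompt(instruction, input_text):
    """Render the template up to and including the response marker."""
    return (
        f"{ALPACA_HEADER}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )


def build_training_text(instruction, input_text, output, eos_token="</s>"):
    """Training example = inference prompt + reference answer + EOS."""
    return build_alpaca_prompt(instruction, input_text) + output + eos_token


prompt = build_alpaca_prompt("Translate to French.", "Good morning")
example = build_training_text("Translate to French.", "Good morning", "Bonjour")
print(example)
```

Because both strings share the same prefix, the prompt length in tokens tells the dataset code where the -100 labels end.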

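2.3 Common SFT Datasets

Dataset	Size	Source	Notes
Alpaca	52K	Generated by GPT-3.5	Self-Instruct, low cost
ShareGPT	90K	Collected user conversations	Multi-turn, real user distribution
Dolly	15K	Human-written	Written by Databricks employees
OASST1	160K	Crowdsourced conversation trees	Multi-turn + preference annotations
FLAN Collection	15M	Mixture of NLP tasks	1800+ tasks, large scale
InstructGPT demos	13K	Human-written	High quality, internal to OpenAI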

2.4 Implementation (SFTDataset)

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class SFTDataset(Dataset):
    """SFT dataset: the loss is computed on response tokens only."""
    
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data  # List of {"instruction": ..., "output": ...}
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        # Build the prompt
        prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n"
        response = item['output'] + self.tokenizer.eos_token
        
        # Tokenize prompt and response separately
        prompt_ids = self.tokenizer.encode(prompt, add_special_tokens=False)
        response_ids = self.tokenizer.encode(response, add_special_tokens=False)
        
        # Concatenate and truncate to max_length
        input_ids = prompt_ids + response_ids
        input_ids = input_ids[:self.max_length]
        
        # Build labels: -100 for prompt positions, token ids for response positions
        labels = [-100] * len(prompt_ids) + response_ids
        labels = labels[:self.max_length]
        
        # Pad on the right up to max_length
        pad_len = self.max_length - len(input_ids)
        input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len
        labels = labels + [-100] * pad_len
        attention_mask = [1] * (self.max_length - pad_len) + [0] * pad_len
        
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }


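# Training loop
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

# Placeholder data (assumption): replace with your own list of
# {"instruction": ..., "output": ...} records.
train_data = [{"instruction": "Say hello.", "output": "Hello!"}]
sft_dataset = SFTDataset(train_data, tokenizer, max_length=512)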

training_args = TrainingArguments(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sft_dataset,
)
trainer.train()

2.5 Training Results

Epoch	Train Loss	Eval Loss	PPL
1	1.82	1.75	5.75
2	1.45	1.52	4.57
3	1.21	1.58	4.86

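Note: eval loss rises at epoch 3, which indicates the onset of overfitting. The InstructGPT paper reports the same pattern: validation loss overfits after 1 epoch, yet the authors trained SFT for 16 epochs because additional epochs still improved RM score and human preference ratings. Validation loss alone is therefore not a reliable stopping signal, although stopping early does limit overfitting to the labelers' writing style.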
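
The PPL column is the exponential of the eval loss (mean per-token cross-entropy in nats). The short sketch below recomputes it from the table and picks the epoch with the lowest eval loss; that selection rule is only one possible choice and is an assumption of this example.

```python
import math

# (epoch, eval loss) pairs copied from the table above.
eval_losses = [(1, 1.75), (2, 1.52), (3, 1.58)]

# Perplexity is exp(mean per-token cross-entropy).
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses}

# One possible checkpoint-selection rule: lowest eval loss.
best_epoch = min(eval_losses, key=lambda pair: pair[1])[0]

for epoch, loss in eval_losses:
    print(f"epoch {epoch}: eval loss {loss:.2f}, PPL {perplexities[epoch]:.2f}")
print("best epoch by eval loss:", best_epoch)
```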

3. Stage 2: Reward Model

3.1 Bradley-Terry Loss

Goal: learn a scalar reward function $r_\theta(x, y)$ that assigns higher scores to the responses humans prefer.

Bradley-Terry preference model: $$P(y_w \succ y_l \mid x) = \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$
RM loss: $$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

Intuition:

$y_w$: the response labeled as better by humans (winner)
$y_l$: the response labeled as worse by humans (loser)
$\sigma$: the sigmoid function, which maps the score difference to a probability in [0, 1]
The loss pushes $r(x, y_w) > r(x, y_l)$ (it widens the margin)
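
A numeric sketch of the loss, using made-up reward scores, shows how the margin drives it. The helper names are hypothetical; `log_sigmoid` is written in a numerically stable form so that large negative margins do not overflow.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_prob(r_w, r_l):
    """Bradley-Terry probability that y_w is preferred over y_l."""
    return math.exp(log_sigmoid(r_w - r_l))


def rm_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Hypothetical reward scores for three comparisons:
# correctly ordered, tied, and wrongly ordered.
pairs = [(2.0, 1.0), (0.5, 0.5), (-1.0, 1.0)]
for r_w, r_l in pairs:
    print(r_w, r_l, round(preference_prob(r_w, r_l), 3))
print("mean loss:", round(rm_loss(pairs), 4))
```

A tied pair contributes log 2 to the loss, a correctly ordered pair contributes less, and a wrongly ordered pair contributes more, which is the gradient signal that widens the margin.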

3.2 RM Architecture Code (RewardModel)

import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Reward model: initialized from the SFT model, with the LM head replaced by a scalar head."""
    
    def __init__(self, base_model_name, sft_checkpoint=None):
        super().__init__()
        # Load the base transformer (no LM head needed)
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        
        # Scalar reward head without bias (InstructGPT-style design)
        self.reward_head = nn.Linear(hidden_size, 1, bias=False)
        
        # Initialize the backbone from the SFT checkpoint
        if sft_checkpoint:
            sft_state = torch.load(sft_checkpoint, map_location="cpu")
            self.backbone.load_state_dict(sft_state, strict=False)
    
    def forward(self, input_ids, attention_mask):
        """Return the scalar reward read off at the last token of each sequence."""
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs.last_hidden_state  # (B, T, H)
        
        # Take the hidden state of the last non-pad token.
        # attention_mask.sum() - 1 gives that index, assuming right padding.
        seq_lengths = attention_mask.sum(dim=1) - 1  # (B,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last_hidden = hidden_states[batch_idx, seq_lengths]  # (B, H)
        
        reward = self.reward_head(last_hidden).squeeze(-1)  # (B,)
        return reward


class RMTrainer:
    """RM trainer: implements the Bradley-Terry pairwise loss"""
    
    def compute_loss(self, model, batch):
        # The batch holds tokenized chosen / rejected pairs
        r_chosen = model(
            input_ids=batch["chosen_input_ids"],
            attention_mask=batch["chosen_attention_mask"],
        )
        r_rejected = model(
            input_ids=batch["rejected_input_ids"],
            attention_mask=batch["rejected_attention_mask"],
        )
        
        # Bradley-Terry loss; logsigmoid is numerically safer than log(sigmoid(.))
        loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
        
        # Track pairwise accuracy
        accuracy = (r_chosen > r_rejected).float().mean()
        
        return loss, {"accuracy": accuracy.item()}

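3.3 Key Design Choices

Design	Choice	Reason
Initialization	From the SFT model	The RM must understand language, which the SFT model already does
Training epochs	1 epoch only	Avoids overfitting to noise in the preference labels
Reward head	Linear(H, 1, bias=False)	A bias would cancel in $r(x, y_w) - r(x, y_l)$ and only shift rewards by a constant
Readout position	Last token	The score is produced after the full response has been seen
Model size	6B (InstructGPT)	Smaller than the policy (175B), which saves compute
Data format	Pairwise comparison	Relative rankings are more consistent than absolute scores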

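3.4 Preference Datasets

Dataset	Size	Annotation	Notes
InstructGPT comparisons	33K	Human ranking of 4-9 responses	High quality, in-house labeling team
Anthropic HH-RLHF	170K	Human choice of chosen/rejected	Harmlessness + helpfulness
OpenAssistant (OASST)	90K	Crowdsourced ranking	Multilingual, open source
UltraFeedback	64K	GPT-4 scoring	AI-labeled, low cost
Stanford SHP	385K	Reddit votes	Naturally occurring preference signal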
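
Ranking-style annotation connects to the pairwise loss through a simple expansion: a strict ranking of K responses yields K(K-1)/2 (chosen, rejected) pairs, so 4 to 9 ranked responses give 6 to 36 comparisons per prompt. The sketch below assumes a tie-free ranking and uses hypothetical response names.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (chosen, rejected) pairs.

    Assumes a strict ranking without ties: every earlier item is preferred
    over every later one.
    """
    return list(combinations(ranked_responses, 2))


ranking = ["resp_A", "resp_B", "resp_C", "resp_D"]  # hypothetical, best first
pairs = ranking_to_pairs(ranking)
for chosen, rejected in pairs:
    print(chosen, ">", rejected)
print(len(pairs), "pairs from", len(ranking), "responses")
```

Pairs that come from the same prompt are strongly correlated, so a common design choice is to keep them together in one batch rather than shuffling them across the dataset.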

4. Stage 3: RLHF via PPO

4.1 RLHF Objective

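RLHF objective: $$\text{objective}(\phi) = \mathbb{E}_{(x,y) \sim \pi_\phi^{RL}}\left[r_\theta(x,y) - \beta \log\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]$$

Symbol	Meaning	Role
$\pi_\phi^{RL}$	Current RL policy (actor)	The model being optimized
$\pi^{SFT}$	SFT model (frozen reference)	KL anchor that prevents drift
$r_\theta(x,y)$	Reward model score	Measures response quality
$\beta$	KL penalty coefficient	Trades off reward maximization against staying close to the reference
$\log\frac{\pi_\phi^{RL}}{\pi^{SFT}}$	Log-probability ratio; its expectation under $\pi_\phi^{RL}$ is the KL divergence	Penalizes deviation from the reference policy

Equivalent form (writing the KL term explicitly):

$$\text{objective}(\phi) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi^{RL}(\cdot \mid x)}\left[r_\theta(x,y)\right] - \beta \cdot \mathbb{E}_{x \sim \mathcal{D}}\left[D_{KL}\left(\pi_\phi^{RL}(\cdot \mid x) \,\|\, \pi^{SFT}(\cdot \mid x)\right)\right]$$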
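
For a single sampled response, the bracketed term of the objective can be estimated from per-token log-probabilities, because the sequence log-ratio is the sum of the token log-ratios. The sketch below uses made-up numbers and a hypothetical helper name; it is a one-sample estimate, not the expectation itself.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Single-sample estimate of r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    policy_logprobs / ref_logprobs hold the log-probabilities of the sampled
    response tokens under each model.
    """
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio, log_ratio


# Hypothetical values for a 4-token response.
policy_lp = [-0.5, -1.0, -0.2, -0.3]
ref_lp = [-0.9, -1.4, -0.6, -0.7]
objective, log_ratio = kl_penalized_reward(3.0, policy_lp, ref_lp, beta=0.1)
print(round(log_ratio, 3), round(objective, 3))
```

Here the policy assigns the sampled tokens more probability than the reference does, so the log-ratio is positive and the penalized reward falls below the raw RM score.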

4.2 PPO Clipped Surrogate

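PPO-Clip loss (the negated clipped surrogate objective, to be minimized): $$L^{CLIP}(\phi) = -\mathbb{E}_t\left[\min\left(r_t(\phi) \hat{A}_t,\; \text{clip}\left(r_t(\phi),\; 1-\varepsilon,\; 1+\varepsilon\right) \hat{A}_t\right)\right]$$

Symbol	Definition	Notes
$r_t(\phi)$	$\frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi_{\text{old}}}(a_t \mid s_t)}$	Importance sampling (IS) ratio
$\hat{A}_t$	GAE advantage estimate	Advantage of the chosen action over the baseline
$\varepsilon$	0.2 (common value)	Clip range; bounds the size of a policy update
$\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)$	Clips the ratio to [0.8, 1.2] for $\varepsilon = 0.2$	Prevents an overly large single update

PPO-Clip intuition:

If $\hat{A}_t > 0$ (good action), raising $r_t$ is rewarded only up to $1+\varepsilon$; beyond that the objective is flat
If $\hat{A}_t < 0$ (bad action), lowering $r_t$ is rewarded only down to $1-\varepsilon$; beyond that the objective is flat
Effect: each policy update stays within a "trust region", which makes training more stable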
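
The three cases can be checked numerically with a scalar version of the surrogate. This is a minimal sketch with a hypothetical function name; it evaluates the per-token objective (before negation and averaging) for hand-picked ratios and advantages.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): the gain is capped once the ratio exceeds 1 + eps.
good = clipped_surrogate(1.5, 2.0)      # uses the clipped ratio 1.2
# Bad action (A < 0): no extra credit for pushing the ratio below 1 - eps.
bad = clipped_surrogate(0.5, -1.0)      # uses the clipped ratio 0.8
# Moving the wrong way is never clipped, so the full penalty remains.
wrong_way = clipped_surrogate(1.5, -1.0)
print(good, bad, wrong_way)
```

The asymmetry in the last case is what the `min` provides: clipping removes the incentive to overshoot in the helpful direction but never hides a move in the harmful direction.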

4.3 ActorCritic Architecture Code

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class ActorCritic(nn.Module):
    """PPO actor-critic: actor (policy) + critic (value function)"""
    
    def __init__(self, model_name, sft_checkpoint=None):
        super().__init__()
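        # Actor: the language-model policy π_φ
        self.actor = AutoModelForCausalLM.from_pretrained(model_name)
        if sft_checkpoint:
            self.actor.load_state_dict(torch.load(sft_checkpoint, map_location="cpu"))
        
        # Critic: value function V(s). This version uses a separate backbone;
        # sharing the actor's backbone is the cheaper alternative.
        self.critic_backbone = AutoModelForCausalLM.from_pretrained(model_name)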
        hidden_size = self.critic_backbone.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)
    
    def get_policy(self, input_ids, attention_mask):
        """Return token-level log-probabilities over the vocabulary"""
        outputs = self.actor(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        logits = outputs.logits  # (B, T, V)
        log_probs = torch.log_softmax(logits, dim=-1)
        return log_probs
    
    def get_value(self, input_ids, attention_mask):
        """Return a value estimate at every token position"""
        outputs = self.critic_backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]  # (B, T, H)
        values = self.value_head(hidden).squeeze(-1)  # (B, T)
        return values
    
    def forward(self, input_ids, attention_mask):
        log_probs = self.get_policy(input_ids, attention_mask)
        values = self.get_value(input_ids, attention_mask)
        return log_probs, values

4.4 PPO Training Loop Code

import torch
import torch.nn.functional as F

class PPOTrainer:
    """PPO trainer (simplified)"""
    
    def __init__(self, actor_critic, ref_model, reward_model, 
                 kl_coeff=0.1, clip_eps=0.2, gamma=1.0, lam=0.95, lr=1e-6):
        self.ac = actor_critic
        self.ref_model = ref_model       # frozen SFT model
        self.rm = reward_model           # frozen reward model
        self.kl_coeff = kl_coeff         # β
        self.clip_eps = clip_eps         # ε
        self.gamma = gamma
        self.lam = lam                   # GAE lambda
        # ppo_update() steps this optimizer; lr=1e-6 is a placeholder default
        self.optimizer = torch.optim.AdamW(self.ac.parameters(), lr=lr)
    
    @torch.no_grad()
    def rollout(self, prompts, max_new_tokens=256):
        """Phase 1: sample responses from the current policy"""
        responses = []
        for prompt_ids in prompts:
            # Autoregressive generation
            output = self.ac.actor.generate(
                input_ids=prompt_ids.unsqueeze(0),
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
            responses.append(output[0])
        return responses
    
    @torch.no_grad()
    def compute_rewards(self, prompt_ids, response_ids):
        """Per-token reward: -β * KL at every token, plus the RM score on the last token"""
        full_ids = torch.cat([prompt_ids, response_ids], dim=1)
        attention_mask = (full_ids != 0).long()  # assumes pad_token_id == 0
        
        # RM score (sequence-level), shape (B,)
        rm_score = self.rm(full_ids, attention_mask)
        
        # Per-token log-probs under the policy and the frozen reference
        # (ref_model is assumed to be a Hugging Face causal LM, hence .logits)
        policy_logprobs = self.ac.get_policy(full_ids, attention_mask)
        ref_logits = self.ref_model(input_ids=full_ids, attention_mask=attention_mask).logits
        ref_logprobs = ref_logits.log_softmax(-1)
        
        # KL(π_φ || π_ref) per position, summed over the vocabulary
        kl_per_token = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1)
        
        # KL penalty at every non-pad position (simplification: prompt positions included)
        rewards = -self.kl_coeff * kl_per_token * attention_mask
        # Add the RM score at the last non-pad token of every sequence in the batch
        response_end = attention_mask.sum(-1) - 1
        batch_idx = torch.arange(full_ids.size(0), device=full_ids.device)
        rewards[batch_idx, response_end] += rm_score
        
        return rewards
    
    def compute_gae(self, rewards, values, mask):
        """Generalized Advantage Estimation over masked (response) positions"""
        advantages = torch.zeros_like(rewards)
        last_gae = 0
        
        for t in reversed(range(rewards.size(1))):
            if t == rewards.size(1) - 1:
                next_value = 0
            else:
                # Treat a masked next position as terminal (value 0)
                next_value = values[:, t + 1] * mask[:, t + 1]
            
            delta = rewards[:, t] + self.gamma * next_value - values[:, t]
            # Mask the running sum so padded positions do not leak into real ones
            last_gae = (delta + self.gamma * self.lam * last_gae) * mask[:, t]
            advantages[:, t] = last_gae
        
        returns = advantages + values
        return advantages, returns
    
    def ppo_update(self, batch, n_epochs=4):
        """Phase 2: PPO policy gradient update"""
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        old_log_probs = batch["old_log_probs"]
        advantages = batch["advantages"]
        returns = batch["returns"]
        response_mask = batch["response_mask"]
        
        for epoch in range(n_epochs):
            # Current policy log probs
            log_probs, values = self.ac(input_ids, attention_mask)
            
            # Gather log probs of actual tokens
            action_log_probs = torch.gather(
                log_probs[:, :-1], 2, input_ids[:, 1:].unsqueeze(-1)
            ).squeeze(-1)
            
            # Importance sampling ratio
            ratio = torch.exp(action_log_probs - old_log_probs)
            
            # PPO Clipped objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * advantages
            policy_loss = -torch.min(surr1, surr2)
            policy_loss = (policy_loss * response_mask).sum() / response_mask.sum()
            
            # Value loss: masked mean, so padded positions do not dilute the average
            value_error = (values[:, :-1] - returns) ** 2
            value_loss = (value_error * response_mask).sum() / response_mask.sum()
            
            # Total loss
            loss = policy_loss + 0.5 * value_loss
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.ac.parameters(), 1.0)
            self.optimizer.step()
            self.optimizer.zero_grad()
        
        return {"policy_loss": policy_loss.item(), "value_loss": value_loss.item()}
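
The GAE recursion in compute_gae can be traced by hand on one short response. The sketch below is a pure-Python, single-sequence version with no masking; the rewards and value estimates are made up (small KL penalties on every token, RM score on the last one).

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE for one response; the value after the final token is taken as 0."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical 3-token response.
rewards = [-0.01, -0.02, 1.0]
values = [0.5, 0.6, 0.7]
advantages, returns = compute_gae(rewards, values)
print([round(a, 5) for a in advantages])
print([round(r, 5) for r in returns])
```

With gamma = 1 and lam = 1 the advantage at step t reduces to the sum of future rewards minus V(s_t); lowering lam shortens the horizon and trades variance for bias.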

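4.5 Role of the KL Penalty

Why is a KL penalty needed? To prevent reward hacking.

Without a KL constraint, the policy finds weaknesses in the RM (adversarial examples) and earns high scores while producing garbage:

Aspect	Without KL constraint	With KL constraint
Output quality	Repetitive or meaningless text that the RM still scores highly	Fluent, meaningful answers
RM score	Extremely high (but spurious)	Moderately high (reflects real preferences)
KL distance	Grows without bound	Controlled (typically < 10 nats)
Diversity	Mode collapse	Diversity preserved

Choosing $\beta$:

$\beta$ too large → the policy barely updates and stays equivalent to SFT
$\beta$ too small → reward hacking and degenerate outputs
InstructGPT uses an adaptive $\beta$: increase $\beta$ when KL exceeds the target, decrease it when KL falls below the target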
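
One way to realize "raise β above target, lower it below target" is a proportional controller. The update rule below follows the form described by Ziegler et al. (2019), which is an assumption here since the notes do not give the exact rule; the class name, `target_kl=6.0`, and `k_beta=0.1` are illustrative choices.

```python
class AdaptiveKLController:
    """Proportional controller for beta (assumed update rule, see text)."""

    def __init__(self, init_beta=0.02, target_kl=6.0, k_beta=0.1):
        self.beta = init_beta
        self.target_kl = target_kl
        self.k_beta = k_beta

    def update(self, observed_kl):
        # Relative error, clipped so a single noisy batch cannot swing beta.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + self.k_beta * error
        return self.beta


controller = AdaptiveKLController()
beta_high = controller.update(observed_kl=12.0)  # KL above target: beta grows
beta_low = controller.update(observed_kl=1.0)    # KL below target: beta shrinks
print(beta_high, beta_low)
```

The clipping of the error term keeps each adjustment to at most a few percent, so β moves smoothly over many batches instead of reacting to one outlier.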

4.6 Training Results

Training step	RM Score	KL (nats)	Win Rate vs SFT
0 (SFT init)	2.1	0.0	50%
500	3.2	2.8	62%
1000	3.8	5.1	68%
2000	4.1	7.3	71%
5000	4.3	9.5	72%

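Observation: RM score and win rate rise together but with diminishing returns, while the steadily growing KL shows the policy drifting further from SFT. Training should stop once the policy is "good enough".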

5. Full Hyperparameter Table

Hyperparameter	SFT	RM	PPO
Base model	GPT-3 175B	GPT-3 6B	SFT 175B
Learning rate	2e-5	9e-6	1.5e-5 (actor) / 5e-6 (critic)
Batch size	32	64	512 (rollout) / 64 (update)
Epochs	1 (16 for small)	1	4 (PPO epochs per batch)
Scheduler	Cosine	Cosine	Cosine
Max seq length	2048	2048	512 (prompt) + 256 (response)
Warmup ratio	3%	5%	0
KL coeff ($\beta$)	N/A	N/A	0.02 (adaptive)
Clip $\varepsilon$	N/A	N/A	0.2
GAE $\lambda$	N/A	N/A	0.95
Discount $\gamma$	N/A	N/A	1.0
Grad clip	1.0	1.0	1.0
Data size	~13K demos	~33K comparisons	~31K prompts

6. Qualitative Comparison

Prompt: "Explain the moon landing to a 6 year old in a way that is inspiring."

Model	Example output	Assessment
Base GPT-3	"Explain the sun to a 6 year old. Explain gravity to a 6 year old. Explain..."	Repeats the prompt pattern; fails to follow the instruction
SFT	"So, a long time ago, people wanted to go to the moon. They built a really big rocket and three brave astronauts went inside..."	Correct format, but flat and uninspiring
PPO (RLHF)	"Imagine you could jump SO high that you could touch the stars! Well, some very brave people actually did something like that..."	Vivid, inspiring, suitable for children

Key observations:

Base → SFT: gains the ability to follow instructions (the format is right)
SFT → PPO: response quality improves (more engaging, more creative, closer to human preferences)
PPO learns "what humans prefer", not just "what is correct"

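7. Key Design Choices

#	Design choice	Reason
1	RM initialized from SFT	Language understanding is needed to judge response quality
2	RM trained for only 1 epoch	Preference labels are noisy; more epochs overfit to the noise
3	RM smaller than the policy	PPO keeps 4 models loaded at once, so a small RM saves memory
4	Pairwise ranking instead of pointwise scoring	Humans are more consistent on relative comparisons than on absolute scores (higher inter-annotator agreement)
5	Adaptive KL coefficient	A fixed $\beta$ is hard to balance across training phases; adapting it keeps KL within a target range
6	Mixed-in pretraining gradients	InstructGPT mixes a pretraining loss into the PPO stage to avoid forgetting general capabilities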
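
Row 6 can be written as one extra term added to the RLHF objective of Section 4.1. The form below follows the PPO-ptx objective of Ouyang et al. (2022); note that $\gamma$ here denotes the pretraining-loss coefficient, not the GAE discount factor, and $\mathcal{D}_{\text{pretrain}}$ denotes the pretraining distribution.

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y) \sim \pi_\phi^{RL}}\left[ r_\theta(x,y)
    - \beta \log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim \mathcal{D}_{\text{pretrain}}}\left[ \log \pi_\phi^{RL}(x) \right]
```

Setting $\gamma = 0$ recovers the plain PPO objective; a positive $\gamma$ adds ordinary language-modeling gradients on pretraining text to every update.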

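8. Known Issues and Limitations

Issue	Description	Mitigation
Reward Hacking	The policy exploits RM weaknesses to get spuriously high scores (e.g. verbose answers, repeated flattering phrases)	KL penalty, larger RM, RM ensembles
Distribution Shift	The RM is trained on SFT outputs, but the PPO policy's outputs keep changing, so RM scores become unreliable	Iterative training (collect fresh preference data), PPO-max
High compute cost	PPO keeps 4 models loaded at once (actor, critic, ref, RM), roughly 4x the memory	LoRA, shared backbone, DeepSpeed ZeRO
Training instability	PPO is sensitive to hyperparameters; reward can collapse suddenly	Gradient clipping, small learning rate, reward normalization
Annotator bias	The RM captures annotator preferences (rather than true "correctness")	Diverse labeling team, clear guidelines
Alignment tax	RLHF can lower performance on some NLP benchmarks	Mixed pretraining loss, conservative KL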

9. Formula Cheat Sheet

1. SFT loss: $$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log p_\theta(x_t \mid x_{<t})$$

2. Bradley-Terry (RM loss): $$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

3. RLHF objective: $$\max_\phi \; \mathbb{E}_{x, y \sim \pi_\phi}\left[r_\theta(x,y) - \beta \log\frac{\pi_\phi(y|x)}{\pi^{SFT}(y|x)}\right]$$

4. PPO clipped surrogate (as a loss): $$L^{CLIP} = -\mathbb{E}_t\left[\min\left(r_t \hat{A}_t,\; \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$

5. KL divergence (sum of per-token log-ratios): $$D_{KL}(\pi_\phi \| \pi^{SFT}) = \mathbb{E}_{y \sim \pi_\phi}\left[\sum_t \log\frac{\pi_\phi(y_t|y_{<t}, x)}{\pi^{SFT}(y_t|y_{<t}, x)}\right]$$

10. Further Reading

InstructGPT paper: Training language models to follow instructions with human feedback (Ouyang et al., 2022)
TRL (Transformer Reinforcement Learning): HuggingFace TRL Library, a one-stop training framework for SFT + RM + PPO/DPO
OpenRLHF: a high-performance distributed RLHF framework (Ray + vLLM)
DPO (Direct Preference Optimization): Rafailov et al., 2023; optimizes the policy directly from preference data, with no RM or PPO
Original PPO paper: Proximal Policy Optimization Algorithms (Schulman et al., 2017)
RLHF survey: A Survey of Reinforcement Learning from Human Feedback

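Key Takeaways

The three-stage pipeline is the core framework: SFT (learn the format) → RM (learn preferences) → PPO (optimize for preferences); each stage solves a different problem
SFT computes the loss on the response only: prompt tokens are labeled -100; in essence this is conditional language-model fine-tuning
The Bradley-Terry model turns preferences into classification: $\sigma(r_w - r_l)$ maps a scalar reward difference to a preference probability
The KL penalty is key to making RLHF work: $\beta \cdot D_{KL}(\pi_\phi \| \pi^{SFT})$ guards against reward hacking and mode collapse
PPO-Clip keeps training stable: the clipped objective removes the incentive to move the probability ratio outside $[1-\varepsilon, 1+\varepsilon]$
PPO's four-model problem is the engineering bottleneck: actor + critic + ref + RM need roughly 4x the memory, which motivated lighter alternatives such as DPO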