Lecture 05: The Transformer Architecture

Core topics: Self-Attention, Multi-Head Attention, positional encoding, Encoder-Decoder, machine translation

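1. From RNNs to Attention: Motivation

Core problems of RNNs:

Sequential computation: processing a length-$n$ sequence takes $O(n)$ sequential steps that cannot be parallelized, so GPU utilization is low
Multiplicative gradient chains: for long-range dependencies the gradient passes through many matrix multiplications, which leads to vanishing or exploding gradients
Information bottleneck: all history is compressed into a fixed-dimensional hidden state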

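Attention intuition: the "Bank of the river" example

Understanding what "bank" means requires the context of the whole sentence. Attention lets every position attend directly to all other positions:

$$z_1 = \sum_{j=1}^4 w_{1j} \cdot x_j, \quad w_{1j} = \text{softmax}_j(x_1 \cdot x_j) = \frac{e^{x_1 \cdot x_j}}{\sum_{l=1}^4 e^{x_1 \cdot x_l}}$$

$x_j$ is the input representation at position $j$
$w_{1j}$ is the attention weight of position 1 on position $j$
$z_1$ is the context-aware output at position 1

Key insight: attention computes the relation between any two positions directly, bypassing the sequential propagation bottleneck of RNNs.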
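The formula above can be sketched in a few lines of plain Python. The 2-dimensional vectors below are made-up values for a 4-token sentence, not real embeddings, and no learned projections are involved yet.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical input vectors x_1..x_4 (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# w_{1j}: weights of position 1 (index 0) over all four positions.
w1 = softmax([dot(x[0], xj) for xj in x])

# z_1: weighted average of all input vectors.
z1 = [sum(w * xj[d] for w, xj in zip(w1, x)) for d in range(2)]

print(w1, z1)
```

Positions whose vectors have a larger dot product with $x_1$ receive larger weights, and the weights always sum to 1.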

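2. The Self-Attention Mechanism ⭐⭐

2.1 Three Projection Matrices (Q, K, V)

Each input $x_i \in \mathbb{R}^{d_{\text{model}}}$ is projected by three learnable matrices:

Query: $q_i = (W^Q)^T x_i$, "what am I looking for"
Key: $k_i = (W^K)^T x_i$, "what information do I contain"
Value: $v_i = (W^V)^T x_i$, "what do I provide if matched"

Dimensions: $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, so $q_i, k_i \in \mathbb{R}^{d_k}$ and $v_i \in \mathbb{R}^{d_v}$. Stacking the $x_i^T$ as the rows of $X$ gives the matrix form $Q = XW^Q$, $K = XW^K$, $V = XW^V$.

Intuition: separating Q, K, and V lets the model learn different representations for "querying" and "being queried", which is more flexible than taking dot products of the raw vectors.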

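2.2 Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Why divide by $\sqrt{d_k}$?

Assume the components of $q$ and $k$ are mutually independent with mean 0 and variance 1
Then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$
When $d_k$ is large, the dot products become large in magnitude and softmax is pushed into a saturated region with tiny gradients
Dividing by $\sqrt{d_k}$ rescales the variance to 1 and keeps softmax in a region with useful gradients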
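The variance argument can be checked empirically. The sketch below samples $q$ and $k$ from a standard normal distribution (one distribution that satisfies the assumption above; the trial count and seed are arbitrary choices) and compares the variance of the dot products before and after scaling.

```python
import random
import statistics

def dot_product_variance(d_k, trials=5000, seed=0):
    """Estimate Var(q . k) and Var(q . k / sqrt(d_k)) for i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([s / d_k ** 0.5 for s in dots])
    return raw, scaled

raw_var, scaled_var = dot_product_variance(d_k=64)
print(f"unscaled variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

With $d_k = 64$ the unscaled variance should land near 64 and the scaled one near 1, up to sampling noise.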

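2.3 Worked Computation (Position 3)

Assume sequence length $n=4$ and compute the output $z_3$ at position 3:

Compute $q_3 = (W^Q)^T x_3$
Compute all keys $k_1, k_2, k_3, k_4$ and values $v_1, v_2, v_3, v_4$
Compute the attention scores: $\alpha_{3j} = \frac{q_3 \cdot k_j}{\sqrt{d_k}}, \quad j=1,2,3,4$
Softmax normalization: $w_{3j} = \frac{e^{\alpha_{3j}}}{\sum_{l=1}^4 e^{\alpha_{3l}}}$
Weighted sum: $z_3 = \sum_{j=1}^4 w_{3j} \cdot v_j$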
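The five steps can be traced in NumPy. This is a minimal sketch with toy sizes and random matrices standing in for learned weights (all values are illustrative assumptions); it also checks that the per-position result matches row 3 of the matrix formula from 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 4, 4        # toy sizes chosen for illustration

X = rng.normal(size=(n, d_model))        # rows are x_1..x_4
W_Q = rng.normal(size=(d_model, d_k))    # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

q3 = X[2] @ W_Q                          # step 1: query for position 3 (index 2)
K = X @ W_K                              # step 2: all keys, shape (4, d_k)
V = X @ W_V                              #         and all values, shape (4, d_v)
alpha = K @ q3 / np.sqrt(d_k)            # step 3: scaled scores alpha_{3j}
w = np.exp(alpha - alpha.max())          # step 4: softmax (shifted for stability)
w /= w.sum()
z3 = w @ V                               # step 5: weighted sum of values

# Matrix form: softmax(Q K^T / sqrt(d_k)) V, computed for all positions at once.
S = (X @ W_Q) @ K.T / np.sqrt(d_k)
S = np.exp(S - S.max(axis=-1, keepdims=True))
Z = (S / S.sum(axis=-1, keepdims=True)) @ V

print(np.allclose(Z[2], z3))
```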

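3. Multi-Head Attention ⭐

3.1 Motivation

A single attention head can only learn one "attention pattern"
Multiple heads let the model attend to different kinds of relations at once (syntactic, semantic, positional, and so on)
This is analogous to multiple filters in a CNN capturing different features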

3.2 Formulas

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \cdot W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Standard configuration:

$H = 8$ heads
$d_k = d_v = d_{\text{model}} / H = 512 / 8 = 64$
$W^O \in \mathbb{R}^{Hd_v \times d_{\text{model}}} = \mathbb{R}^{512 \times 512}$

Compute stays the same: although there are $H$ heads, each head works in dimension $d_k = d_{\text{model}}/H$, so the total cost is comparable to single-head attention at dimension $d_{\text{model}}$.
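A quick arithmetic check of this claim, as a sketch: it counts projection parameters (biases ignored) and the multiply-adds needed for the score matrix $QK^T$. The sequence length `n = 100` is an arbitrary illustrative value.

```python
d_model, H = 512, 8
d_k = d_v = d_model // H                      # 64

# Parameters: H heads, each with W_i^Q, W_i^K (d_model x d_k) and W_i^V (d_model x d_v),
# plus the output projection W^O (H*d_v x d_model).
per_head = 2 * d_model * d_k + d_model * d_v
multi_head_params = H * per_head + (H * d_v) * d_model

# One head with d_k = d_v = d_model, plus an output projection of the same size.
single_head_params = 4 * d_model * d_model

# Multiply-adds for the score matrices QK^T at sequence length n.
n = 100
multi_head_score_ops = H * n * n * d_k
single_head_score_ops = n * n * d_model

print(multi_head_params, single_head_params)
print(multi_head_score_ops, single_head_score_ops)
```

Both pairs of numbers come out equal, which is the sense in which the heads split, rather than multiply, the work.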

4. Transformer Block

Each Transformer Block consists of 4 steps:

Step 1: Multi-Head Attention

$u'_i = \text{MultiHead}(x_i, X, X)$, where $x_i$ supplies Q and $X$ supplies K and V.

Step 2: Add & Norm (residual connection + layer normalization)

$$u_i = \text{LayerNorm}(x_i + u'_i; \gamma_1, \beta_1)$$

Step 3: Position-wise Feed-Forward Network (FFN)

$$z'_i = W_2^T \cdot \text{ReLU}(W_1^T \cdot u_i)$$

where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$, and $d_{ff} = 2048$ (bias terms are omitted here for brevity).

Step 4: Add & Norm

$$z_i = \text{LayerNorm}(u_i + z'_i; \gamma_2, \beta_2)$$

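LayerNorm definition

$$\text{LayerNorm}(x; \gamma, \beta) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance computed over the feature dimensions of a single sample (unlike BatchNorm, which computes them across the batch), and $\gamma$, $\beta$ are learnable scale and shift vectors.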
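The definition translates almost directly into code. The sketch below works on a single feature vector in plain Python; the value `eps=1e-5` and the sample input are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the features of one sample (lists of equal length)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d   # biased variance over features
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

y_mean = sum(y) / len(y)
y_var = sum((yi - y_mean) ** 2 for yi in y) / len(y)
print(y, y_mean, y_var)
```

With $\gamma = 1$ and $\beta = 0$ the output has mean 0 and variance very close to 1, independent of the other samples in the batch.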

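5. The Full Transformer Architecture

5.1 Hyperparameters

Hyperparameter	Symbol	Value
Model dimension	$d_{\text{model}}$	512
Key/Value dimension (per head)	$d_k = d_v$	64
Number of attention heads	$H$	8
FFN hidden dimension	$d_{ff}$	2048
Encoder/Decoder layers	$L$	6
Dropout	$p$	0.1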

5.2 Encoder

Input: source-sequence token embeddings + positional encoding
Structure: a stack of $L=6$ identical Transformer Blocks
Each block: Self-Attention → Add&Norm → FFN → Add&Norm
Output: a contextual representation for every position, $\{h_1, \ldots, h_n\}$

5.3 Decoder

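Each Decoder Block has 3 sub-layers:

Masked Self-Attention: attends only to tokens generated so far (prevents information about future tokens from leaking)
Cross-Attention: Q comes from the decoder, K/V come from the encoder output
FFN: the same feed-forward network as in the encoder

Every sub-layer is followed by Add & Norm.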

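5.4 Causal Mask Code

import numpy as np

def causal_mask(size):
    """Build a causal mask that stops the decoder from seeing future tokens."""
    # Strictly upper triangle is -inf; the diagonal and everything below are 0
    mask = np.triu(np.ones((size, size)), k=1)
    mask[mask == 1] = -np.inf
    return mask

# Example: sequence length 4
print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]

Effect: the mask is added to the attention scores before softmax. Entries at $-\infty$ become 0 after softmax, so position $i$ can only attend to positions $\leq i$.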
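To see the effect on the weights themselves, the sketch below adds the mask to a matrix of all-zero scores (a deliberately trivial assumption, so that every allowed position gets equal weight) and applies a row-wise softmax.

```python
import numpy as np

def causal_mask(size):
    """0 on and below the diagonal, -inf strictly above it."""
    mask = np.triu(np.ones((size, size)), k=1)
    mask[mask == 1] = -np.inf
    return mask

def softmax(scores):
    """Row-wise softmax; each row keeps a finite maximum on the diagonal."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)          # exp(-inf) evaluates to 0.0
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))           # placeholder scores: all pairs equally compatible
weights = softmax(scores + causal_mask(4))
print(weights)
```

Row $i$ spreads its weight uniformly over positions $1..i$ and puts exactly 0 on every later position.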

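6. Positional Encoding

6.1 Sinusoidal Positional Encoding Formulas

Self-attention by itself is blind to position (permuting the input tokens simply permutes the outputs in the same way), so position information has to be injected:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position index and $i$ indexes the sine/cosine dimension pairs ($0 \leq i < d_{\text{model}}/2$).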

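6.2 Properties

Uniqueness: every position has its own encoding vector
Boundedness: all values lie in $[-1, 1]$
Relative positions are linearly expressible: for a fixed offset $k$, $PE(pos+k)$ is a linear function of $PE(pos)$
Can extend to lengths not seen in training: the encoding has no learned parameters
Low dimensions = high frequency: they capture fine-grained position; high dimensions = low frequency: they capture coarse position (the wavelength grows from $2\pi$ to $10000 \cdot 2\pi$ as $i$ increases)

Usage: $\text{input}_i = \text{Embedding}(token_i) + PE(i)$, added directly to the token embedding.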
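A minimal NumPy sketch of the encoding, assuming an even `d_model`; the sizes `max_len=50` and `d_model=16` are arbitrary. It also verifies the linearity property on the first sine/cosine pair, where the offset $k$ acts as a 2x2 rotation.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal encodings (d_model even)."""
    pos = np.arange(max_len)[:, None]             # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # the exponents 2i, shape (1, d_model/2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)

# For the first pair (i = 0) the angle is just pos, so shifting by k is a
# rotation by k radians applied to [sin(pos), cos(pos)].
k = 3
rot = np.array([[np.cos(k), np.sin(k)],
                [-np.sin(k), np.cos(k)]])
shifted = pe[:-k, 0:2] @ rot.T
print(np.allclose(shifted, pe[k:, 0:2]))
```

The same construction works for every pair, with the rotation angle scaled by that pair's frequency.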

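7. Complexity Comparison

Model	Complexity per layer	Sequential operations	Maximum path length
Self-Attention	$O(n^2 \cdot d)$	$O(1)$	$O(1)$
RNN	$O(n \cdot d^2)$	$O(n)$	$O(n)$
CNN (kernel $k$)	$O(k \cdot n \cdot d^2)$	$O(1)$	$O(\log_k n)$

Trade-off: the $O(n^2)$ cost of self-attention becomes the bottleneck on long sequences, but at typical lengths ($n < d$) its parallelism and short paths are a large advantage.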
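Plugging numbers into the table makes the crossover at $n = d$ concrete. The sketch below only evaluates the leading terms (constants dropped); the kernel width `k=3` and the two sequence lengths are illustrative assumptions.

```python
def per_layer_cost(n, d, k=3):
    """Leading-order operation counts per layer, taken from the table above."""
    return {
        "self_attention": n * n * d,
        "rnn": n * d * d,
        "cnn": k * n * d * d,
    }

short = per_layer_cost(n=100, d=512)     # n < d
long = per_layer_cost(n=4096, d=512)     # n > d

for name, costs in (("n=100", short), ("n=4096", long)):
    print(name, costs)
```

For $n = 100$ self-attention is the cheapest of the three; for $n = 4096$ it is several times more expensive than the recurrent layer.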

8. Model Size Calculation

Parameter estimate for the Transformer Base model (biases ignored):

Component	Parameters	Calculation
Multi-Head Attention	$4 \cdot d_{\text{model}}^2$	$4 \times 512^2 = 1,048,576$
FFN	$2 \cdot d_{\text{model}} \cdot d_{ff}$	$2 \times 512 \times 2048 = 2,097,152$
LayerNorm (x2)	$4 \cdot d_{\text{model}}$	$4 \times 512 = 2,048$
Encoder layer total	-	$\approx 3.15M$
Encoder (6 layers)	-	$\approx 18.9M$
Decoder (6 layers)	-	$\approx 25.2M$ (each layer adds cross-attention and a third LayerNorm)
Embedding	$\|V\| \cdot d_{\text{model}}$	$37000 \times 512 \approx 19M$
Total (Base)	-	$\approx 63M$ by this estimate (the paper reports $\approx 65M$)
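The estimate can be recomputed from the per-component formulas. This sketch ignores bias terms and counts a single embedding matrix (both are simplifying assumptions), and it treats each decoder layer as two attention blocks, one FFN, and three LayerNorms.

```python
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37000

attn = 4 * d_model ** 2            # W^Q, W^K, W^V, W^O
ffn = 2 * d_model * d_ff           # W_1 and W_2
layer_norm = 2 * d_model           # gamma and beta of one LayerNorm

encoder_layer = attn + ffn + 2 * layer_norm
decoder_layer = 2 * attn + ffn + 3 * layer_norm   # self-attn + cross-attn
embedding = vocab * d_model

encoder = n_layers * encoder_layer
decoder = n_layers * decoder_layer
total = encoder + decoder + embedding

print(f"encoder {encoder / 1e6:.1f}M, decoder {decoder / 1e6:.1f}M, "
      f"embedding {embedding / 1e6:.1f}M, total {total / 1e6:.1f}M")
```

Under these assumptions the decoder comes to about 25.2M and the total to about 63M, a little below the roughly 65M usually quoted for the base model.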

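9. Training Tricks

9.1 Learning-Rate Schedule (Warmup + Inverse Square-Root Decay)

$$lr = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \; \text{step} \cdot \text{warmup\_steps}^{-1.5})$$

Linear increase over the first $\text{warmup\_steps}$ steps
Afterwards, decay proportional to $\text{step}^{-0.5}$
The original paper uses warmup_steps = 4000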
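The schedule is a one-liner in code. In the sketch below the function name `transformer_lr` is a hypothetical label, and clamping `step` to at least 1 is an added guard against dividing by zero at step 0.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warmup followed by inverse square-root decay, as in the formula above."""
    step = max(step, 1)   # guard: step ** -0.5 is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
for s in (1, 1000, 2000, 4000, 8000, 16000):
    print(s, transformer_lr(s))
```

With the default values the rate peaks at step 4000 at about 7.0e-4, is half of that at step 2000 (linear warmup), and falls back to half again by step 16000 (inverse square-root decay).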

9.2 Label Smoothing

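Replace the one-hot target distribution with: $1 - \epsilon$ for the correct class, and $\epsilon / (|V|-1)$ for each of the other classes
The original paper uses $\epsilon = 0.1$
This slightly hurts perplexity (the model learns to be less certain) but improves BLEU, since it discourages overconfident predictions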
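A small sketch of the smoothed target for one token, following the variant described above (other implementations spread $\epsilon$ over all $|V|$ classes instead). The 5-word vocabulary and the function name `smooth_labels` are made up for illustration.

```python
def smooth_labels(target, vocab_size, eps=0.1):
    """Smoothed target distribution: 1 - eps on the target, eps shared by the rest."""
    off_value = eps / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target] = 1.0 - eps
    return dist

dist = smooth_labels(target=2, vocab_size=5)
print(dist, sum(dist))
```

Here the correct class gets 0.9 and each of the four other classes gets 0.025, so the distribution still sums to 1.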

9.3 Embedding Scaling

Multiply the embeddings by $\sqrt{d_{\text{model}}}$ so that their magnitude is comparable to that of the positional encodings:

$$\text{input}_i = \sqrt{d_{\text{model}}} \cdot \text{Embedding}(token_i) + PE(i)$$

9.4 Weight Sharing

The encoder embedding and the decoder embedding share weights
The decoder embedding and the final linear layer (output projection) share weights
This reduces the parameter count and acts as a regularizer

10. Core Code Implementation

Attention function

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def attention(query, key, value, mask=None, dropout=None):
    """Scaled Dot-Product Attention"""
    d_k = query.size(-1)
    # (batch, heads, seq_len, d_k) x (batch, heads, d_k, seq_len)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    attn_weights = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        attn_weights = dropout(attn_weights)
    
    return torch.matmul(attn_weights, value), attn_weights

MultiHeadedAttention class

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # 4 linear layers: W^Q, W^K, W^V, W^O
        self.linears = nn.ModuleList([
            nn.Linear(d_model, d_model) for _ in range(4)
        ])
        self.dropout = nn.Dropout(p=dropout)
        self.attn = None  # stores the attention weights for visualization

    def forward(self, query, key, value, mask=None):
        # mask, if given, must broadcast to (batch, h, seq_len, seq_len);
        # entries equal to 0 are masked out inside attention()
        batch_size = query.size(0)

        # 1) Linear projections, then reshape into heads
        # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
        query, key, value = [
            lin(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears[:3], (query, key, value))
        ]

        # 2) Apply attention to all heads in parallel
        x, self.attn = attention(query, key, value,
                                 mask=mask, dropout=self.dropout)

        # 3) Concatenate the heads and apply the final linear layer
        # (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(
            batch_size, -1, self.h * self.d_k
        )
        return self.linears[-1](x)

PositionwiseFeedForward

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

11. Experimental Results (WMT14)

BLEU score comparison

Model	EN-DE BLEU	EN-FR BLEU	Training cost (FLOPs)
GNMT + RL	24.6	39.92	$2.3 \times 10^{19}$
ConvS2S	25.16	40.46	$9.6 \times 10^{18}$
Transformer (Base)	27.3	38.1	$3.3 \times 10^{18}$
Transformer (Big)	28.4	41.0	$2.3 \times 10^{19}$

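Key findings from the ablation study

Fewer attention heads (e.g. $H=1$): BLEU drops by 0.9
Smaller $d_k$: performance degrades, which suggests that attention needs enough expressive capacity
Larger model ($d_{\text{model}}=1024, H=16$): performance keeps improving
Dropout is essential for regularization; $p=0.1$ was the best of the values tried
Sinusoidal vs. learned positional encoding: nearly identical performance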

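12. The Three Types of Attention in the Transformer

Type	Location	Q source	K/V source	Mask
Encoder Self-Attention	Encoder	Current encoder layer	Current encoder layer	None (fully visible)
Masked Self-Attention	Decoder	Current decoder layer	Current decoder layer	Causal mask (past positions only)
Cross-Attention	Decoder	Current decoder layer	Final encoder output	None

Mnemonic: Q decides "who is asking", K/V decide "where to look for the answer". Cross-Attention lets the decoder pull information out of the encoder (the source language).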
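The three variants differ only in where Q and K/V come from and whether a mask is applied. The NumPy sketch below makes that visible through the shapes of the weight matrices; it leaves out the learned projections and uses random vectors with made-up lengths (5 source tokens, 3 target tokens).

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask is additive (0 or -inf)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(5, d))    # stand-in for encoder states (source length 5)
dec = rng.normal(size=(3, d))    # stand-in for decoder states (target length 3)

# Additive causal mask: -inf strictly above the diagonal, 0 elsewhere.
causal = np.where(np.triu(np.ones((3, 3), dtype=bool), k=1), -np.inf, 0.0)

_, w_enc = attention(enc, enc, enc)                  # encoder self-attention
_, w_dec = attention(dec, dec, dec, mask=causal)     # masked decoder self-attention
out, w_cross = attention(dec, enc, enc)              # cross-attention

print(w_enc.shape, w_dec.shape, w_cross.shape, out.shape)
```

The cross-attention weights have shape (3, 5): one row per decoder position, one column per source position, and the output keeps the decoder's length.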

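Key Takeaways

Self-Attention is the core innovation: every position attends directly to all positions, giving an $O(1)$ maximum path length and largely removing the long-range dependency problem of RNNs
The scaling factor $\sqrt{d_k}$ keeps dot products from growing so large that softmax saturates, which is key to stable training
Multi-Head = multiple views: $H$ independent attention heads capture different kinds of dependencies without increasing the total compute
Positional encoding compensates for attention being blind to order: sinusoidal encodings need no learning and can extrapolate; modern models mostly use improved schemes such as RoPE
Residual connections + LayerNorm are what make deep Transformers trainable, keeping gradients flowing and training stable
Encoder and decoder interact through Cross-Attention: the decoder's Q queries the encoder's K/V, which enables conditional generation
"Attention Is All You Need": dropping RNNs and CNNs entirely and relying on attention alone reached SOTA, with much faster training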
