Reiner Pope
lecture 2026-05-04 · Dwarkesh Podcast
Reiner Pope: The Math Behind How LLMs Are Trained and Served
🔑 Key Insights
"If you do not batch together many users, the cost and the economics you get can be a thousand times worse than if you do batch many users together."
💡 Serving users without batching can cost 1000× more than serving them batched; batching is the core lever of inference economics
"FLOPs over memory bandwidth is a dimensionless constant of around 300 on most GPUs — from A100 to H100 to B100 it has remained remarkably stable."
💡 FLOPs / memory bandwidth ≈ 300, and it has changed little from A100 to H100 to B100
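To make the ratio concrete, here is a minimal sketch of the division behind it. The function name and the hardware numbers are illustrative assumptions (a hypothetical accelerator, not a vendor spec); only the "≈ 300" target comes from the lecture.

```python
def compute_to_bandwidth_ratio(peak_flops_per_s: float, mem_bytes_per_s: float) -> float:
    """Operations the chip can perform in the time it takes to move one byte."""
    return peak_flops_per_s / mem_bytes_per_s


# Hypothetical accelerator (assumed numbers): 1e15 FLOP/s and 3.3e12 bytes/s.
ratio = compute_to_bandwidth_ratio(1.0e15, 3.3e12)
print(f"FLOPs per byte moved: {ratio:.0f}")
```

Plugging in published peak figures for a real chip should land in the same ballpark if the lecture's claim holds, though the exact value depends on which precision and which memory configuration are used.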
"Because of RL, models may be 100x over-trained beyond Chinchilla-optimal."
💡 RL signal is scarcer than plain text supervision, so the amount of training may exceed the Chinchilla-optimal point by 100×
"As we now know, pipelining is not wise — pipeline bubbles dominate at scale."
💡 Ilya is referring to the idle bubbles between pipeline stages, which badly hurt efficiency at large scale
"DeepSeek V3 activates 32 out of 256 experts. This gives a sparsity of 8, which means you need a batch size of roughly 300 × 8 = 2,400 to fully amortize weight fetches."
💡 The optimal batch size for DeepSeek V3 is ≈ 2400-3000; it is independent of model size and depends only on sparsity
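One way to see why model size drops out: setting $t_{\text{compute}} = B \cdot N_{\text{active}} / \text{FLOPs}$ equal to the weight-read time $N_{\text{total}} / B_w$ gives $B = (\text{FLOPs} / B_w) \cdot (N_{\text{total}} / N_{\text{active}})$, i.e. the hardware ratio times the sparsity. The sketch below just evaluates that product with the figures quoted above; the function name is a hypothetical helper, and the KV cache term is ignored.

```python
def break_even_batch(hw_ratio: float, active_experts: int, total_experts: int) -> float:
    """Batch size at which compute time matches weight-read time (KV cache ignored)."""
    sparsity = total_experts / active_experts  # N_total / N_active under this simplification
    return hw_ratio * sparsity


# Figures as quoted in the lecture: ratio of about 300, 32 of 256 experts active.
batch = break_even_batch(300, active_experts=32, total_experts=256)
print(batch)
```

Only the ratio of total to active experts enters, so doubling every parameter count leaves the result unchanged.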
"In neural nets and cryptography, the fields independently converged on similar ideas — both are about manipulating information under constraints."
💡 Neural nets and cryptography independently discovered similar tricks, which hints at a deeper underlying necessity
Summary
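- Roofline model: inference time = max(compute time, memory time); batch size sets the trade-off
- Two core formulas: $t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}$, $t_{\text{mem}} = \frac{N_{\text{total}}}{B_w} + \frac{B \cdot L \cdot s}{B_w}$, where $B$ is the batch size, $N_{\text{active}}$ and $N_{\text{total}}$ are the active and total parameter counts, $B_w$ is memory bandwidth, $L$ is context length, and $s$ is the KV cache size per token
- Latency vs. cost: latency has a lower bound, while cost per token falls hyperbolically with batch size toward a constant
- Slow mode does not help: the KV cache cannot be shared across users, and as the batch shrinks to 1 the cost per token blows up
- Optimal batch size: $B \approx 300 \times \text{sparsity}$ (DeepSeek 32/256 → 300×8 ≈ 2400), independent of model size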
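- Throughput: batch = 2000 with one batch every 20 ms works out to 2000 / 0.02 s = 100K tokens/s; the notes quote ≈ 128K tokens/s, which would correspond to a slightly larger batch or a shorter step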
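- Reading all of HBM once takes ≈ 15-20 ms (little change across A100 → H100 → B100 → Rubin)
- KV cache and context length: past the optimal length, inference quickly becomes memory-bound and MFU drops sharply
- Sparse attention: DeepSeek introduces a square root into the memory fetch, improving on $O(n^2)$ scaling
- MoE scaling law: adding experts still improves quality, and DeepSeek's newer MoE is more efficient
- Pipeline bubble: waiting time between layers badly hurts utilization at large batch
- API pricing can be used to back out the cost structure of KV cache memory
- Convergent evolution with cryptography: neural nets and cryptography independently discovered similar tricks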
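
The roofline formulas in the summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecturer's code: the `Setup` class, the function names, and every number in `example` are hypothetical, chosen so that the compute-to-bandwidth ratio is 300 and the sparsity is 8. Sizes and bandwidth share the same units, following the notes' simplified formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    flops: float               # peak compute rate (the "FLOPs" term)
    bandwidth: float           # memory bandwidth B_w, in size units per second
    n_total: float             # N_total: all parameters read on every step
    n_active: float            # N_active: parameters used per token
    context_len: float = 0.0   # L: tokens of context per sequence
    kv_per_token: float = 0.0  # s: KV cache size per token of context


def t_compute(batch: float, s: Setup) -> float:
    """Compute time for one decoding step over the whole batch."""
    return batch * s.n_active / s.flops


def t_mem(batch: float, s: Setup) -> float:
    """Memory time: one weight read (shared) plus per-sequence KV cache reads."""
    weights = s.n_total / s.bandwidth
    kv_cache = batch * s.context_len * s.kv_per_token / s.bandwidth
    return weights + kv_cache


def step_time(batch: float, s: Setup) -> float:
    """Roofline: the step takes as long as the slower of compute and memory."""
    return max(t_compute(batch, s), t_mem(batch, s))


def cost_per_token(batch: float, s: Setup) -> float:
    """Machine-seconds spent per generated token (one token per sequence per step)."""
    return step_time(batch, s) / batch


def tokens_per_second(batch: float, step_seconds: float) -> float:
    """Aggregate throughput when each step emits one token per sequence."""
    return batch / step_seconds


# Made-up numbers: flops / bandwidth = 300 and n_total / n_active = 8.
example = Setup(flops=3.0e14, bandwidth=1.0e12, n_total=1.6e10, n_active=2.0e9)

for batch in (1, 100, 2400, 4800):
    print(batch, step_time(batch, example), cost_per_token(batch, example))
```

With these assumed numbers the weight read alone takes 16 ms, the crossover sits at a batch of 2400, and cost per token falls as $1/B$ below the crossover and stays flat above it. Setting `context_len` and `kv_per_token` to nonzero values adds memory traffic that grows with the batch, which is why the KV cache term does not amortize across users. As a sanity check on the throughput bullet, `tokens_per_second(2000, 0.020)` evaluates to about 100,000.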