Reiner Pope
lecture 2026-05-04 · Dwarkesh Podcast

Reiner Pope: The Math Behind How LLMs Are Trained and Served

🔑 Key Insights
"If you do not batch together many users, the cost and the economics you get can be a thousand times worse than if you do batch many users together."

💡 Serving users without batching costs about 1000x more than serving them in batches; batching is the core lever of inference economics.

"FLOPs over memory bandwidth is a dimensionless constant of around 300 on most GPUs — from A100 to H100 to B100 it has remained remarkably stable."

💡 FLOPs / memory bandwidth ≈ 300, with little change from A100 to H100 to B100.

"Because of RL, models may be 100x over-trained beyond Chinchilla-optimal."

💡 RL signal is scarcer than plain-text supervision, so models may be trained up to 100x beyond the Chinchilla-optimal point.

"As we now know, pipelining is not wise — pipeline bubbles dominate at scale."

💡 Ilya is referring to the idle bubbles between pipeline stages, which severely hurt efficiency at scale.

"DeepSeek V3 activates 32 out of 256 experts. This gives a sparsity of 8, which means you need a batch size of roughly 300 × 8 = 2,400 to fully amortize weight fetches."

💡 The optimal batch size for DeepSeek V3 is ≈ 2400-3000; it is independent of model size and depends only on sparsity.
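
One way to see where the 300 × sparsity figure could come from, as a sketch rather than the lecture's own derivation: set the roofline compute time equal to the weight-fetch time and solve for the batch size $B$. This assumes the KV-cache term is negligible, takes $\text{FLOPs}/B_w \approx 300$ as given, reads sparsity as $N_{\text{total}}/N_{\text{active}}$ (approximated here by the expert ratio 256/32), and ignores constant factors such as FLOPs per parameter and bytes per parameter.

$$
\frac{B \cdot N_{\text{active}}}{\text{FLOPs}} = \frac{N_{\text{total}}}{B_w}
\;\Longrightarrow\;
B = \frac{\text{FLOPs}}{B_w} \cdot \frac{N_{\text{total}}}{N_{\text{active}}} \approx 300 \times \frac{256}{32} = 2400
$$

Under these assumptions the absolute parameter count cancels out, leaving only the ratio $N_{\text{total}}/N_{\text{active}}$, which is consistent with the claim that the optimum does not depend on model size.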

"In neural nets and cryptography, the fields independently converged on similar ideas — both are about manipulating information under constraints."

💡 Neural networks and cryptography independently discovered similar techniques, which suggests a deeper underlying necessity.

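  • Roofline model: inference time = max(compute time, memory time); batch size sets the trade-off
  • Two core formulas: $t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}$, $t_{\text{mem}} = \frac{N_{\text{total}}}{B_w} + \frac{B \cdot L \cdot s}{B_w}$
  • Latency vs. cost: latency has a floor; cost per token falls hyperbolically with batch size and approaches a constant
  • Slow mode does not help: the KV cache cannot be shared across users, and at batch = 1 the cost tends toward infinity
  • Optimal batch size: $B \approx 300 \times \text{sparsity}$ (DeepSeek 32/256 → 300 × 8 ≈ 2400), independent of model size
  • Throughput: batch = 2000, one batch every 20 ms, ≈ 128K tokens/s
  • Reading all of HBM once takes ≈ 15-20 ms (little change across A100 → H100 → B100 → Rubin)
  • KV cache and context length: beyond the optimal length, inference quickly becomes memory-bound and MFU drops sharply
  • Sparse attention: DeepSeek introduces a √ into the memory-fetch term, improving on $O(n^2)$ scaling
  • MoE scaling law: adding more experts still improves quality; DeepSeek's newer MoE is more efficient
  • Pipeline bubbles: waiting time between layers severely hurts utilization at large batch sizes
  • The cost structure of KV-cache memory can be inferred from API pricing
  • Convergent evolution with cryptography: neural networks and cryptography independently discovered similar techniques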
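
The roofline formulas in the list above can be turned into a small calculator. The following is a minimal sketch, not code from the lecture: the `Roofline` class, its field names, and the numbers are all hypothetical, chosen in arbitrary units so that FLOPs / bandwidth = 300 and $N_{\text{total}}/N_{\text{active}} = 8$. It reads $B$ as batch size, $B_w$ as memory bandwidth, $L$ as context length, and $s$ as KV-cache size per context token, and it sets the KV-cache term to zero by default.

```python
from dataclasses import dataclass


@dataclass
class Roofline:
    """Toy roofline model. All values are illustrative, in arbitrary units."""
    flops: float                # peak compute rate (FLOPs)
    bandwidth: float            # memory bandwidth (B_w)
    n_total: float              # total parameters fetched per step (N_total)
    n_active: float             # parameters active per token (N_active)
    kv_per_token: float = 0.0   # KV-cache size per context token (s), assumed 0 here
    context_len: float = 0.0    # context length per sequence (L), assumed 0 here

    def t_compute(self, batch: int) -> float:
        # t_compute = B * N_active / FLOPs
        return batch * self.n_active / self.flops

    def t_mem(self, batch: int) -> float:
        # t_mem = N_total / B_w + B * L * s / B_w
        weights = self.n_total / self.bandwidth
        kv = batch * self.context_len * self.kv_per_token / self.bandwidth
        return weights + kv

    def step_time(self, batch: int) -> float:
        # Roofline: a step takes as long as the slower of compute and memory.
        return max(self.t_compute(batch), self.t_mem(batch))

    def cost_per_token(self, batch: int) -> float:
        # One step produces one token per sequence in the batch.
        return self.step_time(batch) / batch

    def balanced_batch(self) -> float:
        # Batch size where compute time equals weight-fetch time (KV term ignored).
        return (self.flops / self.bandwidth) * (self.n_total / self.n_active)


# Hypothetical hardware and model: ratio 300, sparsity 8.
model = Roofline(flops=300.0, bandwidth=1.0, n_total=8.0, n_active=1.0)

if __name__ == "__main__":
    print("balanced batch:", model.balanced_batch())
    for b in (1, 64, 2400, 10000):
        print(b, model.step_time(b), model.cost_per_token(b))
```

With these toy numbers the step time stays flat until the batch reaches the balanced point, so cost per token falls as 1/B and then levels off, matching the hyperbolic-then-constant shape described in the list. Setting `kv_per_token` and `context_len` to nonzero values adds a memory cost that grows with the batch, which is one way to explore the memory-bound regime at long context.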