Reiner Pope: The Math Behind How LLMs Are Trained and Served

🔑 Key Insights

"If you do not batch together many users, the cost and the economics you get can be a thousand times worse than if you do batch many users together."

💡 Serving users unbatched can cost up to 1,000× more than serving them batched; batching is the central lever of inference economics.

"FLOPs over memory bandwidth is a dimensionless constant of around 300 on most GPUs — from A100 to H100 to B100 it has remained remarkably stable."

💡 FLOPs / memory bandwidth ≈ 300, and the ratio has changed little from A100 to H100 to B100.

"Because of RL, models may be 100x over-trained beyond Chinchilla-optimal."

💡 RL signal is scarcer than plain-text supervision, so models may be trained as much as 100× past the Chinchilla-optimal point.

"As we now know, pipelining is not wise — pipeline bubbles dominate at scale."

💡 Ilya is referring to the idle bubbles between pipeline stages, which badly hurt efficiency at scale.

"DeepSeek V3 activates 32 out of 256 experts. This gives a sparsity of 8, which means you need a batch size of roughly 300 × 8 = 2,400 to fully amortize weight fetches."

💡 The optimal batch size for DeepSeek V3 is ≈ 2,400-3,000; it is independent of model size and depends only on sparsity.
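
The 300 × 8 figure can be recovered with a short worked derivation. This is a sketch under two assumptions that the quote does not state: the KV-cache term is negligible at the crossover, and sparsity is taken to be $N_{\text{total}} / N_{\text{active}}$. Here $B$ is the batch size, $\text{FLOPs}$ the peak compute rate, and $B_w$ the memory bandwidth. Setting compute time equal to weight-fetch time gives:

$$
\frac{B \cdot N_{\text{active}}}{\text{FLOPs}} = \frac{N_{\text{total}}}{B_w}
\quad\Longrightarrow\quad
B^{*} = \frac{\text{FLOPs}}{B_w} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
\approx 300 \times \frac{256}{32} = 300 \times 8 = 2400
$$

Model size cancels out of $B^{*}$: only the hardware ratio and the sparsity ratio remain, which is why the result does not depend on how large the model is.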

"In neural nets and cryptography, the fields independently converged on similar ideas — both are about manipulating information under constraints."

💡 Neural nets and cryptography independently arrived at similar techniques, which suggests a deeper underlying necessity.

Summary

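Roofline model: inference time = max(compute time, memory time); batch size sets the trade-off between the two.
Two core formulas: $t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}$, $t_{\text{mem}} = \frac{N_{\text{total}}}{B_w} + \frac{B \cdot L \cdot s}{B_w}$, where $B$ is the batch size, $N_{\text{active}}$ and $N_{\text{total}}$ are the active and total parameter counts, $B_w$ is the memory bandwidth, $L$ is the context length, and $s$ is the KV-cache size per token.
Latency vs. cost: latency has a floor; cost per token falls hyperbolically with batch size and levels off at a constant.
"Slow mode" does not help: the KV cache cannot be shared across users, and at batch = 1 the per-token cost blows up because the weight fetch is not amortized.
Optimal batch size: $B \approx 300 \times \text{sparsity}$ (DeepSeek 32/256 → 300 × 8 ≈ 2400), independent of model size.
Throughput: batch = 2000, one batch every 20 ms, ≈ 128K tokens/s.
One full read of HBM takes ≈ 15-20 ms (little change across A100 → H100 → B100 → Rubin).
KV cache and context length: past the optimal length, inference quickly becomes memory-bound and MFU drops sharply.
Sparse attention: DeepSeek introduces a square root into the memory fetch, improving on $O(n^2)$ scaling.
MoE scaling law: adding more experts still improves quality; DeepSeek's newer MoE is more efficient.
Pipeline bubble: waiting between layers (pipeline stages) severely hurts utilization at large batch sizes.
The KV-cache memory cost structure can be inferred from API pricing.
Convergent evolution with cryptography: neural nets and cryptography independently arrived at similar techniques.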
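
The roofline points in the summary can be sketched numerically. The snippet below is a minimal illustration, not the speaker's code: the `Roofline` class and its field names are hypothetical, and the units are normalized (an assumption) so that FLOPs / $B_w$ = 300 and $N_{\text{total}} / N_{\text{active}}$ = 8. The KV-cache term defaults to zero to isolate the weight-fetch effect.

```python
from dataclasses import dataclass


@dataclass
class Roofline:
    """Illustrative roofline parameters; all values are assumptions, not measurements."""
    flops: float               # peak compute rate (normalized units)
    bandwidth: float           # memory bandwidth B_w (normalized units)
    n_active: float            # N_active: parameters used per token
    n_total: float             # N_total: parameters fetched from memory per step
    kv_per_token: float = 0.0  # s: KV-cache size per context token
    context_len: float = 0.0   # L: context length per sequence

    def t_compute(self, batch: int) -> float:
        # t_compute = B * N_active / FLOPs
        return batch * self.n_active / self.flops

    def t_mem(self, batch: int) -> float:
        # t_mem = N_total / B_w + B * L * s / B_w
        weights = self.n_total / self.bandwidth
        kv = batch * self.context_len * self.kv_per_token / self.bandwidth
        return weights + kv

    def step_time(self, batch: int) -> float:
        # Roofline: the step takes as long as the slower of the two.
        return max(self.t_compute(batch), self.t_mem(batch))

    def cost_per_token(self, batch: int) -> float:
        # One step yields one token per sequence in the batch.
        return self.step_time(batch) / batch

    def crossover_batch(self) -> float:
        # Batch size where t_compute equals the weight fetch (KV term ignored).
        return (self.flops / self.bandwidth) * (self.n_total / self.n_active)


# Normalized so that FLOPs / B_w = 300 and sparsity = 8 (assumed values).
model = Roofline(flops=300.0, bandwidth=1.0, n_active=1.0, n_total=8.0)

for b in (1, 100, 2400, 4800):
    print(b, model.step_time(b), model.cost_per_token(b))
print("crossover batch:", model.crossover_batch())
```

In this toy setting the step time stays flat below the crossover batch (the latency floor), while cost per token falls as 1/B and then levels off once compute becomes the bottleneck. As a separate check on the throughput line, 2000 tokens per 20 ms step works out to 100K tokens/s; the ≈ 128K figure corresponds to a step of roughly 15.6 ms, which falls inside the 15-20 ms HBM read range.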