Scaling Laws, Carefully
Main idea
In a long essay, Lilian Weng reconstructs the history of neural scaling laws from Amari and Hestness through Kaplan (2020) and Chinchilla (2022). She argues that the famous disagreement between the latter two (Kaplan favoring model-heavy scaling, Chinchilla favoring balanced model/data scaling) largely stems from a methodological artifact: Kaplan fit curves on smaller models, where embedding parameters dominate the total parameter count, so the way parameters are counted biases the fit.
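To see why small models are sensitive to this choice, the sketch below estimates the share of parameters that sit in the embedding table at several sizes. It is an illustration only, not taken from the essay: the `12 * n_layer * d_model^2` count is the usual rough approximation for a standard transformer's non-embedding weights, and the vocabulary size and the three configurations are hypothetical.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    # Rough approximation for a standard transformer block stack
    # (attention + 4x-wide MLP); biases and layer norms are ignored.
    return 12 * n_layer * d_model ** 2


def embedding_params(vocab_size: int, d_model: int) -> int:
    # Token embedding table only; positional embeddings are ignored.
    return vocab_size * d_model


def embedding_share(n_layer: int, d_model: int, vocab_size: int = 50_000) -> float:
    # Fraction of total parameters that live in the embedding table.
    emb = embedding_params(vocab_size, d_model)
    total = emb + non_embedding_params(n_layer, d_model)
    return emb / total


# Hypothetical configurations: (name, n_layer, d_model)
configs = [("tiny", 2, 128), ("small", 6, 512), ("large", 48, 4096)]
shares = {name: embedding_share(n_layer, d_model) for name, n_layer, d_model in configs}

for name, share in shares.items():
    print(f"{name:>6}: {share:.1%} of parameters are embeddings")
```

Under these assumptions the embedding table is most of the tiny model and a small fraction of the large one, so the counting convention matters mainly at the small end of a fit.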
Context
The essay lands at a moment when frontier labs (OpenAI GPT-5.5, Anthropic Claude Opus 4.8, Google Gemini 3.5) have moved to MoE architectures and are investing billions in training runs where the optimal split between model size and data determines the return on compute. At the same time, the industry is hitting the 'data wall': training tokens are limited, and repeated tokens give diminishing returns. Weng publishes rarely; this is her first long technical post in several months.
Why it matters
For ML engineers and researchers, this is likely to become a reference post cited in pretraining and compute-allocation debates. For decision makers at large labs, the important takeaway is that scaling-law coefficients are surprisingly sensitive to procedural choices (how parameters are counted, rounding, precision), so billion-dollar bets built on them carry more uncertainty than intuition suggests.
Details / arguments
- The Kaplan vs. Chinchilla gap is largely about whether embedding weights count in the parameter total
- In today's data-constrained regime, repeated tokens yield diminishing returns
- Larger models in the repeat-token regime overfit more, so naive extrapolation fails
- Procedural choices (rounding, fitting precision) shift coefficients more than expected
- The essay is ~25 minutes of reading and includes formal derivations
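
The first point, and the broader sensitivity to procedural choices, can be illustrated with a toy fit. The sketch below generates synthetic losses that follow an exact power law in total parameters, then fits the exponent twice: once against total parameters and once against non-embedding parameters. Everything here is assumed for illustration (the exponent `TRUE_ALPHA`, the vocabulary size, the model shapes, and the `12 * n_layer * d_model^2` approximation); it does not reproduce either paper's data or the essay's derivations.

```python
import math


def fit_power_law_exponent(xs, ys):
    # Ordinary least-squares slope of log(y) against log(x).
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


VOCAB = 50_000      # hypothetical vocabulary size
TRUE_ALPHA = 0.3    # hypothetical exponent used to generate the data

# Hypothetical model shapes, smallest to largest.
d_models = [128, 256, 512, 1024, 2048]
n_layers = [2, 4, 8, 16, 24]

non_emb = [12 * l * d * d for l, d in zip(n_layers, d_models)]
total = [n + VOCAB * d for n, d in zip(non_emb, d_models)]

# Noise-free synthetic losses: an exact power law in TOTAL parameters.
losses = [t ** -TRUE_ALPHA for t in total]

slope_total = fit_power_law_exponent(total, losses)
slope_non_emb = fit_power_law_exponent(non_emb, losses)

print(f"fit against total params:         {slope_total:.3f}")
print(f"fit against non-embedding params: {slope_non_emb:.3f}")
```

With these made-up shapes the second fit recovers a shallower exponent than the one used to generate the data, even though the losses are noise-free; only the definition of the parameter count changed.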