Weng Insights 🔥 Top

Scaling Laws, Carefully

Friday, June 26, 2026 • Source: Lil'Log

Main idea

In a long essay, Lilian Weng reconstructs the history of neural scaling laws from Amari and Hestness through Kaplan (2020) and Chinchilla (2022). She argues that the famous disagreement between the two (model-heavy scaling vs. balanced model/data scaling) largely stems from a methodological artifact: Kaplan fitted the curve on smaller models, where embedding parameters dominate the total parameter count, and that biased the fit.
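To make the embedding effect concrete, here is a minimal sketch that is not taken from the essay. It assumes the common approximation of 12 · n_layer · d_model² non-embedding parameters for a decoder-only transformer, a single tied embedding matrix, and a hypothetical 50,000-token vocabulary. The three configurations are illustrative only.

```python
def param_counts(n_layer, d_model, vocab_size=50_000):
    """Return (non_embedding, embedding) parameter counts.

    Assumption: 12 * n_layer * d_model**2 approximates the attention and
    feed-forward weights; the embedding is one vocab_size x d_model matrix.
    """
    non_embedding = 12 * n_layer * d_model ** 2
    embedding = vocab_size * d_model
    return non_embedding, embedding


def embedding_share(n_layer, d_model, vocab_size=50_000):
    """Fraction of all parameters that sit in the embedding matrix."""
    non_embedding, embedding = param_counts(n_layer, d_model, vocab_size)
    return embedding / (non_embedding + embedding)


# Hypothetical model shapes: (number of layers, hidden width).
configs = {"tiny": (4, 256), "small": (12, 768), "large": (48, 6144)}

for name, (n_layer, d_model) in configs.items():
    share = embedding_share(n_layer, d_model)
    print(f"{name:>5}: embedding share of total parameters = {share:.1%}")
```

Under these assumptions the embedding matrix holds most of the parameters in the tiny model and only a small fraction in the large one, so whether it is counted matters far more at the small end of a fit.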

Context

It lands at a moment when frontier labs (OpenAI GPT-5.5, Anthropic Claude Opus 4.8, Google Gemini 3.5) have moved to MoE architectures and invest billions in training runs where the 'optimal model size vs. data' calculation determines the return on compute. At the same time the industry is hitting the 'data wall': training tokens are limited, and repeated tokens give diminishing returns. Weng publishes rarely, and this is her first long technical post in several months.

Why it matters

For ML engineers and researchers this will be a reference post cited in pretraining and compute-allocation debates. For decision makers at large labs the important takeaway is that scaling-law coefficients are surprisingly sensitive to procedural choices (how parameters are counted, rounding, precision), so billion-dollar bets built on them carry more uncertainty than intuition suggests.
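A small synthetic experiment shows how much a counting convention alone can move a fitted exponent. This is an illustrative sketch, not Weng's procedure: the losses are generated from an exact power law in total parameters with a hypothetical exponent of 0.30, the model shapes and 50,000-token vocabulary are made up, and the fit is a plain least-squares line in log-log space.

```python
import math


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


VOCAB = 50_000     # hypothetical vocabulary size
TRUE_ALPHA = 0.30  # hypothetical exponent used to generate the data

# Hypothetical small models: (number of layers, hidden width).
configs = [(4, 256), (6, 384), (8, 512), (12, 768), (24, 1024)]
non_emb = [12 * n_layer * d_model ** 2 for n_layer, d_model in configs]
total = [n + VOCAB * d_model for n, (_, d_model) in zip(non_emb, configs)]

# Synthetic losses that follow an exact power law in TOTAL parameters.
log_loss = [math.log(t ** -TRUE_ALPHA) for t in total]

# Fit the same losses against two different parameter counts.
alpha_total = -fit_slope([math.log(t) for t in total], log_loss)
alpha_non_emb = -fit_slope([math.log(n) for n in non_emb], log_loss)

print(f"exponent fitted on total parameters:         {alpha_total:.3f}")
print(f"exponent fitted on non-embedding parameters: {alpha_non_emb:.3f}")
```

The data are identical in both fits, yet the exponent recovered from non-embedding counts is noticeably smaller than the one used to generate the losses. Only the x-axis convention changed.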

Details / arguments

The Kaplan vs. Chinchilla gap is largely about whether embedding weights count in the parameter total
In today's data-constrained regime, repeated tokens yield diminishing returns
Larger models overfit more in the repeated-token regime, so naive extrapolation fails
Procedural choices (rounding, fitting precision) shift coefficients more than expected
The essay is a roughly 25-minute read and includes formal derivations
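The practical size of the Kaplan vs. Chinchilla gap from the first point can be sketched with a few lines of arithmetic. The exponents below come from the published papers, not from this summary: Kaplan et al. report compute-optimal model size growing roughly as C^0.73, and Hoffmann et al. (Chinchilla) roughly as C^0.5. The sketch also assumes the usual approximation C ≈ 6·N·D, so the model and data exponents sum to 1.

```python
def growth_factors(compute_multiplier, model_exponent):
    """Return (model_growth, data_growth) for a given increase in compute.

    Assumption: C is proportional to N * D, so if N scales as C**a
    then D scales as C**(1 - a).
    """
    data_exponent = 1.0 - model_exponent
    model_growth = compute_multiplier ** model_exponent
    data_growth = compute_multiplier ** data_exponent
    return model_growth, data_growth


# Approximate exponents reported in the two papers.
KAPLAN_EXPONENT = 0.73
CHINCHILLA_EXPONENT = 0.50

for name, exponent in [("Kaplan", KAPLAN_EXPONENT),
                       ("Chinchilla", CHINCHILLA_EXPONENT)]:
    model_x, data_x = growth_factors(100.0, exponent)
    print(f"{name:>10}: 100x compute -> parameters x{model_x:.1f}, "
          f"tokens x{data_x:.1f}")
```

With 100 times more compute, the Kaplan-style rule puts most of the increase into model size, while the Chinchilla-style rule grows parameters and tokens by the same factor. That is the "model-heavy vs. balanced" split the essay sets out to explain.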
