AI Scaling Laws Guide Billions in Compute Spend: Weng Reveals the Cracks

The empirical foundation guiding hundreds of billions of dollars in AI infrastructure investment received a forensic reexamination on June 24, when Lilian Weng, co-founder of Thinking Machines Lab and former VP of Research and Safety at OpenAI, published a long-form technical deep dive titled "Scaling Laws, Carefully" on her widely read Lil'Log blog. The post, roughly 25 minutes of dense reading, resolves one of the most consequential methodological disputes in modern AI research: why Kaplan et al. (2020) and the Chinchilla paper (2022) arrived at starkly different prescriptions for allocating compute when training large language models. It also surfaces a deeper problem: the power-law fitting methods that produced both prescriptions are more sensitive to small implementation choices than practitioners typically acknowledge.

For any organization making billion-dollar hardware and training decisions based on AI scaling law extrapolations, Weng's conclusion amounts to a caution: those clean lines on a log-log plot carry more uncertainty than they appear to.

AI scaling laws describe a deceptively simple relationship: training loss decreases predictably as a power law when model size, training dataset size, and total compute are scaled up. A researcher who fits this relationship on a handful of small runs can, in principle, extrapolate to estimate the token and compute requirements for a run many orders of magnitude larger — before spending the money to run it.
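The fit-then-extrapolate workflow described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: the coefficients `TRUE_A` and `TRUE_B`, the compute budgets, and the helper `fit_power_law` are all hypothetical, and the sketch omits the irreducible-loss term that real fits usually include.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

# Synthetic "small runs": losses generated from a known power law.
# TRUE_A and TRUE_B are hypothetical values chosen only for illustration.
TRUE_A, TRUE_B = 20.0, 0.05
small_compute = [1e17, 1e18, 1e19, 1e20]  # training FLOPs
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate four orders of magnitude beyond the largest fitted run.
predicted_loss = a * (1e24) ** (-b)
print(f"fitted exponent b = {b:.4f}, predicted loss at 1e24 FLOPs = {predicted_loss:.3f}")
```

With noiseless synthetic data the fit recovers the generating exponent almost exactly; the difficulty in practice is that real runs are noisy and the extrapolation distance is large.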

The practical power of this predictability is enormous. It is the reason major AI labs can commit to training runs costing tens or hundreds of millions of dollars without first testing every configuration empirically at full scale. It is also the reason that Kaplan et al.'s 2020 paper, which found that model size should grow considerably faster than dataset size as compute scales, became the operating assumption behind a generation of frontier models that were, as Chinchilla would later demonstrate, badly undertrained on data.

The core of Weng's post is a careful reconstruction of why Kaplan et al. (2020) and Hoffmann et al. (2022) — the Chinchilla paper — arrived at such different numbers.

Kaplan et al. concluded that for every 10x increase in compute, researchers should scale model parameters by roughly 5.5x while increasing training tokens by only about 1.8x. The Chinchilla team, working at more than ten times the scale of Kaplan's experiments, found that model size and training tokens should grow in roughly equal proportion. They demonstrated the conclusion concretely by training a 70-billion-parameter model on 1.4 trillion tokens, with one quarter the parameters of DeepMind's Gopher but roughly four times the data, and watching it outperform the larger model across the board.
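The multipliers quoted above translate directly into power-law exponents, which is how the two prescriptions are usually compared. The arithmetic below uses only figures from the text, plus one labeled assumption: the common rule of thumb that training compute is about 6 × parameters × tokens. The helper name `exponent_from_multiplier` is illustrative.

```python
import math

def exponent_from_multiplier(multiplier, compute_factor=10.0):
    """Exponent p such that a quantity grows as compute**p,
    given how much it grows when compute grows by compute_factor."""
    return math.log(multiplier) / math.log(compute_factor)

# Kaplan et al.: per 10x compute, parameters x5.5 and tokens x1.8.
kaplan_param_exp = exponent_from_multiplier(5.5)  # about 0.74
kaplan_token_exp = exponent_from_multiplier(1.8)  # about 0.26

# "Equal proportion" means both exponents are 0.5:
# parameters and tokens each grow by sqrt(10) per 10x compute.
chinchilla_multiplier = 10 ** 0.5  # about 3.16

# Chinchilla's headline configuration from the text.
params, tokens = 70e9, 1.4e12
tokens_per_param = tokens / params

# Assumption: the widely used estimate C = 6 * N * D for training FLOPs.
approx_flops = 6 * params * tokens  # about 5.9e23

print(f"Kaplan exponents: params {kaplan_param_exp:.2f}, tokens {kaplan_token_exp:.2f}")
print(f"Chinchilla tokens per parameter: {tokens_per_param:.0f}")
```

The two Kaplan exponents sum to roughly 1, as they must if compute is proportional to parameters times tokens, and the 70B/1.4T configuration is where the 20:1 token-to-parameter figure comes from.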

Weng traces the disagreement to two concrete, verifiable sources. First, Kaplan et al. conducted their experiments largely on models well below 1.5 billion non-embedding parameters. Extrapolating in log-log space from that regime to frontier scales amplifies any error in the fitted exponent. Second, the two teams measured model size differently: Kaplan excluded embedding parameters, which represent a non-negligible fraction of total parameters in smaller models. Drawing on analysis by Pearce and Song (2024), Weng shows that when you apply an embedding correction, Kaplan's implied optimal scaling exponent converges toward Chinchilla's figure as model size grows. The two papers were, in a meaningful sense, both right about their respective size regimes.
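To see why the embedding convention matters only at small scale, consider a rough parameter count. The sketch assumes the approximation from Kaplan et al. that a transformer has about 12 × n_layer × d_model² non-embedding parameters, counts only the token-embedding matrix (vocabulary × d_model), and uses a GPT-2-sized vocabulary. The four model shapes are hypothetical and do not correspond to any specific published model.

```python
VOCAB_SIZE = 50257  # assumed GPT-2-style vocabulary

def embedding_fraction(n_layer, d_model, vocab_size=VOCAB_SIZE):
    """Share of total parameters sitting in the token-embedding matrix."""
    non_embedding = 12 * n_layer * d_model ** 2  # Kaplan-style approximation
    embedding = vocab_size * d_model             # positional embeddings ignored
    return embedding / (embedding + non_embedding)

# Hypothetical (n_layer, d_model) shapes, from tiny to large.
shapes = [(6, 512), (12, 768), (24, 2048), (48, 8192)]
fractions = [embedding_fraction(n_layer, d_model) for n_layer, d_model in shapes]

for (n_layer, d_model), frac in zip(shapes, fractions):
    print(f"n_layer={n_layer:>2}, d_model={d_model:>4}: embeddings are {frac:.1%} of parameters")
```

Under these assumptions the embedding share falls from more than half of all parameters in the smallest shape to about one percent in the largest, which is consistent with the two counting conventions diverging in Kaplan's regime and agreeing at Chinchilla's.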

Weng also addresses a 2024 discovery that attracted significant attention among scaling law researchers: Besiroglu et al. at Epoch AI found two concrete problems in Chinchilla's Method 3, the parametric fitting branch. The Chinchilla team averaged Huber loss values over training runs instead of summing them, which caused the L-BFGS-B optimizer to terminate prematurely. A related rounding issue (reporting exponents to two decimal places rather than at full precision) distorted the reported fit, and the published confidence intervals were so narrow that, as Besiroglu et al. calculated, achieving them legitimately would have required over 600,000 experiments; Chinchilla likely ran fewer than 500.
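The sum-versus-mean issue is easy to reproduce in miniature. The toy below is not Chinchilla's fitting code and does not call L-BFGS-B; it only shows how dividing an objective by the number of data points shrinks its gradient, so a fixed absolute tolerance can declare convergence far from the optimum. The data, the Huber `DELTA`, the tolerance `GTOL`, and the candidate value `mu` are all assumptions chosen for illustration.

```python
def huber_grad(residual, delta):
    """Derivative of the Huber loss with respect to the residual (a clipped residual)."""
    return max(-delta, min(delta, residual))

def objective_grad(mu, ys, delta, reduce):
    """Gradient with respect to mu of the Huber loss of (y - mu), summed or averaged."""
    total = sum(-huber_grad(y - mu, delta) for y in ys)
    return total if reduce == "sum" else total / len(ys)

# 400 synthetic observations; the Huber-optimal mu lies very close to 0.
ys = [0.0] * 201 + [1.0] * 199
DELTA = 1e-3  # assumed Huber threshold
GTOL = 1e-5   # assumed absolute gradient tolerance for "converged"
mu = 0.3      # a candidate estimate that is clearly not optimal

g_sum = objective_grad(mu, ys, DELTA, "sum")
g_mean = objective_grad(mu, ys, DELTA, "mean")

stops_with_sum = abs(g_sum) <= GTOL    # summed objective: keep optimizing
stops_with_mean = abs(g_mean) <= GTOL  # averaged objective: declared converged

print(f"sum gradient  = {g_sum:.2e}, stop = {stops_with_sum}")
print(f"mean gradient = {g_mean:.2e}, stop = {stops_with_mean}")
```

The two objectives have the same minimizer, so the choice looks cosmetic. It is the interaction with a fixed stopping threshold that lets the averaged version halt early.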

Weng's position is that these bugs matter for understanding how the fits were produced — but they do not invalidate Chinchilla's central conclusion. The three independent methods used in the paper (fixing model sizes while varying token budgets, isoFLOP profiling, and parametric fitting) all pointed to the same compute-optimal frontier. The bugs primarily affected the internal consistency of the parametric branch, not the consensus across all three. Practitioners should understand the bugs and their source; they should not conclude that the 20:1 token-to-parameter guidance is wrong.

Perhaps the most immediately actionable section of Weng's post concerns the data wall — the approaching exhaustion of unique, high-quality text available for pretraining. Epoch AI has projected that the stock of quality-filtered public text will be fully utilized somewhere between 2026 and 2032 under current training rates, with frontier labs already facing significant constraints on unique token budgets during the current window.

When unique tokens run short, labs must train on repeated data — and the standard Chinchilla framework, built on the assumption of infinite unique data, does not account for this. Weng reviews the corrections. Muennighoff et al. (2023) showed that repeated tokens do not simply dilute training efficiency linearly — their value decays exponentially with each repetition, with a learnable half-life parameter that determines how quickly each additional pass over the same data loses returns.
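The decaying value of repeated data can be made concrete with a saturating-exponential form of the kind Muennighoff et al. describe, in which each additional pass contributes less "effective" data than the one before. The sketch below is an illustration under assumptions: the function name is hypothetical, and the decay constant `R_STAR = 15.0` is a placeholder rather than a fitted value.

```python
import math

R_STAR = 15.0  # assumed decay constant, measured in repeated epochs

def effective_tokens(unique_tokens, epochs, r_star=R_STAR):
    """Effective unique-token equivalent of training for `epochs` passes.

    The first pass counts in full; the value of later passes decays exponentially.
    """
    repeats = max(epochs - 1, 0)
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

unique = 1.0  # normalize the unique-token budget to 1
for epochs in (1, 2, 4, 16, 40):
    eff = effective_tokens(unique, epochs)
    print(f"{epochs:>2} epochs -> worth {eff:.2f}x the unique data (naive count: {epochs}x)")
```

Under this form a handful of epochs is worth nearly as much as fresh data, while the total value of repetition is capped at `1 + R_STAR` times the unique budget no matter how many passes are run.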

Lovelace et al. (2026) pushed the modeling further by introducing an explicit overfitting penalty built around the capacity ratio of model parameters to unique training tokens. Their empirical finding: excess parameters — model size beyond what the available unique token budget can support — lose value faster than repeated data does. The practical implication is that under data constraints, labs should favor more training passes over the same dataset rather than scaling model size further. One meaningful mitigation: stronger weight decay significantly reduces the penalty incurred from data repetition, a lever that may see increasing use as unique token budgets tighten.

The most structurally important section of Weng's post may be the one that receives the least attention in casual summaries: her candid accounting of how sensitive power-law fits are to choices that appear trivial. Whether losses are summed or averaged before fitting, how many decimal places of precision are retained in reported exponents, and which size regime anchors the fit — all of these affect the implied optimal compute allocation substantially when the fitted relationship is extrapolated across orders of magnitude.

Weng includes a toy simulation illustrating how small perturbations in any of these dimensions shift the implied allocation by amounts that, at frontier scale, represent the difference between an appropriately sized model and a significantly mis-sized one. The implication is pointed: any organization allocating compute and hardware investment based on scaling law extrapolations is operating with more uncertainty than the clean appearance of a log-log plot suggests.
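A back-of-the-envelope version of that sensitivity, which is not Weng's simulation, follows from the power-law form itself. If the compute-optimal model size scales as compute raised to some exponent, two fits whose exponents differ by a small amount disagree by a factor of 10 raised to (difference × orders of magnitude extrapolated). The helper name and the grid of values below are illustrative assumptions.

```python
def divergence_factor(exponent_shift, orders_of_magnitude):
    """Ratio between two N_opt = k * C**a extrapolations whose exponents
    differ by exponent_shift, after extrapolating the given number of decades."""
    return 10.0 ** (exponent_shift * orders_of_magnitude)

table = {
    (shift, oom): divergence_factor(shift, oom)
    for shift in (0.01, 0.03, 0.05)
    for oom in (2, 4, 6)
}

for (shift, oom), factor in sorted(table.items()):
    print(f"exponent shift {shift:.2f} over {oom} orders of magnitude: x{factor:.2f} in model size")
```

An exponent shift of 0.05, which is easy to produce with a different fitting choice, roughly doubles the implied model size when extrapolated across six orders of magnitude of compute.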

The technical literature rarely foregrounds this implication: the confidence intervals reported in scaling law papers, already known to be too narrow in Chinchilla's Method 3, may not capture the true range of outcomes when those fits are applied to runs ten or a hundred times larger than the experiments used to produce them. The lines look definitive. The fits that produced them are not.

Why did Kaplan et al. and Chinchilla reach such different conclusions about how to allocate training compute?

Two methodological differences account for most of the disagreement. Kaplan's experiments used models well below 1.5 billion non-embedding parameters, a regime where the fitted scaling exponent is inflated by the exclusion of embedding parameters. Chinchilla ran at more than ten times that scale and counted all parameters including embeddings. When Pearce and Song (2024) applied an embedding correction to Kaplan's framework, the implied optimal exponent converged toward Chinchilla's. The disagreement was partly a product of measuring model size differently in a regime where the measurement choice matters.

What is the AI training data wall, and when does it hit?

The data wall refers to the approaching exhaustion of unique, high-quality human-generated text available for pretraining large language models. Epoch AI has projected that the stock of quality-filtered public text will be fully utilized somewhere between 2026 and 2032 under current training rates, with frontier labs already facing constraints on unique token budgets. Once unique tokens run short, models must train on repeated data — and the value of each repeated token decays exponentially, requiring revised scaling formulas that account for the growing overfitting penalty from data repetition.

Why do the bugs found in Chinchilla's code matter, even if the core conclusion survived?

The bugs — averaging rather than summing Huber loss values, and two-decimal rounding of exponents — caused the optimizer to terminate prematurely and produced implausibly narrow confidence intervals. Their significance is methodological: they demonstrate that the parametric fitting pipeline used to produce scaling law coefficients is sensitive to implementation choices that practitioners rarely examine. Even when the conclusion is robust across three independent methods, as Chinchilla's was, the fitting machinery that generated the specific numerical coefficients can be quietly wrong in ways that are hard to detect without a replication attempt.

What can AI researchers do differently based on Weng's analysis?

Practitioners should apply stronger weight decay when training on repeated data, prefer runs that add training passes over the same dataset rather than scaling model size further when unique token budgets are constrained, and treat power-law extrapolations as carrying wider uncertainty than their confidence intervals typically suggest. When fitting scaling laws, losses should be summed rather than averaged before optimization, exponents should be retained at full precision, and the size regime used to anchor the fit should be as close as possible to the target deployment scale.
