New method tracks how AI models actually make predictions after scaling

AI researcher Patrick O’Donnell has introduced “landed writes,” a new method for understanding how large language models make predictions by tracking how internal components actually influence outputs after normalization scaling. The approach addresses a critical gap in current AI interpretability tools, which measure what model components intend to write rather than what actually affects the final answer after the model’s internal scaling processes.

The core problem: Most AI interpretability tools completely miss how transformer models internally reshape component contributions through RMSNorm scaling, which can amplify early-layer writes by up to 176× while compressing late-layer contributions.

  • When a neuron writes +0.001 in attention layer 2 of Mistral 7B, it gets amplified to +0.176, while the same write in layer 31 becomes only +0.0058, a roughly 30× difference in actual impact (the worked equations after this list show where these gains come from).
  • Traditional attribution methods track pre-normalization values, essentially measuring “what someone intended to say rather than what was actually heard.”
  • Experiments show 98.8% of all writes get amplified by more than 2×, with 81.2% amplified by more than 10×.
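
To see where amplification of this size comes from, recall how RMSNorm rescales the residual stream: each coordinate is divided by the stream's root-mean-square norm σ and multiplied by a learned weight γ_i, so the effective per-coordinate gain is γ_i / σ. The worked equations below re-derive the 176× figure; the σ and γ values here are illustrative choices, not measurements from Mistral 7B.

```latex
% Standard RMSNorm over a d-dimensional residual stream x:
\[
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sigma}\,\gamma_i,
\qquad
\sigma = \sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \varepsilon}
\]

% A component write w_i added to the residual stream is therefore heard downstream as
\[
\frac{w_i}{\sigma}\,\gamma_i
\]

% Illustrative early-layer case (values chosen to reproduce the reported gain):
\[
\gamma_i \approx 1,\quad \sigma \approx \tfrac{1}{176}
\;\Longrightarrow\;
\frac{0.001}{\sigma}\,\gamma_i \approx 0.001 \times 176 = 0.176
\]
```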

How landed writes work: The method tracks exactly what model components contribute to the residual stream after RMSNorm scaling, using the formula landed_i = (write_i / σ) · γ_i, where σ is the residual stream's root-mean-square norm and γ_i is the layer's learned RMSNorm weight for coordinate i.

  • Instead of attributing behavior to unscaled writes, landed writes measure the scaled values that actually influence token selection.
  • The approach requires just one or two forward passes with strategic hooks to capture pre-normalization writes and track how RMSNorm’s scaling factors reshape them.
  • A practical example: Two neurons initially write +0.0006 and +0.0004, but after 176× scaling in layer 2 their landed contributions become +0.1060 and +0.0700 (the code sketch after this list reproduces this kind of calculation with toy values).
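
Here is a minimal PyTorch sketch of that formula. The function, variable names, and toy numbers are illustrative assumptions for exposition, not code or values taken from O'Donnell's repository.

```python
import torch

def landed_write(write: torch.Tensor, residual: torch.Tensor,
                 gamma: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Return what a component's write contributes after RMSNorm scaling.

    write    -- the component's raw addition to the residual stream, shape (d,)
    residual -- the full residual-stream vector the norm actually sees, shape (d,)
    gamma    -- the layer's RMSNorm weight vector, shape (d,)
    """
    # sigma is the RMS of the whole residual stream, not of the write alone
    sigma = residual.pow(2).mean().add(eps).sqrt()
    # landed_i = (write_i / sigma) * gamma_i
    return (write / sigma) * gamma

# Toy usage mirroring the early-layer regime: a small residual RMS yields a
# large per-coordinate gain (about 173x here), so tiny writes land much larger.
d = 4096
residual = torch.full((d,), 0.0057)
gamma = torch.ones(d)
writes = torch.zeros(d)
writes[0], writes[1] = 0.0006, 0.0004
print(landed_write(writes, residual, gamma)[:2])  # roughly 0.104 and 0.069
```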

In plain English: Think of this like a sound system where different musicians (model components) play their instruments at various volumes, but the final mix that reaches your ears gets adjusted by an equalizer that boosts some frequencies and dampens others. Traditional AI analysis tools measure how loud each musician intended to play, but landed writes measure what you actually hear after the equalizer does its work—which can be dramatically different.

Key findings from experiments: Testing on models like LLaMA-3.1-8B and Mistral-7B revealed surprising patterns in how transformers actually process information.

  • Extreme sparsity: Logit predictions often rely heavily on just 11-90 coordinates out of thousands of available dimensions (one way to measure this is sketched after this list).
  • Systematic scaling effects: Early layers (0-3) show massive amplification, middle layers show moderate scaling (2-20×), and late layers show compression or near-unity scaling.
  • Stability across prompts: Each coordinate’s scaling factor remains consistent across different inputs, making landed writes predictable.
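
One way to operationalize the sparsity finding is to project each landed coordinate onto the unembedding column of the predicted token and count how many coordinates carry most of that logit's absolute mass. The sketch below does exactly that; it is an assumed metric for illustration, not necessarily the one used in the experiments, and `landed` / `unembed_col` are placeholder inputs.

```python
import torch

def coords_for_coverage(landed: torch.Tensor, unembed_col: torch.Tensor,
                        coverage: float = 0.9) -> int:
    """Count residual coordinates carrying `coverage` of a logit's absolute mass.

    landed      -- per-coordinate landed contributions at the final position, shape (d,)
    unembed_col -- unembedding-matrix column for the predicted token, shape (d,)
    """
    contribs = landed * unembed_col                 # per-coordinate logit contribution
    mass, _ = contribs.abs().sort(descending=True)  # biggest contributors first
    cumulative = torch.cumsum(mass, dim=0)
    # smallest k whose cumulative absolute mass reaches the coverage threshold
    k = int((cumulative >= coverage * cumulative[-1]).nonzero()[0]) + 1
    return k
```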

Why this matters: The research reveals that current interpretability approaches may be fundamentally misdirected by ignoring how models internally reshape component contributions.

  • O’Donnell argues this is “like measuring what someone intended to say rather than what was actually heard,” potentially missing crucial aspects of how AI systems make decisions.
  • The method offers a “mechanically faithful low cost measure of logit selection” that could complement existing interpretability tools.
  • Understanding actual vs. intended contributions could be crucial for AI safety research, as it reveals which components truly drive model outputs.

Practical benefits: The landed writes approach offers several advantages over existing interpretability methods.

  • Accuracy: Exactly tracks how models internally reshape contributions rather than relying on pre-scaling approximations.
  • Efficiency: Requires only one memory-intensive forward pass or two light ones, with no gradient tracking or SAE training needed.
  • Simplicity: Easy to implement with minimal overhead using forward hooks in existing frameworks (see the hook sketch after this list).
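
As an illustration of the hook-based setup, the sketch below records each block's pre-normalization write on a Hugging Face-style decoder. The attribute paths (`model.model.layers[i].self_attn` and `.mlp`) follow the transformers LLaMA/Mistral layout but vary across implementations; `model` and `inputs` are assumed to be an already loaded model and tokenized batch, and combining these captures with each layer's σ and γ to produce landed values is left out for brevity.

```python
import torch

records = {}

def capture(name):
    def hook(module, module_inputs, output):
        # Attention modules in these decoder layers return a tuple whose first
        # element is the block's write to the residual stream; MLP modules
        # return the write tensor directly.
        write = output[0] if isinstance(output, tuple) else output
        records[name] = write.detach()
    return hook

handles = []
for i, layer in enumerate(model.model.layers):  # HF LLaMA/Mistral-style path
    handles.append(layer.self_attn.register_forward_hook(capture(f"attn_{i}")))
    handles.append(layer.mlp.register_forward_hook(capture(f"mlp_{i}")))

with torch.no_grad():                           # no gradient tracking needed
    model(**inputs)

for h in handles:                               # always remove hooks afterwards
    h.remove()
```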

Next steps and implications: O’Donnell suggests several research directions that could leverage landed writes to better understand transformer behavior.

  • Training Sparse Autoencoders (SAEs) on landed writes only might reveal whether this approach captures the most important model computations.
  • Tracking landed writes could help measure actual computational work per layer and detect strategic behaviors like “sigma gaming.”
  • The method might help identify primitive operations in transformers by focusing on atomic units of computation.

What they’re saying: O’Donnell positions landed writes as a complement to existing tools rather than a replacement.

  • “There’s nothing wrong with current tools as the new approach I am suggesting is causally naive. It is a literal tracking of who wrote what and so doesn’t reveal the why.”
  • “We might be only a few steps away from finding the primitive ops of a transformer as we can now focus on the atomic units of computation, landed writes.”
  • The researcher notes the work “costed about $10 of gpu time so you know I’m pragmatic at least.”

Code availability: Full experimental code and results are available at https://github.com/patrickod32/landed_writes/, with O’Donnell seeking endorsement to publish the full paper on arXiv and looking for research lab positions to continue interpretability work.

