Tags: Grokking · Phase Transitions · Neural Networks · LLM Training · AI Engineering

Grokking and the Critical Point: When Neural Networks Cross the Phase Boundary

The Experiment That Named a Phenomenon

In 2022, Alethea Power and colleagues at OpenAI published a paper with an unusual title: "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." The experimental setup was deliberately minimal: a small transformer trained on a single task, modular arithmetic — computing expressions of the form (a + b) mod p, where p is a prime number.
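For concreteness, here is a minimal sketch of how such a dataset can be constructed. The prime and the train/validation split fraction below are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a modular-addition dataset of the kind used in grokking
# experiments. Prime and split fraction are illustrative, not the
# paper's exact settings.
import random

p = 97  # a small prime; the task is to predict (a + b) mod p

# Every ordered pair (a, b) is one example; the label is the sum mod p.
examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

# Grokking experiments train on a fraction of all pairs and validate
# on the held-out remainder.
random.seed(0)
random.shuffle(examples)
split = len(examples) // 2
train_set, val_set = examples[:split], examples[split:]
```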

What made the experiment remarkable was the training curve. The model memorized the training set quickly — training accuracy climbed toward 100%. But validation accuracy, the measure of genuine generalization, stayed near zero. The model had learned to recognize specific examples without understanding the underlying rule.

The researchers continued training. Step after step, thousands of steps, the metrics didn't move.

Then, at roughly 60% of the total training budget, validation accuracy jumped from near zero to near 100% in a relatively short window. Not gradually. At a threshold. The same model that had been a memorizer became a generalizer.

  • ~60% of training budget: when the grokking transition occurred in the original experiment
  • ~0% pre-transition validation accuracy: the model performed as a pure memorizer
  • ~100% post-transition validation accuracy: generalization emerged sharply, not gradually

The researchers named this grokking because the model appeared to have suddenly understood something it had previously only mimicked. The phenomenon has since been confirmed across multiple architectures and tasks.


The Ising Model and the Nature of Critical Points

The Ising model is one of the most studied systems in theoretical physics — a model of ferromagnetism analyzed exhaustively since 1925. The setup: imagine a lattice of atoms, each with a "spin" pointing up or down. Neighboring spins prefer to align.

Regime | Dominant Force | State
High temperature | Thermal fluctuations | Disordered — no net magnetization
Critical temperature Tc | Poised at the boundary | Scale-invariant — correlations at all length scales
Low temperature | Spin interactions | Ordered — strong magnetization

Between these regimes is a critical temperature Tc. At Tc, the system is poised between order and disorder, and correlations extend to all length scales. The transition is a second-order phase transition: the order parameter (the net magnetization) rises continuously from zero as the system cools through Tc, while response quantities such as the susceptibility diverge at the critical point.
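In symbols, the textbook nearest-neighbor form (included here for orientation; standard physics, not specific to this post):

```latex
% Nearest-neighbor Ising Hamiltonian; J > 0 makes aligned spins
% (s_i = ±1) lower in energy.
H = -J \sum_{\langle i,j \rangle} s_i s_j

% Order parameter: mean magnetization per spin.
m = \frac{1}{N} \sum_{i=1}^{N} s_i

% Second-order behavior: m rises continuously from zero below T_c,
% m \sim (T_c - T)^{\beta} for T slightly below T_c.
```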

The reason this matters for neural networks is structural, not metaphorical. A neural network during training is a dynamical system evolving through a high-dimensional landscape. Below a threshold, the network settles into configurations that represent training examples locally, without global structure — the memorization states.

Grokking is what happens when the network crosses its critical point and transitions to a qualitatively different configuration: one with global structure that captures the rule rather than the examples. Below the threshold, a memorizer. Above it, a generalizer. The transition is phase-transition-shaped.


The Weight Norm Finding and Latent Heat

The most interesting empirical finding: in networks that grok, the effective complexity — measured by the L2 norm of the weights — decreases in the interval before generalization emerges.

This is counterintuitive. You might expect complexity to increase as the network learns more. Instead, the network compresses — weights become smaller and more regular — before suddenly generalizing.
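Tracking this signal is cheap. A minimal sketch in PyTorch, assuming `model` is whatever network you are training (nothing here is specific to a particular architecture):

```python
import torch

def weight_norm(model: torch.nn.Module) -> float:
    """Global L2 norm across all trainable parameters."""
    total = sum(p.detach().pow(2).sum()
                for p in model.parameters() if p.requires_grad)
    return float(torch.sqrt(total))

# Logged alongside validation accuracy, a falling weight norm under a
# flat validation curve is the compression-before-generalization signature.
```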

The physical analogue is latent heat. In a first-order phase transition (ice → water), there's an interval where the system absorbs energy without changing temperature. The energy goes entirely into the work of the phase change itself. The temperature gauge flatlines. From the outside, it looks like nothing is happening.

Physical Latent Heat (Ice → Water)

System absorbs energy without changing temperature. External gauge flatlines. Internal structure is undergoing the phase change.

Grokking Latent Interval

Training loss continues declining while validation accuracy stays flat. Network is shedding memorization-complexity and building generalizing structure internally.

The implication for practitioners is direct: the loss curve is not the same thing as the learning curve. A model whose training loss is decreasing may be in a latent heat interval, internally reorganizing in ways your external metrics don't capture. Stopping training here — because the validation curve looks flat — would cut off a model mid-transition.


The Renormalization Group and Scale-Invariant Representations

The renormalization group is a mathematical framework developed in the 1970s, largely by Kenneth Wilson (Nobel Prize, 1982), for understanding behavior at and near critical points.

The core insight: at criticality, the system's behavior is the same at all scales. Small fluctuations look like large fluctuations, scaled down. This scale-invariance is exact at the critical point: if you systematically average over small-scale details (the coarse-graining step of the renormalization group), a critical system maps back onto itself, and you get the same theory at every scale.
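The quantitative signature of this scale-invariance is a power-law correlation function (a standard critical-phenomena result, stated here for orientation):

```latex
% Off criticality, correlations decay exponentially with a finite
% correlation length \xi:
\langle s_0 s_r \rangle \sim e^{-r/\xi}

% At T_c, \xi diverges and the decay becomes a pure power law,
% the same functional form at every scale (d = dimension,
% \eta = a critical exponent):
\langle s_0 s_r \rangle \sim r^{-(d - 2 + \eta)}
```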

For neural networks, the generalization that grokking represents has a similar structure. A memorizing network has learned representations specific to the training set. A generalizing network has learned something scale-invariant — a rule that applies to inputs not in the training set, to variations not explicitly trained on.

The generalizing representation is, in this sense, the renormalization-group fixed point of the learning problem. This is not merely analogy — the mathematical structure of scale-invariance in neural network representations is an active research area with formal results.


The Production Problem

All of this would be academic if it didn't have direct implications for deployment. It does.

The central problem: if a model is deployed before it has crossed its generalization phase boundary — if it's still in the memorization regime — it is a fundamentally different system than a model that has crossed. The memorizer and generalizer have identical architectures, identical training data. Their training loss may be similar. Their in-distribution evaluation performance may be identical.

But their behavior on out-of-distribution inputs will be categorically different.

This creates a specific production pathology: the model that passes your evaluation benchmark may be a memorizer. If your benchmark draws from the same distribution as training, a memorizer will pass it. The benchmark measures what you can easily test — not what you care about in production.

The correct response: design evaluation that is explicitly out-of-distribution — examples sharing the structure of the training task but not its specific surface features.
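One way to make that operational is to measure the gap between matched in-distribution and out-of-distribution evaluation sets, rather than either number alone. A minimal sketch; the function names and the eval sets themselves are placeholders for your own harness:

```python
# Hypothetical diagnostic: the same accuracy metric on matched
# in-distribution (ID) and out-of-distribution (OOD) eval sets.
from typing import Any, Callable, Iterable, Tuple

def accuracy(predict: Callable[[Any], Any],
             examples: Iterable[Tuple[Any, Any]]) -> float:
    examples = list(examples)
    return sum(predict(x) == y for x, y in examples) / len(examples)

def generalization_gap(predict, id_eval, ood_eval) -> float:
    # A memorizer: high ID accuracy, near-chance OOD accuracy (large gap).
    # A generalizer: both high (gap near zero). The gap is the diagnostic.
    return accuracy(predict, id_eval) - accuracy(predict, ood_eval)
```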

At hireEZ, building evaluation protocols that probe out-of-distribution behavior was one of our most consequential investments. Some models we were confident about turned out to be memorizers. Some that looked marginal on in-distribution benchmarks generalized robustly.


What You Cannot Know Before the Transition

There is a harder implication. One of the fundamental results about phase transitions: you cannot predict post-transition behavior from pre-transition measurements. The disordered state doesn't tell you what direction the ordered state will magnetize in. The outcome is determined by fluctuations at the critical point, not the pre-transition state.

For neural network training: you cannot reliably predict from a model's behavior before grokking what its behavior will be after. Extrapolating capability from pre-transition to post-transition performance is physically unjustified.

This is uncomfortable because most evaluation infrastructure assumes today's behavior predicts tomorrow's. This assumption is reasonable for stable systems. It is not reasonable for systems near a phase transition.

Practical consequences:

  • Benchmarks at one training point don't reliably predict later performance, especially across a phase boundary
  • Training runs cut short may stop during a latent heat interval — apparently making no progress, though the model would have grokked shortly after
  • The deployed model may be the memorizer, while the same model with 20% more training budget would have been the generalizer
  • The performance difference in production is not 20% — it is categorical

Toward More Honest Evaluation

The response: build evaluation pipelines that measure what grokking theory says matters — out-of-distribution generalization, explicitly tested, at multiple points during training.

Concretely (a code sketch follows the list):

  1. Run your evaluation suite on models saved at regular intervals through training, not just at the end.
  2. Plot the trajectory. If validation accuracy is flat for a long stretch then jumps, you've observed grokking — and the model is post-transition.
  3. Track the L2 norm of weights over training. A decrease coinciding with a validation accuracy spike is a structural signature of grokking.
  4. If the weight norm is still high and validation accuracy still low, the model may be approaching but not yet past the critical point.
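A minimal sketch of what that audit might look like over saved checkpoints, assuming PyTorch state dicts saved at known steps; `make_model`, `evaluate`, and the checkpoint paths are placeholders for your own training setup:

```python
import torch

def audit(checkpoints, make_model, evaluate):
    """checkpoints: list of (step, path) pairs, in training order.
    Returns (step, validation accuracy, global L2 weight norm) per point."""
    trajectory = []
    for step, path in checkpoints:
        model = make_model()
        model.load_state_dict(torch.load(path, map_location="cpu"))
        norm = float(torch.sqrt(sum(p.detach().pow(2).sum()
                                    for p in model.parameters())))
        trajectory.append((step, evaluate(model), norm))
    # A long flat accuracy stretch followed by a jump that coincides with
    # a falling weight norm is the grokking signature (items 2 and 3 above).
    return trajectory
```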

// key takeaway

None of this makes the problem fully tractable. Phase transitions are intrinsically hard to predict. But knowing that learning is phase-transition-shaped, not slope-shaped, changes the questions you ask, the measurements you take, and what "the model is doing well" actually means.