The Physics of the Critical Point
A nuclear chain reaction is a precisely structured process. A neutron collides with a fissile nucleus — uranium-235 or plutonium-239. The nucleus absorbs the neutron, becomes unstable, and splits. The fission event releases energy and produces additional neutrons — typically two or three. These neutrons can collide with additional fissile nuclei, producing more fission events, releasing more energy, releasing more neutrons.
Whether this cascade sustains depends on a single dimensionless parameter: the neutron multiplication factor k, defined as the average number of fission events caused by the neutrons produced in a single fission event.
| k value | State | Behavior |
|---|---|---|
| k < 1 | Subcritical | Reaction fizzles and dies out |
| k = 1 | Critical | Each event causes exactly one subsequent event — sustained at constant rate |
| k > 1 | Supercritical | Reaction grows exponentially |
What determines k? The geometry of the fissile material, the density of fissile nuclei, the presence of neutron absorbers (control rods), and reflective properties of surrounding material. Small changes — a few grams of additional material, a slight geometry change, a partial withdrawal of control rods — can move k from subcritical to supercritical. The threshold is real, and it is sensitive.
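The threshold behavior is visible in the arithmetic itself. A minimal sketch, assuming each event triggers an average of k further events, so the expected cumulative count over g generations is the geometric sum 1 + k + k² + … + k^g:

```python
def expected_events(k: float, generations: int) -> float:
    """Expected cumulative event count when each event triggers k more on average."""
    return sum(k ** g for g in range(generations + 1))

for k in (0.9, 1.0, 1.1):
    print(f"k={k}: {expected_events(k, 50):.1f} expected events over 50 generations")
```

At k = 0.9 the sum converges toward 10 total events no matter how long you wait; at k = 1.1 the same 50 generations produce over a thousand. A 20% change in k is the difference between a fizzle and an explosion.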
I have been running multi-agent AI systems in production long enough to observe that agent systems have exactly this structure. There is a critical point. Below it, the system is manageable. Above it, it is not. The transition is not gradual.
Agent Composition and the Multiplication Factor
The structural parallel to nuclear criticality in multi-agent AI systems is not obvious until you look at the right quantity. The relevant analogue of k is: the average number of additional agent invocations caused by a single agent invocation.
A single agent with a fixed tool set produces a response that triggers zero additional agent calls: definitionally subcritical. Add a second agent that can call the first, and k rises but stays bounded, because the call structure is explicit. Add a third that can call either of the first two, which can in turn call each other, and you've introduced the possibility of call chains whose length and branching are determined by learned behaviors, not static program logic.
The result: a cascade of agent calls, a rapidly growing context window, a response that takes ten times as long as expected, and output that reflects the confusion of an overcomplicated reasoning chain.
The Three Threshold Types We Found
Through production operation of the hireEZ agentic pipeline, I've identified three distinct threshold types.
1. The Agent Composition Threshold
A single agent behaving predictably in isolation doesn't guarantee that two compose predictably. Two composing predictably doesn't guarantee that three do.
We observed a qualitative change when we moved from a 3-agent to a 4-agent pipeline. Distinct behavioral patterns grew faster than linearly. Failure modes that were rare with 3 agents became common with 4.
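One lens on why the jump is faster than linear: the number of directed caller-callee channels grows quadratically in the agent count, so each added agent multiplies the interaction surface rather than extending it. A sketch (the channel count is an illustration, not the metric we tracked in production):

```python
def channels(n: int) -> int:
    """Ordered (caller, callee) pairs among n mutually callable agents."""
    return n * (n - 1)

print(channels(3), channels(4))  # 6 -> 12: one added agent doubles the channels
```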
2. The Context Accumulation Threshold
Each agent handoff adds context to the pipeline's accumulated state. There is a threshold beyond which the accumulated context contains enough conflicting, stale, or inconsistent information that downstream agents become unreliable.
This threshold is not at the context window limit — it's at the point where effective information density degrades below the level needed for reliable reasoning.
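One crude proxy for effective information density is the fraction of distinct tokens in the accumulated context; repeated tool-call echoes and boilerplate drive it down even while the window is far from full. A sketch, with an illustrative floor value rather than a measured constant:

```python
DENSITY_FLOOR = 0.35  # illustrative threshold, not a measured constant

def density(context: str) -> float:
    """Fraction of distinct whitespace-separated tokens in the context."""
    tokens = context.split()
    return len(set(tokens)) / len(tokens) if tokens else 1.0

# Forty repeated tool-call echoes drown out five tokens of real signal.
accumulated = "fetch profile " * 40 + "candidate matches role requirements"
if density(accumulated) < DENSITY_FLOOR:
    print("context degraded: prune before the next agent")
```

A real measure would be semantic rather than lexical, but even this crude ratio catches the common failure mode of a context dominated by near-duplicate content.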
3. The Tool Interaction Threshold
A single agent with 3–5 tools uses them predictably. You can enumerate reasonable tool usage patterns and verify correctness. An agent with 15 tools begins combining tools in unanticipated ways — finding interaction patterns the tool authors didn't design.
| Below Threshold (≤8 tools) | Above Threshold (10+ tools) |
|---|---|
| Tool usage patterns can be enumerated | Tools combined in unanticipated ways |
| Near-complete test coverage of plausible patterns | Pattern space exceeds systematic testing |
| Novel interaction patterns rare | Novel interaction patterns appear in production |
We found our threshold at approximately 8–10 tools per agent. Below 8, we achieved near-complete test coverage of plausible patterns. Above 10, the space exceeded our ability to test systematically, and novel patterns appeared in production.
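The collapse in testability is combinatorial. Even counting only unordered tool pairs and short ordered call sequences, the space grows fast enough that the jump from 8 to 10 tools matters:

```python
from math import comb

for n in (5, 8, 10, 15):
    print(f"{n} tools: {comb(n, 2)} pairs, {n ** 3} ordered 3-call sequences")
```

At 5 tools there are 10 pairs to reason about; at 15 there are 105, and over three thousand distinct 3-call sequences. No test suite enumerates that.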
Detecting Approach to the Critical Point
In nuclear engineering, approach to criticality is detected in real time through continuous neutron flux monitoring with automatic control rod insertion. Agent system operators mostly don't have equivalent instrumentation.
The most reliable early warning signals:
Output variance. Before crossing a threshold, outputs on similar inputs are relatively similar. Approaching the threshold, variance increases. The distribution becomes bimodal — some runs produce clean responses, some produce garbled ones.
Evaluation score clustering. Before the threshold, scores distribute approximately normally. Approaching it, the distribution develops heavier tails and often becomes bimodal. The system gets very good or very bad on individual cases.
Manual review flag clustering. Flags should distribute approximately randomly unless there's systematic structure. Approaching a threshold, flags begin clustering around specific contexts, time windows, or input types — indicating supercritical behavior in specific contexts.
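A minimal monitor over recent evaluation scores can catch the first two signals together: rising variance and a hollowed-out middle of the distribution. A sketch with illustrative band and cutoff values:

```python
from statistics import pvariance

def looks_bimodal(scores: list[float], lo: float = 0.4, hi: float = 0.6) -> bool:
    """Flag when almost no scores fall in the middle band of the distribution."""
    middle = sum(lo <= s <= hi for s in scores) / len(scores)
    return middle < 0.1  # illustrative cutoff

healthy = [0.48, 0.52, 0.50, 0.55, 0.47, 0.51]
stressed = [0.95, 0.92, 0.12, 0.08, 0.97, 0.05]  # very good or very bad

print(pvariance(stressed) > pvariance(healthy))  # True: variance jumped
print(looks_bimodal(stressed))                   # True: the middle emptied out
```

A proper deployment would use a statistical bimodality test over a rolling window, but even this crude band check would have fired before the thresholds we crossed.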
Control Rods and Circuit Breakers
Nuclear reactors operate at controlled criticality using control rods. Agent systems need engineering analogues:
Hard limits on tool call depth. A maximum depth of the agent call graph per pipeline execution. Below the limit, agents call each other freely. At the limit, further calls are blocked and the pipeline must produce a response from accumulated context. This prevents supercriticality through recursive invocation.
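The limit itself is a few lines. The agent representation and the cap of 3 below are illustrative; the point is that the fallback fires instead of the cascade:

```python
MAX_DEPTH = 3  # illustrative cap, tuned per pipeline in practice

def run(agent, prompt: str, depth: int = 0) -> str:
    """Invoke an agent; each delegation increments depth until the hard limit."""
    if depth >= MAX_DEPTH:
        return "fallback: answer from accumulated context"
    kind, payload = agent(prompt)
    if kind == "delegate":
        next_agent, next_prompt = payload
        return run(next_agent, next_prompt, depth + 1)
    return payload  # a final answer

def looper(prompt):
    # Toy agent that always delegates to itself, forcing the limit to trip.
    return ("delegate", (looper, prompt))

print(run(looper, "find candidates"))  # the cap blocks the recursion
```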
Context window checkpoints. At specific pipeline points, accumulated context is pruned — summarized and compressed — before passing to the next agent. This prevents crossing the accumulation threshold.
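A checkpoint can be as simple as a budget check between stages. The summarizer here is a stand-in for whatever compression the pipeline actually uses, and the budget is illustrative:

```python
CONTEXT_BUDGET = 2_000  # illustrative character budget per handoff

def summarize(text: str, budget: int) -> str:
    # Stand-in: a real pipeline would call an LLM summarizer here.
    return text[:budget]

def checkpoint(context: str) -> str:
    """Pass context through unchanged if small; compress it if over budget."""
    return context if len(context) <= CONTEXT_BUDGET else summarize(context, CONTEXT_BUDGET)

print(len(checkpoint("x" * 10_000)))  # pruned down to the 2,000-character budget
```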
Confidence-based routing. Uncertain outputs are flagged explicitly and routed to human review before being passed downstream. The confidence threshold is tunable — more conservative (fewer supercritical excursions, more human review) or more aggressive (more automation, more risk).
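The routing rule itself is small; the engineering is in choosing the threshold. A sketch with an illustrative value:

```python
REVIEW_THRESHOLD = 0.7  # raise for more automation, lower for more human review

def route(output: str, confidence: float) -> str:
    """Send low-confidence outputs to human review instead of downstream agents."""
    return "human_review" if confidence < REVIEW_THRESHOLD else "downstream"

print(route("strong match", 0.91))     # downstream
print(route("uncertain match", 0.42))  # human_review
```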
Subcritical Design and Supercritical Strategy
These represent different engineering choices with different risk profiles.
| Subcritical Design | Supercritical Strategy |
|---|---|
| Hard limits keep k below 1 at all times | k is allowed to exceed 1 under instrumentation |
| Predictable, bounded behavior and bounded capability | More capability, with the risk of runaway cascades |
| Minimal monitoring required | Continuous monitoring and circuit breakers required |
Most teams don't make this choice intentionally. They build a subcritical system in testing and accidentally cross into supercriticality in production.
The physicists who first approached nuclear criticality called it "tickling the dragon's tail." They were careful, instrumented, and had control rods immediately available.
That is the key takeaway. If you are going to operate near the critical point, do it the way they did: deliberately, with instrumentation that detects the approach, and with your control rods ready before you need them.