
Critical Mass: Why Agent Systems Have Thresholds, Not Gradients

The Physics of the Critical Point

A nuclear chain reaction is a precisely structured process. A neutron collides with a fissile nucleus — uranium-235 or plutonium-239. The nucleus absorbs the neutron, becomes unstable, and splits. The fission event releases energy and produces additional neutrons — typically two or three. These neutrons can collide with additional fissile nuclei, producing more fission events, releasing more energy, releasing more neutrons.

Whether this cascade sustains depends on a single dimensionless parameter: the neutron multiplication factor k, defined as the average number of fission events caused by the neutrons produced in a single fission event.

| k value | State | Behavior |
| --- | --- | --- |
| k < 1 | Subcritical | Reaction fizzles and dies out |
| k = 1 | Critical | Each event causes exactly one subsequent event — sustained at a constant rate |
| k > 1 | Supercritical | Reaction grows exponentially |
The critical point at k = 1 is not a gradient. Below it, the reaction fizzles. Above it, the reaction grows without bound. The transition is sharp. The behavior on either side is categorically different.

What determines k? The geometry of the fissile material, the density of fissile nuclei, the presence of neutron absorbers (control rods), and reflective properties of surrounding material. Small changes — a few grams of additional material, a slight geometry change, a partial withdrawal of control rods — can move k from subcritical to supercritical. The threshold is real, and it is sensitive.
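The sharpness of that threshold is easy to see in a toy branching-process simulation. This is a deliberately simplified model (each event spawns floor(k) follow-on events, plus one more with probability equal to the fractional part of k), not reactor physics:

```python
import random

def cascade_size(k: float, max_events: int = 10_000) -> int:
    """Total events in one simulated chain reaction where each event
    spawns, on average, k follow-on events (capped at max_events)."""
    active, total = 1, 1
    while active and total < max_events:
        spawned = sum(
            int(k) + (1 if random.random() < k - int(k) else 0)
            for _ in range(active)
        )
        active = spawned
        total += spawned
    return total

random.seed(0)
subcritical = [cascade_size(0.8) for _ in range(200)]
supercritical = [cascade_size(1.2) for _ in range(50)]
# k = 0.8 cascades die out quickly; k = 1.2 cascades hit the event cap.
```

A change of 0.4 in k flips the behavior from "every run fizzles" to "every run saturates" — the categorical difference the table above describes.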

I have been running multi-agent AI systems in production long enough to observe that agent systems have exactly this structure. There is a critical point. Below it, the system is manageable. Above it, it is not. The transition is not gradual.


Agent Composition and the Multiplication Factor

The structural parallel to nuclear criticality in multi-agent AI systems is not obvious until you look at the right quantity. The relevant analogue of k is: the average number of additional agent invocations caused by a single agent invocation.

A single agent with a fixed tool set produces a response that triggers zero additional agent calls — definitionally subcritical. Add a second agent that can call the first, and k approaches 1 in a controlled way. Add a third that can call either the first or second, which can in turn call each other, and you've introduced the possibility of call chains whose length and branching are determined by learned behaviors, not static program logic.
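Under that definition, an effective k can be estimated directly from invocation traces. A minimal sketch, assuming you log (caller, callee) invocation-ID pairs — the trace format here is hypothetical:

```python
from collections import Counter

def effective_k(call_edges: list[tuple[str, str]]) -> float:
    """Average number of downstream agent invocations triggered per
    invocation, estimated from (caller_id, callee_id) trace edges."""
    children = Counter(caller for caller, _ in call_edges)
    nodes = {n for edge in call_edges for n in edge}
    return sum(children.values()) / len(nodes) if nodes else 0.0

# A linear 3-agent chain: 2 triggered calls across 3 invocations,
# so effective_k is about 0.67 — comfortably subcritical.
linear = [("a1", "b1"), ("b1", "c1")]
```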

The failure mode is not just infinite loops — though those happen. The more subtle failure is that the system crosses from subcritical to supercritical in specific contexts without operators being aware the threshold was crossed. The system was tested in typical contexts, where it was subcritical; it went supercritical in an edge case that testing didn't cover.

The result: a cascade of agent calls, a rapidly growing context window, a response that takes ten times as long as expected, and output that reflects the confusion of an overcomplicated reasoning chain.


The Three Threshold Types We Found

Through production operation of the hireEZ agentic pipeline, I've identified three distinct threshold types.

1. The Agent Composition Threshold

A single agent behaving predictably in isolation doesn't guarantee two compose predictably. Two composing predictably doesn't guarantee three do.

We observed a qualitative change when we moved from a 3-agent to a 4-agent pipeline. The number of distinct behavioral patterns grew faster than linearly, and failure modes that were rare with 3 agents became common with 4.

The analysis revealed why: with 3 agents arranged linearly, each receives inputs from at most one other agent. With 4, we'd introduced branching — two agents could contribute to a single downstream agent's context. This doubled the context interaction types, approximately doubling output variance. The threshold was a structural property of the communication graph, not just the agent count.
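That structural property is cheap to check before deployment. A sketch, assuming the pipeline's communication graph is available as a list of directed edges (agent names are illustrative):

```python
from collections import Counter

def fan_in(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Agents receiving context from more than one upstream agent.
    Any entry here marks a branch point where cross-agent context
    interactions (and the variance they bring) can appear."""
    indegree = Counter(dst for _, dst in edges)
    return {agent: n for agent, n in indegree.items() if n > 1}

# Linear 3-agent pipeline: no fan-in anywhere.
assert fan_in([("A", "B"), ("B", "C")]) == {}
# 4-agent pipeline with branching: D merges context from B and C.
assert fan_in([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]) == {"D": 2}
```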

2. The Context Accumulation Threshold

Each agent handoff adds context to the pipeline's accumulated state. There is a threshold beyond which the accumulated context contains enough conflicting, stale, or inconsistent information that downstream agents become unreliable.

This threshold is not at the context window limit — it's at the point where effective information density degrades below the level needed for reliable reasoning.

| Steps | Zone | Behavior |
| --- | --- | --- |
| ≤4 | Reliable completion | Tasks complete reliably with low hallucination |
| 5–7 | Critical transition | Hallucination rates increase — the accumulation threshold |
| 8+ | Degraded reliability | High hallucination, unreliable downstream agents |
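One crude way to watch for this degradation is a proxy for effective information density, such as the unique-token ratio of the accumulated context. This is an illustrative heuristic, not a measurement we validated:

```python
def info_density(accumulated: list[str]) -> float:
    """Fraction of whitespace-delimited tokens in the accumulated
    context that are unique. Copied-forward, stale, or repeated
    handoff content drives this toward zero."""
    tokens = " ".join(accumulated).split()
    return len(set(tokens)) / len(tokens) if tokens else 1.0

clean = ["retrieve profile", "rank candidates"]
bloated = ["retrieve profile", "retrieve profile", "rank candidates"]
# Duplicated handoff content lowers the score.
assert info_density(bloated) < info_density(clean)
```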

3. The Tool Interaction Threshold

A single agent with 3–5 tools uses them predictably. You can enumerate reasonable tool usage patterns and verify correctness. An agent with 15 tools begins combining tools in unanticipated ways — finding interaction patterns the tool authors didn't design.

Below Threshold (≤8 tools)

Tool use is specified behavior. You can enumerate plausible usage patterns and achieve near-complete test coverage before deployment.

Above Threshold (10+ tools)

Tool use is emergent behavior. Novel interaction patterns appear in production that weren't visible in testing. The space exceeds systematic test coverage.

We found our threshold at approximately 8–10 tools per agent. Below 8, we achieved near-complete test coverage of plausible patterns. Above 10, the space exceeded our ability to test systematically, and novel patterns appeared in production.
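A budget check like the following encodes those numbers as a guardrail. The 8/10 limits come from our deployment; treat them as starting points, not constants:

```python
SOFT_LIMIT, HARD_LIMIT = 8, 10  # per-agent tool counts observed above

def tool_budget(tools: list[str]) -> str:
    """Classify an agent's tool count against the interaction threshold."""
    if len(tools) <= SOFT_LIMIT:
        return "testable"    # usage patterns can still be enumerated
    if len(tools) <= HARD_LIMIT:
        return "transition"  # expand coverage before adding more tools
    return "emergent"        # expect novel interaction patterns in prod

assert tool_budget([f"tool_{i}" for i in range(12)]) == "emergent"
```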


Detecting Approach to the Critical Point

In nuclear engineering, approach to criticality is detected in real time through continuous neutron flux monitoring with automatic control rod insertion. Agent system operators mostly don't have equivalent instrumentation.

The most reliable early warning signals:

Output variance. Before crossing a threshold, outputs on similar inputs are relatively similar. Approaching the threshold, variance increases. The distribution becomes bimodal — some runs produce clean responses, some produce garbled ones.

Evaluation score clustering. Before the threshold, scores distribute approximately normally. Approaching it, the distribution develops heavier tails and often becomes bimodal. The system gets very good or very bad on individual cases.

Manual review flag clustering. Flags should distribute approximately randomly unless there's systematic structure. Approaching a threshold, flags begin clustering around specific contexts, time windows, or input types — indicating supercritical behavior in specific contexts.
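Score clustering, in particular, can be quantified. A sketch using Sarle's bimodality coefficient (values above roughly 5/9 suggest a bimodal distribution); the example score lists are fabricated for illustration:

```python
import statistics as st

def bimodality_coefficient(xs: list[float]) -> float:
    """Sarle's bimodality coefficient from sample skewness and kurtosis.
    Higher values (above ~5/9) point toward a bimodal distribution."""
    n = len(xs)
    mean = st.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skew ** 2 + 1) / (excess_kurtosis + correction)

healthy = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70, 0.71, 0.72]
split = [0.95, 0.96, 0.94, 0.97, 0.15, 0.12, 0.14, 0.16]
# The "very good or very bad" distribution scores markedly higher.
assert bimodality_coefficient(split) > bimodality_coefficient(healthy)
```

Tracked over time, a rising coefficient on a fixed evaluation set is exactly the "approaching the threshold" signal described above.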


Control Rods and Circuit Breakers

Nuclear reactors operate at controlled criticality using control rods. Agent systems need engineering analogues:

Hard limits on tool call depth. A maximum depth of the agent call graph per pipeline execution. Below the limit, agents call each other freely. At the limit, further calls are blocked and the pipeline must produce a response from accumulated context. This prevents supercriticality through recursive invocation.
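A minimal depth breaker, with agents modeled as callables that may delegate through a recurse handle — the interface is illustrative, not a real framework API:

```python
class DepthLimitExceeded(Exception):
    """Raised when the agent call graph reaches its depth limit."""

class Dispatcher:
    def __init__(self, max_depth: int = 5):
        self.max_depth = max_depth

    def call(self, agent, task, depth: int = 0):
        if depth >= self.max_depth:
            raise DepthLimitExceeded(f"call depth {depth} hit the limit")
        # Agents receive a recurse handle that re-enters with depth + 1.
        return agent(task, lambda a, t: self.call(a, t, depth + 1))

def looping_agent(task, recurse):
    return recurse(looping_agent, task)  # pathological: always delegates

tripped = False
try:
    Dispatcher(max_depth=5).call(looping_agent, "resolve")
except DepthLimitExceeded:
    tripped = True  # the breaker fires instead of recursing forever
assert tripped
```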

Context window checkpoints. At specific pipeline points, accumulated context is pruned — summarized and compressed — before passing to the next agent. This prevents crossing the accumulation threshold.

Checkpoints must be placed before the threshold, not after — you cannot easily recover from context overload by pruning at the point of overload.
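A checkpoint sketch: keep the most recent handoffs verbatim and collapse everything older into one summary entry. The `summarize` callable stands in for a real summarization model call (an assumption):

```python
def checkpoint(context: list[str], max_items: int = 6,
               summarize=lambda msgs: "summary: " + " | ".join(msgs)) -> list[str]:
    """Prune accumulated context before the next handoff, keeping
    the last (max_items - 1) entries and one summary of the rest."""
    if len(context) <= max_items:
        return context
    keep = max_items - 1
    return [summarize(context[:-keep])] + context[-keep:]

ctx = [f"step {i} output" for i in range(10)]
pruned = checkpoint(ctx)
assert len(pruned) == 6 and pruned[0].startswith("summary:")
```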

Confidence-based routing. Uncertain outputs are flagged explicitly and routed to human review before being passed downstream. The confidence threshold is tunable — more conservative (fewer supercritical excursions, more human review) or more aggressive (more automation, more risk).
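In its simplest form, the router is a gate on a self-reported confidence field (the output schema here is hypothetical):

```python
def route(output: dict, threshold: float = 0.7) -> str:
    """Send low-confidence outputs to human review instead of downstream.
    Raising `threshold` trades automation for fewer supercritical excursions."""
    confidence = output.get("confidence", 0.0)  # missing => most conservative
    return "downstream" if confidence >= threshold else "human_review"

assert route({"text": "offer drafted", "confidence": 0.92}) == "downstream"
assert route({"text": "ambiguous match", "confidence": 0.41}) == "human_review"
```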


Subcritical Design and Supercritical Strategy

These represent different engineering choices with different risk profiles.

Subcritical Design

Accepts reduced capability for predictability. Tool counts are constrained, context accumulation is managed, composition graphs are kept simple. Right for high-stakes applications: healthcare, legal, financial.

Supercritical Strategy

Accepts increased risk for increased capability. Deliberately composes agents to produce emergent behavior, investing heavily in circuit breakers, observability, and human review to catch failures. Fits recruiting, customer service, content generation.

Most teams don't make this choice intentionally. They build a subcritical system in testing and accidentally cross into supercriticality in production.

The better approach: instrument for the threshold before you need to find it. Build the variance monitoring. Set explicit limits. Then push toward the critical point in a controlled way, with circuit breakers engaged.

The physicists who first approached nuclear criticality called it "tickling the dragon's tail." They were careful, instrumented, and had control rods immediately available.

// key takeaway

We are doing the same thing with agent systems. Instrument for the threshold before you cross it. Build variance monitoring. Set explicit limits. Engage circuit breakers before you push toward the edge — not after you've already crossed it.