
The Observer Effect: How Evaluating an LLM Changes What It Does

What Measurement Actually Does

The Heisenberg uncertainty principle is one of the most frequently misunderstood results in science. Popular accounts describe it as a limitation on measurement precision. This is true as far as it goes, but misses the deeper point.

Before measurement, the position and momentum of a quantum particle are not merely unknown — they are genuinely indeterminate. The particle exists in a superposition of states. The act of measurement doesn't reveal a pre-existing value. It participates in creating one. The probability distribution collapses to a specific outcome, and the collapse is real.

Quantum mechanics is the most precisely tested theory in the history of science. And it captures something structurally important about LLM evaluation: the model's "answer" is not a fixed fact being revealed. It is a function of how you ask. Changing the measurement changes the thing being measured.

An LLM is not a static lookup table. When you query a model, you're running a computation whose output depends on the exact form of the input: the prompt, the context, the examples, the format, the stated instructions.


Chain-of-Thought as Wavefunction Collapse

Ask a complex reasoning question directly: "What is the best marketing strategy for a B2B SaaS company entering a crowded market?" You get an answer. Now ask with "Let's think step by step" prepended. You get a different answer — often more organized, more thorough, more reliable.

This is not the model "trying harder." Chain-of-thought prompting changes the computation. When the model generates intermediate steps before a conclusion, those intermediate tokens are working memory. The model with step-by-step reasoning has access to its own intermediate outputs as context — categorically different from a one-shot answer.

This is the observer effect in a precise sense: the measurement apparatus — the prompt format — changes the system being measured. You are not measuring the model's "inherent capability." You are measuring capability when queried in a specific way. The capability is a property of the model-plus-query-format system.

A benchmark using direct-answer prompts measures a different system than one using chain-of-thought, even if the task and model are identical. Treating these benchmark scores as equivalent is a measurement error.
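
To make the distinction concrete, here is a minimal sketch. The query_model function is a hypothetical placeholder for whatever client you actually use; the two calls below define two different measurement apparatuses over the same task.

```python
# Minimal sketch: one task, two measurement apparatuses. `query_model`
# is a hypothetical placeholder for a real LLM client call.

def query_model(prompt: str) -> str:
    # Placeholder: swap in your actual API or local-model call here.
    return "<model output>"

TASK = ("What is the best marketing strategy for a B2B SaaS company "
        "entering a crowded market?")

# Apparatus A: direct answer. The model goes from question to
# conclusion in a single pass, with no intermediate tokens in context.
direct_answer = query_model(TASK)

# Apparatus B: chain-of-thought. The instruction makes the model emit
# intermediate steps first; those tokens then sit in context and
# condition the conclusion. A different computation, not "trying harder".
cot_answer = query_model("Let's think step by step.\n\n" + TASK)

# Scoring direct_answer and cot_answer produces two benchmark numbers
# for two different model-plus-query-format systems.
```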

Few-Shot Context as State Preparation

In quantum mechanics, before measurement, you can prepare the system's state — apply operations that put it into a known superposition. State preparation sets up the system to be measured in a particular way. The outcome depends on both the apparatus and the prepared state.

Few-shot examples in an LLM prompt are state preparation. A model that's seen three examples of formal academic writing is in a different state than the same model with three examples of casual writing. It will handle the query differently — not just in style, but often in content.

At hireEZ, we ran systematic experiments varying few-shot examples while keeping task descriptions constant. Models' outputs shifted in ways correlated with the examples: formal examples → formal evaluations.

Few-shot effect size: larger than the differences across model versions in several cases.
Evaluation matrix: three variables per model (prompt formats, context lengths, example configurations).
Score reporting: the full distribution, not a single number; the spread is as informative as the mean.

When you evaluate few-shot performance, you are evaluating a specific prepared state, not an intrinsic property. A different set of examples prepares a different state and may produce substantially different accuracy.
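
In miniature, those experiments look something like the sketch below. The example pairs, the builder function, and the query are illustrative stand-ins, not the actual hireEZ harness.

```python
# Few-shot prompting as state preparation: identical task and query,
# two differently prepared states. All examples are illustrative.

FORMAL_SHOTS = [
    ("Assess this writing sample.",
     "The sample demonstrates precise diction and rigorous structure."),
]
CASUAL_SHOTS = [
    ("Assess this writing sample.",
     "Reads fine, gets the point across, nothing fancy."),
]

def build_prompt(shots, query):
    """Prepend worked examples to the query in a Q/A layout."""
    prefix = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
    return f"{prefix}\n\nQ: {query}\nA:"

QUERY = "Assess the attached cover letter for a senior role."

formal_state = build_prompt(FORMAL_SHOTS, QUERY)
casual_state = build_prompt(CASUAL_SHOTS, QUERY)

# Sending formal_state and casual_state to the same model measures two
# prepared states of that model. Diff the outputs for style *and*
# content shifts, not just tone.
```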

Benchmark Contamination and the Measurement Problem

The AI community has a well-known benchmark contamination problem: models evaluated on benchmarks whose test questions appear in the training data. There's a deeper version of this problem, explained by the observer effect.

Even if a model hasn't been trained on specific test questions, it may have been trained extensively on the format — the multiple-choice layout, the instruction style, the reasoning patterns. This format contamination isn't detected by checking for question overlap.

Surface Contamination

Model trained on specific test questions. Detected by checking for question overlap in training data. The known, simpler problem.

Format Contamination

Model trained on the benchmark's format, layout, and instruction style. Not detected by question overlap checks. The deeper, harder-to-detect problem.

The technically sound response: design evaluation formats deliberately different from anything in the training corpus. Same tasks, same difficulty, different format and presentation. You want to measure capability with a measurement apparatus the model has not adapted to.
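
One way to act on that, sketched below with an illustrative item: hold the underlying question fixed and re-present it in a layout the model is unlikely to have adapted to. The transformation shown is one possibility, not a canonical recipe.

```python
# Sketch: same capability, deliberately unfamiliar apparatus.
# The item and both formatters are illustrative.

ITEM = {
    "question": "Which sorting algorithm has O(n log n) worst-case time?",
    "choices": ["Quicksort", "Mergesort", "Bubble sort", "Insertion sort"],
    "answer": "Mergesort",
}

def canonical_mcq(item):
    """The layout the model has likely seen thousands of times."""
    letters = "ABCD"
    opts = "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
    return f"{item['question']}\n{opts}\nAnswer:"

def decontaminated(item):
    """Same knowledge probed, different presentation: a claim to verify
    instead of a letter to pick."""
    distractor = item["choices"][0]  # a wrong option, stated as a claim
    return (f"A colleague claims that {distractor} has an O(n log n) "
            f"worst-case bound. Are they right? If not, name an "
            f"algorithm that actually does, in one sentence.")
```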

The RLHF Observer Effect

The most consequential version of the observer effect is reinforcement learning from human feedback (RLHF). Human evaluators assess outputs. A reward model learns their preferences. The language model is fine-tuned to maximize reward. This is not merely observing the model; it is a measurement process that fundamentally reshapes its internal representations.

The post-RLHF model is not the pre-RLHF model with a surface-level bias applied at inference. It is a model whose weights have been updated — whose internal computational structure has changed — in response to the measurement.

RLHF generalizes: a model fine-tuned on human preferences in one domain behaves differently in domains not present in the RLHF training set. The measurement has changed the model in ways that extend beyond the measurement context.

Alignment failures under RLHF are often failures of the measurement apparatus, not of the optimization. Reward hacking, where a model achieves high reward scores without exhibiting the behavior humans actually prefer, is the model collapsing to a state the apparatus rates highly but was never designed to select for.
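
One pattern that helps surface reward hacking, sketched here with hypothetical scoring functions: cross-check the reward model against a signal the policy was never optimized against, and flag the disagreements.

```python
# Sketch: flag candidate reward hacking by cross-checking the learned
# reward model against an independent signal. Both scorers are
# hypothetical placeholders.

def reward_model_score(response: str) -> float:
    # Placeholder for the learned reward model's scalar score.
    return 0.0

def independent_check(response: str) -> bool:
    # Placeholder for a signal the policy was never optimized against:
    # held-out human ratings, rule-based constraints, a second reward
    # model trained on different preference data.
    return True

def flag_suspects(responses, threshold=0.9):
    """High reward plus a failed independent check: the apparatus rates
    the output highly, but it isn't the behavior the apparatus was
    meant to select for."""
    return [r for r in responses
            if reward_model_score(r) > threshold
            and not independent_check(r)]
```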

Designing Evaluation That Accounts for the Observer

The engineering response is to build evaluation infrastructure that takes the observer effect seriously.

Principle 1: Triangulation. Use multiple independent measurement approaches and look for convergence. Evaluate with direct prompts, chain-of-thought, few-shot with different example sets, and adversarial reformulations.

Evaluation Signal                        | What It Tells You
-----------------------------------------+------------------------------------------------
Similar performance across all formats   | You are measuring something robust
Dramatic variation across formats        | The capability is measurement-context-dependent
High mean, high variance                 | Unreliable in production; format-sensitive
High mean, low variance                  | Reliable across real-world query distributions

At hireEZ, every candidate model is evaluated across a matrix of prompt formats, context lengths, and example configurations. We report the full distribution of scores, not a single number. The spread is as informative as the mean.
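
As a harness, that matrix looks roughly like the sketch below. The run_eval function is a hypothetical entry point standing in for whatever benchmark runner you have, and the format list is illustrative.

```python
# Sketch of triangulation: one task suite, several apparatuses, full
# distribution reported. `run_eval` is a hypothetical harness call.
from statistics import mean, stdev

FORMATS = ["direct", "chain_of_thought", "few_shot_set_a",
           "few_shot_set_b", "adversarial_rephrase"]

def run_eval(model, task_suite, fmt) -> float:
    # Placeholder: accuracy of `model` on `task_suite` when every
    # question is presented in format `fmt`.
    return 0.0

def triangulate(model, task_suite):
    scores = {fmt: run_eval(model, task_suite, fmt) for fmt in FORMATS}
    values = list(scores.values())
    return {
        "per_format": scores,      # the full distribution...
        "mean": mean(values),      # ...not just this number
        "spread": stdev(values),   # high spread = format-sensitive
        "worst_case": min(values), # closer to what production will see
    }
```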

Principle 2: Format isolation. When evaluating a capability, design the evaluation format to be as dissimilar as possible from the training format, while keeping the capability constant. This pushes against format contamination and state-preparation effects. More expensive, but measures something closer to intrinsic capability.

Principle 3: Red-teaming as controlled observation. Adversarial testing is itself a measurement that, when used in training, changes the system. Red-teaming before deployment gives you information about failure modes. Red-teaming and then retraining on the failures changes the model. Both are valuable — they are not the same thing.
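
The distinction is easy to state in code. In the sketch below, with hypothetical helpers throughout, path (a) is controlled observation and path (b) is a measurement that changes the system.

```python
# Sketch: red-teaming as read-only measurement vs. red-teaming folded
# back into training. All helpers are hypothetical placeholders.

def passes(model, case) -> bool:
    return True   # placeholder: run the adversarial case, judge output

def fine_tune(model, cases):
    return model  # placeholder: retrain on the failure cases

def red_team(model, adversarial_suite, retrain=False):
    failures = [c for c in adversarial_suite if not passes(model, c)]
    if not retrain:
        # (a) Controlled observation: report failure modes, leave the
        # system unchanged. The next run measures the same model.
        return model, failures
    # (b) Measurement that changes the system: the next red-team run
    # is measuring a different model.
    return fine_tune(model, failures), failures
```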


The Invariance You Actually Want

The ideal evaluation outcome is not high performance on any single benchmark. It is performance that is invariant across reasonable variations in measurement approach.

Brittle Capability

90% on format A, 60% on format B. The capability depends on how it is observed. Unreliable in production, where query formats vary.

Production-Ready Capability

85% across formats A, B, and C. More useful in production because production involves a distribution of query formats you cannot fully control.

We are used to evaluating static systems with static tests. A compiler either produces correct output or it doesn't, regardless of how it is invoked. LLMs are fundamentally different: they are probability distributions over outputs, conditioned on the full context. Evaluation that ignores this isn't measuring what it thinks it's measuring.

// key takeaway

The observer effect is not a bug in LLM evaluation. It is a structural feature. Building evaluation discipline around it — triangulation, format isolation, and distribution reporting — is not optional if you want to understand what you have before you put it into production.