What Measurement Actually Does
The Heisenberg uncertainty principle is one of the most frequently misunderstood results in science. Popular accounts describe it as a limitation on measurement precision. This is true as far as it goes, but misses the deeper point.
Before measurement, the position and momentum of a quantum particle are not merely unknown — they are genuinely indeterminate. The particle exists in a superposition of states. The act of measurement doesn't reveal a pre-existing value. It participates in creating one. The probability distribution collapses to a specific outcome, and the collapse is real.
The same is true, in a structural sense, of large language models. An LLM is not a static lookup table. When you query a model, you are running a computation whose output depends on the exact form of the input: the prompt, the context, the examples, the format, the stated instructions.
Chain-of-Thought as Wavefunction Collapse
Ask a complex reasoning question directly: "What is the best marketing strategy for a B2B SaaS company entering a crowded market?" You get an answer. Now ask with "Let's think step by step" prepended. You get a different answer — often more organized, more thorough, more reliable.
This is not the model "trying harder." Chain-of-thought prompting changes the computation. When the model generates intermediate steps before a conclusion, those intermediate tokens are working memory. The model with step-by-step reasoning has access to its own intermediate outputs as context — categorically different from a one-shot answer.
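The two measurement setups can be sketched as plain prompt construction. This is a minimal illustration; the function name and prompt wording are assumptions for the example, not a prescribed API:

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Assemble the same question under two measurement setups.

    With chain_of_thought=True, the model is invited to emit intermediate
    reasoning tokens first. Those tokens enter its own context window and
    change the computation, not just the presentation of the answer.
    """
    if chain_of_thought:
        return f"{question}\n\nLet's think step by step."
    return question

question = "What is the best marketing strategy for a B2B SaaS company?"
direct = build_prompt(question)                          # one-shot measurement
cot = build_prompt(question, chain_of_thought=True)      # altered measurement
```

Same question, two different experiments: the second prompt changes what computation the model runs, which is why the answers differ.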
Few-Shot Context as State Preparation
In quantum mechanics, before measurement, you can prepare the system's state — apply operations that put it into a known superposition. State preparation sets up the system to be measured in a particular way. The outcome depends on both the apparatus and the prepared state.
Few-shot examples in an LLM prompt are state preparation. A model that's seen three examples of formal academic writing is in a different state than the same model with three examples of casual writing. It will handle the query differently — not just in style, but often in content.
At hireEZ, we ran systematic experiments varying few-shot examples while keeping task descriptions constant. The models' outputs shifted in ways systematically correlated with the examples: formal examples produced formal evaluations, casual examples produced casual ones.
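State preparation is, mechanically, just prompt assembly. A minimal sketch, with hypothetical example content and a generic Input/Output template assumed for illustration:

```python
def prepare_state(examples: list[tuple[str, str]], query: str) -> str:
    """Prepend few-shot examples: the 'state preparation' before measurement.

    The same query issued after formal examples and after casual examples
    is, in effect, a measurement performed on a differently prepared system.
    """
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"

formal = [("Summarize Q3 results.", "Revenue increased 12% quarter-over-quarter.")]
casual = [("Summarize Q3 results.", "Q3 was solid -- up 12%!")]

# Identical query, two differently prepared states:
p_formal = prepare_state(formal, "Evaluate this candidate's resume.")
p_casual = prepare_state(casual, "Evaluate this candidate's resume.")
```

The task description is held constant; only the prepared state differs, which is exactly the variable our experiments isolated.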
Benchmark Contamination and the Measurement Problem
The AI community has a well-known benchmark contamination problem: models evaluated on benchmarks whose test questions appear in the training data. There's a deeper version of this problem, explained by the observer effect.
Even if a model hasn't been trained on specific test questions, it may have been trained extensively on the format — the multiple-choice layout, the instruction style, the reasoning patterns. This format contamination isn't detected by checking for question overlap.
Surface contamination: the test questions themselves appear verbatim (or near-verbatim) in the training data. This is the well-known version, and it can be caught by checking for overlap.

Format contamination: the questions are novel, but the benchmark's layout, instruction style, and reasoning patterns saturate the training data. No question-overlap check catches it.
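The standard surface-contamination check is n-gram overlap against an index of the training corpus. A minimal sketch (whitespace tokenization and the window size of 8 are simplifying assumptions) that also shows why format contamination slips through:

```python
def index_corpus(texts: list[str], n: int = 8) -> set[str]:
    """Index every length-n token window of the training corpus."""
    idx: set[str] = set()
    for t in texts:
        tok = t.split()
        idx |= {" ".join(tok[i:i + n]) for i in range(len(tok) - n + 1)}
    return idx

def ngram_overlap(test_question: str, corpus_index: set[str], n: int = 8) -> bool:
    """Surface check: does any length-n window of the question appear verbatim?

    A model trained on thousands of questions in the same multiple-choice
    layout, with no shared n-grams, passes this check untouched.
    """
    tokens = test_question.split()
    windows = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return bool(windows & corpus_index)

corpus = index_corpus(["the quick brown fox jumps over the lazy dog"])
contaminated = ngram_overlap("Q1. the quick brown fox jumps over the lazy dog", corpus)  # True
clean = ngram_overlap("what is the capital of france and why does it matter today", corpus)  # False
```

The second result is the trap: "False" here rules out surface contamination only. Format contamination is invisible to any check of this shape.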
The RLHF Observer Effect
The most powerful version of the observer effect is RLHF, reinforcement learning from human feedback. Human evaluators assess outputs. A reward model learns their preferences. The language model is fine-tuned to maximize that reward. This is not merely observing the model; it is running a measurement process that fundamentally reshapes its internal representations.
The post-RLHF model is not the pre-RLHF model with a surface-level bias applied at inference. It is a model whose weights have been updated — whose internal computational structure has changed — in response to the measurement.
RLHF generalizes: a model fine-tuned on human preferences in one domain behaves differently in domains not present in the RLHF training set. The measurement has changed the model in ways that extend beyond the measurement context.
Designing Evaluation That Accounts for the Observer
The engineering response is to build evaluation infrastructure that takes the observer effect seriously.
Principle 1: Triangulation. Use multiple independent measurement approaches and look for convergence. Evaluate with direct prompts, chain-of-thought, few-shot with different example sets, and adversarial reformulations.
| Evaluation Signal | What It Tells You |
|---|---|
| Similar performance across all formats | Measuring something robust |
| Dramatic variation across formats | Capability is measurement-context-dependent |
| High mean, high variance | Unreliable in production — format-sensitive |
| High mean, low variance | Reliable across real-world query distributions |
At hireEZ, every candidate model is evaluated across a matrix of prompt formats, context lengths, and example configurations. We report the full distribution of scores, not a single number. The spread is as informative as the mean.
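Reporting the distribution rather than a single number can be sketched in a few lines. The threshold values here are illustrative assumptions, not standards; the verdict labels follow the table above:

```python
from statistics import mean, pstdev

def summarize_eval(scores_by_format: dict[str, float],
                   high_mean: float = 0.8, max_spread: float = 0.05) -> dict:
    """Summarize one capability's scores across a matrix of prompt formats.

    The spread is as informative as the mean: only high mean with low
    spread indicates reliability across real-world query distributions.
    """
    scores = list(scores_by_format.values())
    mu, sigma = mean(scores), pstdev(scores)
    if sigma > max_spread:
        verdict = "format-sensitive"   # high variance: unreliable in production
    elif mu >= high_mean:
        verdict = "reliable"           # high mean, low variance
    else:
        verdict = "robustly weak"      # low mean, low variance
    return {"mean": round(mu, 3), "spread": round(sigma, 3), "verdict": verdict}

stable = summarize_eval({"direct": 0.82, "cot": 0.85, "few_shot": 0.83})
brittle = summarize_eval({"direct": 0.90, "cot": 0.60, "adversarial": 0.40})
```

The second model has the higher single-format score, and is the one you should not ship.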
Principle 2: Format isolation. When evaluating a capability, design the evaluation format to be as dissimilar as possible from the training format, while keeping the capability constant. This pushes against format contamination and state-preparation effects. More expensive, but measures something closer to intrinsic capability.
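Format isolation in practice means re-rendering the same probe in deliberately dissimilar layouts. A minimal sketch (the two renderings and their wording are illustrative assumptions):

```python
def isolate_format(question: str, choices: list[str]) -> dict[str, str]:
    """Render one capability probe in deliberately dissimilar formats.

    If the benchmark's native format is lettered multiple choice, the
    free-form rendering holds the capability constant while discarding
    the format the model may have been trained on.
    """
    lettered = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    return {
        "multiple_choice": f"{question}\n{lettered}\nAnswer with the letter only.",
        "free_form": f"{question} Answer in one short sentence, with no options given.",
    }

variants = isolate_format("What is the capital of France?", ["Paris", "Lyon", "Nice"])
```

A gap between the two renderings' scores is a direct estimate of how much of the benchmark number was format, not capability.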
Principle 3: Red-teaming as controlled observation. Adversarial testing is itself a measurement that, when used in training, changes the system. Red-teaming before deployment gives you information about failure modes. Red-teaming and then retraining on the failures changes the model. Both are valuable — they are not the same thing.
The Invariance You Actually Want
The ideal evaluation outcome is not high performance on any single benchmark. It is performance that is invariant across reasonable variations in measurement approach.
Brittle capability: high scores on one benchmark format that evaporate when the prompt is rephrased, the examples change, or the layout shifts. The measurement apparatus was producing the result.

Production-ready capability: scores that hold up across prompt formats, context lengths, and example configurations. The capability is invariant under reasonable changes in how you observe it.
The key takeaway: in both quantum mechanics and LLM evaluation, the measurement is part of the system. The capability you can trust is the one that survives changes in how you measure it.