The conversation with our CFO went roughly like this.
"The AI screening feature is getting great adoption. Usage is up significantly month over month."
"That's great. What's it costing us?"
"More than we expected."
That conversation — some version of it — happens at every company that scales an LLM product past the prototype stage without building token economics into the architecture from the start. The AI costs are buried in infrastructure spend for a few months, and then someone does the unit economics math and the numbers are uncomfortable.
At 100+ enterprise customers running AI screening, scheduling, and evaluation workflows at scale, token costs compound fast. What looks reasonable per-call at low volume becomes significant margin pressure at high volume. And unlike compute costs that scale smoothly, LLM token costs have a structure that rewards architectural discipline and punishes naivety in specific, predictable ways.
The Input/Output Asymmetry You Need to Understand
The most important thing about token pricing that doesn't get enough attention: input tokens and output tokens are priced differently, and the ratio matters enormously for architecture.
Across most frontier model providers, output tokens cost 2–4x more than input tokens. The asymmetry exists because generation is sequential: the model runs a full forward pass for every output token, while input tokens are processed in parallel during a single prefill pass.
This has direct architectural implications:
- Verbose outputs are expensive waste. A prompt that asks for "a detailed narrative evaluation" generates far more output tokens than one that asks for "a JSON object with scores and a one-sentence rationale per dimension." If the downstream consumer is your application logic, the verbose version is pure waste.
- Input compression has diminishing returns. Compressing your system prompt to save 200 input tokens per call may not be worth three hours of engineering time. The bigger wins are almost always on the output side: tighter output contracts, structured formats, explicit length constraints. The sketch below makes the arithmetic concrete.
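A back-of-the-envelope sketch of that asymmetry. The per-token prices here are hypothetical placeholders, not any particular provider's rates; what matters is the ratio.

```python
# Hypothetical per-token prices (placeholders, not a real provider's rates).
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 12.00 / 1_000_000  # $ per output token (4x input here)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Same 2,000-token input; only the output contract differs.
verbose = call_cost(2_000, 1_500)  # "a detailed narrative evaluation"
tight = call_cost(2_000, 250)      # JSON scores + one-line rationales

print(f"verbose: ${verbose:.4f}")  # $0.0240
print(f"tight:   ${tight:.4f}")    # $0.0090
# Tightening the output contract cuts per-call cost by ~62% here, while
# trimming 200 input tokens from the prompt would save only $0.0006.
```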
Prompt Compression That Actually Works
Prompt compression gets a lot of hype and a lot of skepticism. Both are partly warranted.
The skeptical view: aggressive compression reduces context quality, which reduces output quality, and the savings aren't worth the degradation. This is true if you compress naively — stripping whitespace, cutting sentences mid-thought, removing grounding examples.
The more nuanced reality: there are large categories of prompt content that can be compressed significantly without quality loss, and a smaller category where compression carries real risk.
High Compression Potential
Redundant restatements of the same instruction, politeness boilerplate, over-long role descriptions, and context the model is handed twice. This content can usually be cut or tightened with no measurable quality change.
Low Compression Potential (Load-Bearing)
Grounding examples, output format contracts, and edge-case handling rules. This is exactly what naive compression strips, and cutting it degrades quality the way the skeptics predict.
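Mechanically, a safe compressor targets only the first category and refuses to touch the second. A minimal sketch, assuming an invented `<keep>` marker convention for flagging load-bearing spans:

```python
import re

# Hypothetical convention: load-bearing spans (grounding examples, output
# contracts) are wrapped in <keep>...</keep> so the compressor skips them.
KEEP = re.compile(r"<keep>.*?</keep>", re.DOTALL)

def compress_prompt(prompt: str) -> str:
    """Collapse whitespace and strip crude filler outside protected spans."""
    protected = KEEP.findall(prompt)
    # Swap protected spans for placeholders before compressing.
    working = KEEP.sub("\x00", prompt)
    working = re.sub(r"[ \t]+", " ", working)           # collapse runs of spaces
    working = re.sub(r"\n{3,}", "\n\n", working)        # collapse blank lines
    working = re.sub(r"(?i)\bplease\b ?", "", working)  # politeness boilerplate
    # Restore protected spans untouched, in order.
    for span in protected:
        working = working.replace("\x00", span, 1)
    return working.strip()
```

In practice you'd run this once at template-build time, not per request, so the cost is paid once.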
Caching Strategies at Scale
Caching is the highest-leverage cost reduction strategy available to most production LLM systems, and it's underused because LLM caching strategies differ from database caching strategies.
Prompt prefix caching. Most major model providers now support prefix caching — if the same prefix is used across multiple calls, the KV cache is reused rather than recomputed. This is significant for systems where a large, static system prompt precedes every call.
The architectural implication: structure prompts so that static, shared content (system instructions, output format, role definition) appears at the beginning and dynamic, session-specific content appears at the end. This maximizes the cacheable prefix length.
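A minimal sketch of that structure, assuming a chat-style API where you assemble the system prompt and messages per call (the constants and field names are illustrative, not any specific SDK's schema):

```python
# Static content, byte-for-byte identical on every call: the cacheable prefix.
SYSTEM_INSTRUCTIONS = (
    "You are a screening evaluator. Score each response dimension 1-5.\n"
)
OUTPUT_CONTRACT = (
    'Respond with JSON: {"scores": {...}, "rationale": "..."}.\n'
)

def build_request(session_context: str, candidate_data: str) -> dict:
    """Order prompt content so the static, cacheable prefix is maximized.

    Prefix caching reuses the KV cache for the longest matching prefix
    across calls, so static content goes first and dynamic content last.
    """
    return {
        "system": SYSTEM_INSTRUCTIONS + OUTPUT_CONTRACT,  # static, cacheable
        "messages": [
            # Dynamic per call: anything that varies belongs after the prefix.
            {"role": "user", "content": f"{session_context}\n\n{candidate_data}"},
        ],
    }
```

One practical gotcha: a timestamp, request ID, or reordered field anywhere in the "static" section breaks the byte-identical match and forfeits the cache hit.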
Semantic caching. The idea: many LLM calls in a production system are semantically similar to prior calls — not identical, but close enough that a cached response is a high-quality answer. A semantic cache returns cached responses when the incoming query is within a configurable similarity threshold.
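A minimal semantic cache sketch, assuming an embedding function (`embed()`, hypothetical here) you already have and a linear scan; production versions add TTLs, invalidation, and a proper vector index:

```python
import numpy as np

# Tune per task: too low returns wrong answers, too high never hits.
SIMILARITY_THRESHOLD = 0.95

class SemanticCache:
    def __init__(self):
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query_embedding: np.ndarray) -> str | None:
        """Return a cached response if any prior query is close enough."""
        for emb, response in zip(self.embeddings, self.responses):
            cosine = float(np.dot(emb, query_embedding) /
                           (np.linalg.norm(emb) * np.linalg.norm(query_embedding)))
            if cosine >= SIMILARITY_THRESHOLD:
                return response
        return None

    def store(self, query_embedding: np.ndarray, response: str) -> None:
        self.embeddings.append(query_embedding)
        self.responses.append(response)

# Usage: check the cache before spending tokens.
# emb = embed(query)                  # hypothetical embedding call
# cached = cache.lookup(emb)
# answer = cached or call_llm(query)  # call_llm is also hypothetical
```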
| Caching Type | Works Well For | Works Poorly For |
|---|---|---|
| Prefix caching | Any call with a large static system prompt | Calls with highly dynamic prefixes |
| Semantic caching | Lookup tasks, classification, role understanding across candidates | Generative tasks where specific input content matters |
The Model Selection Decision
A question I get asked a lot: when do you use a smaller, cheaper model versus a frontier model?
The answer: think about it as a pipeline architecture problem, not a model comparison problem. Most production LLM workloads are a sequence of subtasks with different complexity profiles.
In our screening pipeline, the task sequence looks like this:
- Parse and structure the input — simple
- Classify role requirements — moderate
- Generate evaluation rubric — hard
- Evaluate each response dimension — hard
- Synthesize overall assessment — moderate
- Format structured output — trivial
Running the entire pipeline on a frontier model is expensive. Running it entirely on a smaller model produces lower quality on the hard subtasks. The answer is a tiered pipeline: cheap models on simple tasks, capable models on hard tasks, with explicit handoff boundaries.
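A sketch of what the routing boundary can look like. The tier assignments mirror the task list above; the model names are placeholders, and the real mapping should come from per-subtask offline evals, not intuition:

```python
from enum import Enum

class Tier(Enum):
    SMALL = "small-cheap-model"   # placeholder for your validated cheap model
    FRONTIER = "frontier-model"   # placeholder for your validated capable model

# Explicit handoff boundaries: each subtask is pinned to the cheapest
# tier that clears its quality bar in offline evaluation.
TIER_BY_SUBTASK = {
    "parse_input": Tier.SMALL,             # simple
    "classify_requirements": Tier.SMALL,   # moderate: verify with evals first
    "generate_rubric": Tier.FRONTIER,      # hard
    "evaluate_dimensions": Tier.FRONTIER,  # hard
    "synthesize_assessment": Tier.SMALL,   # moderate: verify with evals first
    "format_output": Tier.SMALL,           # trivial
}

def route(subtask: str) -> Tier:
    # Unknown subtasks default to the capable tier: fail expensive, not wrong.
    return TIER_BY_SUBTASK.get(subtask, Tier.FRONTIER)
```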
The Token Budget as Product Design Constraint
The framing shift that made the biggest difference: treat the token budget not just as a cost optimization problem, but as a product design constraint that shapes what features you build and how.
Features have token floors — minimum token costs to deliver at acceptable quality. If a feature's token floor makes unit economics negative at your price point, the feature isn't viable without a redesign.
Concretely: we had a proposal for a "deep candidate profile" that would generate a multi-page narrative analysis. The output quality was impressive. The token cost meant we lost money on every run. We redesigned around a structured profile with dense, specific content instead of narrative prose, cutting output tokens substantially, and shipped something customers valued more because it was more actionable.
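The viability check behind that decision is simple arithmetic. All numbers below are hypothetical placeholders, not our actual rates or volumes; the point is the shape of the calculation:

```python
# Hypothetical unit economics for a single feature run (placeholder numbers).
PRICE_PER_RUN = 0.10        # effective revenue per use at your price point
INPUT_PRICE = 3.00 / 1e6    # $/input token (placeholder rate)
OUTPUT_PRICE = 12.00 / 1e6  # $/output token (placeholder rate)

def token_floor_cost(min_input_tokens: int, min_output_tokens: int) -> float:
    """Minimum token spend to deliver the feature at acceptable quality."""
    return min_input_tokens * INPUT_PRICE + min_output_tokens * OUTPUT_PRICE

narrative = token_floor_cost(20_000, 8_000)   # multi-page narrative profile
structured = token_floor_cost(20_000, 1_000)  # dense structured profile

print(f"narrative:  ${narrative:.3f}/run")   # $0.156, above the price point
print(f"structured: ${structured:.3f}/run")  # $0.072, viable with margin
# If the floor exceeds the price point, the feature needs a redesign,
# not a discount on optimism.
```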
What the CFO Conversation Looks Like Now
The architecture work — tiered model pipeline, prompt compression, prefix caching, semantic caching, output verbosity matched to consumer requirements — produced meaningful cost reduction per screening while simultaneously improving output quality on harder tasks.
The conversation with the CFO is different now. The AI costs are still worth scrutinizing. But we can explain:
- Exactly what drives them
- Unit economics at different usage levels
- Where the optimization levers are
- That the cost curve scales sub-linearly with volume because caching hit rates improve with scale
- The deliberate decision to spend more on frontier model calls for high-value evaluation steps
The key takeaway: token economics is an architecture problem. Build the budget into the system from the start, and the CFO conversation becomes one you can answer with specifics instead of surprises.