AI · HR Tech · Retrospective · Engineering Leadership · LLM

Three Years at the AI Recruiting Frontier: What We Got Right and Wrong

In 2022 we shipped our first AI feature — a relevance scorer. In 2025 we're running agentic interview pipelines. The distance between those two artifacts is enormous.

The 2022 scorer was a weighted combination of keyword overlap, seniority delta, and geographic proximity. We ran it on top of our search index, reranked results, and called it AI-powered candidate discovery. The name was generous.

It produced better-ranked output than pure keyword search. It said nothing about why a candidate fit a role, didn't adapt across different role contexts, and didn't improve as it saw more data. It was a better sorter, not an evaluator.

Three years later, we're running structured voice and chat interviews at scale — evaluating candidate responses against role-specific rubrics, routing candidates through hiring stages autonomously, and handing off shortlists with evaluation summaries that recruiters actually trust. The gap between those two systems isn't iteration. It's a category change.

Here's an honest account of the calls we got right and the ones we got wrong.


Three Things We Got Right

Investing in the Candidate Data Graph Early

Before we built any AI features, we made a structural bet: invest in the data layer. We built a candidate data graph that connected profile data, engagement history, sourcing context, and behavioral signals. At the time, it felt like infrastructure work that wasn't directly delivering customer value.

It became the foundation for everything that followed. Every AI feature we built in the years after drew from that graph — the relevance scorer, the screening models, the conversation AI — all of them were better because the underlying data was connected, clean, and growing.
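To make the shape of that graph concrete, here is a minimal Python sketch of the idea. The node kinds, relation names, and query method are illustrative assumptions, not our actual schema — the point is that every downstream feature reads from one connected structure rather than from scattered tables.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """A vertex in the candidate graph: a candidate, role, event, or source."""
    node_id: str
    kind: str                      # e.g. "candidate", "role", "engagement_event"
    attrs: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed connection between two nodes."""
    src: str
    dst: str
    relation: str                  # e.g. "applied_to", "sourced_from", "replied_to"
    attrs: dict[str, Any] = field(default_factory=dict)


class CandidateGraph:
    """Minimal in-memory adjacency list; a real system would back this with a store."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def connect(self, src: str, dst: str, relation: str, **attrs: Any) -> None:
        self.edges.append(Edge(src, dst, relation, attrs))

    def neighbors(self, node_id: str, relation: str | None = None) -> list[Node]:
        """Features like the scorer or screener read the graph through queries like this."""
        return [
            self.nodes[e.dst]
            for e in self.edges
            if e.src == node_id and (relation is None or e.relation == relation)
        ]


# Example: one candidate, one role, one sourcing signal (all values hypothetical).
g = CandidateGraph()
g.add_node(Node("cand_1", "candidate", {"title": "Data Engineer", "location": "Berlin"}))
g.add_node(Node("role_1", "role", {"title": "Senior Data Engineer"}))
g.connect("cand_1", "role_1", "applied_to", source="outbound_campaign")
print([n.node_id for n in g.neighbors("cand_1", "applied_to")])  # ['role_1']
```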

AI features are only as good as their training signal, and the signal quality compounds over time when you invest in it early.

Teams that skipped the data foundation built AI features on top of fragmented, inconsistent data. The features worked in demos. They degraded in production as edge cases accumulated.

Treating AI as a Product Feature With Its Own Evaluation Discipline

The shift that separated our better AI work from our weaker work was treating evaluation as a first-class engineering practice rather than a post-hoc check.

For every AI feature, we defined what "good" looked like before we shipped it. That meant:

  1. A test set
  2. An evaluation rubric
  3. A metric we tracked over time

When a model update shipped, we ran the evaluation suite before and after. When a customer reported unexpected behavior, we had a framework for diagnosing whether it was a model regression or a data problem.

This sounds obvious. In practice, most teams don't do it. They ship an AI feature, gather anecdotal feedback, and iterate based on individual complaints. That approach produces local improvements and global drift.

The evaluation discipline forced us to be specific about what we were optimizing for, and it caught regressions that anecdotal feedback would have missed.
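As a rough illustration of what that discipline looks like in code, here is a minimal Python sketch of an evaluation harness, assuming a scoring function and a small hand-labeled test set. The case fields, tolerance, and metric names are illustrative, not our production suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One labeled example: an input plus the score reviewers agreed is correct."""
    case_id: str
    candidate_summary: str
    role_description: str
    expected_score: float          # agreed "ground truth" on a 0-1 rubric scale


def run_eval(
    score_fn: Callable[[str, str], float],
    cases: list[EvalCase],
    tolerance: float = 0.15,
) -> dict[str, float]:
    """Score every case and report the metrics we track over time."""
    hits = 0
    total_abs_err = 0.0
    for case in cases:
        predicted = score_fn(case.candidate_summary, case.role_description)
        total_abs_err += abs(predicted - case.expected_score)
        if abs(predicted - case.expected_score) <= tolerance:
            hits += 1
    return {
        "within_tolerance_rate": hits / len(cases),
        "mean_abs_error": total_abs_err / len(cases),
    }


# Before/after comparison around a model or prompt update.
def current_model(candidate: str, role: str) -> float:
    return 0.5                              # stand-in for the production scorer

def updated_model(candidate: str, role: str) -> float:
    return 0.6                              # stand-in for the candidate update

test_set = [
    EvalCase("c1", "5 yrs backend, led migrations", "Senior Backend Engineer", 0.8),
    EvalCase("c2", "Junior analyst, no Python", "Staff ML Engineer", 0.1),
]
baseline = run_eval(current_model, test_set)
candidate = run_eval(updated_model, test_set)
if candidate["within_tolerance_rate"] < baseline["within_tolerance_rate"]:
    raise SystemExit("Regression: do not ship this update.")
```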

Cost Discipline From Day One

When we started using LLM APIs in production, we measured token costs per operation from the beginning. Not as a constraint that blocked shipping, but as a signal that shaped architecture.

It changed decisions:

  • We built caching layers for inference results likely to be reused.
  • We designed prompt structures to be efficient rather than verbose.
  • We made deliberate choices about which operations warranted GPT-4-class models and which could use smaller models without quality loss.

When token costs became a real budget line item, we were already operating an efficient system rather than scrambling to retrofit one.

Teams that ignored costs early found themselves with AI features that worked but couldn't scale economically. Retrofitting cost discipline into a production system is harder than building it in.
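To show the pattern rather than our actual stack, here is a hedged Python sketch of the caching-plus-routing idea. The model names, prices, task labels, and token heuristic are made-up placeholders.

```python
import hashlib

# In-memory cache; production would use something persistent and shared.
_cache: dict[str, str] = {}

PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}   # illustrative prices


def _cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}::{prompt}".encode()).hexdigest()


def choose_model(task: str) -> str:
    """Route cheap, high-volume operations to a smaller model."""
    if task in {"summarize_profile", "extract_skills"}:
        return "small-model"
    return "large-model"          # reserve the expensive tier for evaluation-grade calls


def complete(task: str, prompt: str, call_llm) -> tuple[str, float]:
    """Return (response, estimated cost), reusing cached results where possible."""
    model = choose_model(task)
    key = _cache_key(model, prompt)
    if key in _cache:
        return _cache[key], 0.0                     # cache hit costs nothing
    response = call_llm(model=model, prompt=prompt)
    est_tokens = (len(prompt) + len(response)) / 4  # rough chars-to-tokens heuristic
    cost = est_tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    _cache[key] = response
    return response, cost


# Usage with a stubbed provider call.
def fake_llm(model: str, prompt: str) -> str:
    return f"[{model}] summary of: {prompt[:30]}"

text, cost = complete("summarize_profile", "10 years of data engineering ...", fake_llm)
print(text, f"${cost:.6f}")
```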

Three Things We Got Wrong

Siloed Evaluation Pipelines

Each AI feature we built had its own evaluation pipeline. The relevance scorer had one. The screening model had another. The conversation AI had a third. None of them talked to each other.

The consequence was that we had no cross-feature signal:

  • We couldn't see whether a candidate the relevance scorer ranked highly was also performing well through the screening pipeline.
  • We couldn't correlate conversation AI quality scores with downstream recruiter satisfaction.
  • Each feature was evaluated in isolation against its own local metric.

We left a lot of learning on the table. The most valuable signal in a recruiting AI system is the end-to-end hiring outcome — did the candidate get hired, and did the hire work out? That signal spans every feature. Building evaluation in silos meant we couldn't use it.

If we built it again, we'd design the evaluation infrastructure as shared infrastructure from the start, with a common event model and the ability to join signals across features. Siloed evaluation is one of the most costly mistakes an AI team can make.
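Here is a minimal Python sketch of what we mean by a common event model. The feature names and metric fields are hypothetical — the point is that every feature emits into one schema keyed by candidate, so cross-feature questions become simple joins.

```python
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class FeatureEvent:
    """One record in the shared event model that every AI feature emits."""
    candidate_id: str
    feature: str        # "relevance_scorer", "screening", "conversation_ai", "hiring"
    metric: str         # the feature's local metric, or the end-to-end outcome
    value: float


def join_by_candidate(events: list[FeatureEvent]) -> dict[str, dict[str, float]]:
    """Pivot events so one row per candidate spans every feature's signal."""
    rows: dict[str, dict[str, float]] = defaultdict(dict)
    for e in events:
        rows[e.candidate_id][f"{e.feature}.{e.metric}"] = e.value
    return rows


# With a shared schema, cross-feature questions become lookups on one row.
events = [
    FeatureEvent("cand_1", "relevance_scorer", "rank_score", 0.92),
    FeatureEvent("cand_1", "screening", "rubric_score", 0.40),
    FeatureEvent("cand_1", "hiring", "hired", 0.0),
]
for candidate_id, row in join_by_candidate(events).items():
    # e.g. high relevance rank but weak screening and no hire -> the scorer is over-ranking
    print(candidate_id, row)
```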

Underestimating the Trust Gap

We underinvested in explainability, and it cost us adoption.

When we shipped the screening AI, customers asked: "Why did it rate this candidate at this score?" We had an answer at the model level — here are the factors, here are the weights. That answer didn't satisfy recruiters. They needed to be able to explain to hiring managers why a candidate was or wasn't moving forward.

"The AI scored them lower" is not an acceptable explanation in a professional hiring context.

What We Had

A model-level explanation: here are the factors, here are the weights. Technically accurate, numerically precise, and completely useless for human-to-human conversations about hiring decisions.

What Recruiters Needed

A recruiter-legible narrative: specific candidate behaviors mapped to specific role requirements, phrased in terms hiring managers could act on and defend in a calibration meeting.
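To make the difference concrete, here is an illustrative Python sketch of the narrative shape recruiters needed: evidence per rubric criterion, phrased as sentences a recruiter can repeat to a hiring manager. The field names and wording are assumptions for illustration, not our production output.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One rubric criterion with the evidence behind its verdict."""
    requirement: str       # the role requirement, in the hiring manager's language
    evidence: str          # the specific candidate behavior or answer observed
    met: bool


def recruiter_narrative(candidate_name: str, results: list[CriterionResult]) -> str:
    """Turn per-criterion evidence into lines a recruiter can defend in a calibration meeting."""
    lines = []
    for r in results:
        verdict = "demonstrated" if r.met else "did not demonstrate"
        lines.append(f"- {candidate_name} {verdict} {r.requirement}: {r.evidence}")
    return "\n".join(lines)


print(recruiter_narrative("Jordan", [
    CriterionResult(
        "ownership of a production incident",
        "walked through diagnosing and rolling back a failed deploy",
        True,
    ),
    CriterionResult(
        "experience leading cross-team projects",
        "all examples given were individual-contributor work",
        False,
    ),
]))
```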

The trust gap was real and it slowed adoption. Customers who didn't understand how the AI made decisions were reluctant to rely on it. We had to retrofit explainability into features that hadn't been designed with it — which is harder and messier than building it in.

Explainability isn't a nice-to-have for AI features in high-stakes workflows. It's a requirement. Customers need to understand the AI before they'll trust it, and trust is the actual adoption gate.

Ignoring Latency Until It Became a Complaint

We shipped AI features with latency profiles we wouldn't have accepted in non-AI features, and rationalized it because "AI is expected to be slow."

That rationalization was wrong. Latency is a product quality dimension regardless of the underlying technology:

  • A screening interview that takes four seconds to respond to a candidate's answer degrades the candidate experience in measurable ways.
  • A relevance scorer that takes three seconds to rerank search results trains recruiters not to use it.

We didn't set latency budgets for AI features the same way we set them for other product features. We paid for that in adoption and in customer complaints that took us by surprise.

AI features need latency requirements defined before they're built, treated with the same seriousness as correctness and cost. "It's AI, it can be slower" is a rationalization, not an engineering decision.
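As a sketch of what "treated with the same seriousness" can look like, here is a minimal Python example of a per-operation latency budget. The operation names, budget values, and logging choice are illustrative assumptions; in production the violation signal would feed the same alerting as any other SLO.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("latency")

# Budgets in seconds, defined per feature before the feature is built.
LATENCY_BUDGETS = {"rerank_results": 1.0, "interview_turn": 2.0}


def latency_budget(operation: str):
    """Decorator that measures an AI call against its budget and flags violations."""
    budget = LATENCY_BUDGETS[operation]

    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > budget:
                    logger.warning("%s took %.2fs (budget %.2fs)", operation, elapsed, budget)
        return inner
    return wrap


@latency_budget("interview_turn")
def respond_to_answer(answer: str) -> str:
    time.sleep(0.1)                     # stand-in for the model call
    return f"Follow-up question based on: {answer[:20]}"

respond_to_answer("I led the migration to ...")
```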

What the Next Wave Looks Like

The trajectory is toward agentic systems — AI that doesn't just assist human decisions but takes autonomous action within defined policy boundaries.

The agentic interview pipelines we're running today operate with a recruiter setting the policy: the interview structure, the evaluation criteria, the routing rules. The AI executes within that policy autonomously. Human oversight happens at the policy level, not at the decision level.
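A rough Python sketch of that split — the recruiter-authored policy as data, the agent's routing constrained to it. Field names, thresholds, and stage names are hypothetical; the point is that autonomy lives inside explicitly declared boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class InterviewPolicy:
    """Everything the recruiter fixes up front; the agent may not act outside it."""
    role: str
    question_plan: list[str]
    advance_threshold: float                  # minimum rubric score to move a candidate forward
    max_autonomous_stage: str = "shortlist"   # anything beyond this requires a human
    escalation_reasons: list[str] = field(default_factory=lambda: ["ambiguous", "low_confidence"])


def route_candidate(policy: InterviewPolicy, rubric_score: float, confidence: float) -> str:
    """The agent's routing decision, constrained to the recruiter-defined policy."""
    if confidence < 0.7:
        return "escalate_to_recruiter"          # outside the policy's confidence bounds
    if rubric_score >= policy.advance_threshold:
        return policy.max_autonomous_stage      # never past the autonomy boundary
    return "reject_with_summary"


policy = InterviewPolicy(
    role="Senior Data Engineer",
    question_plan=["Tell me about a pipeline you owned end to end", "..."],
    advance_threshold=0.75,
)
print(route_candidate(policy, rubric_score=0.82, confidence=0.9))   # 'shortlist'
print(route_candidate(policy, rubric_score=0.82, confidence=0.5))   # 'escalate_to_recruiter'
```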

This is a genuinely different paradigm from AI as a decision support tool. The productivity leverage is much larger. So is the trust requirement — customers have to trust that the AI is executing their intent correctly at scale, not just producing good output in individual cases.

The teams that will win the next wave are the ones that have already built the trust infrastructure: evaluation discipline, explainability, cost and latency rigor. Those investments compound.

// key takeaway

Three years in, that's the most durable lesson: the boring infrastructure work — data quality, evaluation discipline, latency rigor — is what makes the impressive AI work possible. The flashy agentic capabilities rest entirely on foundations that aren't flashy at all.