Agent Memory · LLM · Agentic AI · Vector Databases · Engineering Leadership

The Memory Problem: Making Agents That Actually Learn Across Sessions

Every user who interacts with your AI agent more than once runs into the same wall. They've already told the agent their preferences, their constraints, their context. Now they're telling it again. And again. Because the agent doesn't remember.

Stateless agents — agents that treat every session as the first — are the norm today. There are good reasons for this. Memory introduces complexity, compliance risk, and infrastructure requirements that are genuinely hard. But statelessness also sidesteps a lot of complexity by simply not solving the problem, and I think the industry has been too comfortable with that avoidance.

In enterprise HR tech, the memory problem is concrete and consequential:

  • A recruiter has preferences about evaluation criteria, communication style, role-specific benchmarks.
  • A candidate has a history of interactions with the platform.
  • The organization has context — what this team values, how they've calibrated similar roles — accumulated over time.

None of that should be re-entered every session. And none of it should be forgotten.


The Four Types of Agent Memory

A useful taxonomy, borrowed from cognitive science and adapted for LLM systems:

Memory Type | What It Stores | Persistence | Implementation
----------- | -------------- | ----------- | --------------
Working | Current context window contents | Session only | Already in use everywhere
Episodic | Record of past interactions and events | Cross-session | Conversation history retrieval
Semantic | Learned facts and accumulated knowledge | Long-term | Vector DB + structured metadata
Procedural | Learned workflows and process patterns | Long-term | Distilled from repeated episodic experience

Working memory (in-context) is the contents of the current context window. Fast, directly accessible, and temporary — it disappears when the session ends. This is the memory type everyone is already using. The challenge: it's bounded, expensive, and non-persistent.

Episodic memory is the record of past interactions and events. What happened in prior sessions? What did the user say, what did the agent do, what were the outcomes? Episodic memory is the foundation of continuity.

Semantic memory is the store of learned facts and accumulated knowledge. Not "what happened" (episodic) but "what is true." A recruiter's preference for a certain evaluation approach is a semantic memory — a fact learned from experience that should persist.

Procedural memory is the record of learned workflows and processes. How does this team handle a certain class of role? What's the standard evaluation sequence? Procedural memories are the distillation of repeated episodic experience into generalized process knowledge.

Most production agent memory systems implement only episodic memory — storing conversation histories and retrieving relevant past interactions. Semantic and procedural memory are harder to implement but higher-value: they represent genuinely accumulated intelligence.
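The taxonomy can be made concrete as a typed record. A minimal sketch — the names and fields here are illustrative, not any particular library's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class MemoryType(Enum):
    WORKING = "working"        # current context window; never persisted
    EPISODIC = "episodic"      # what happened in past sessions
    SEMANTIC = "semantic"      # what is true (learned facts, preferences)
    PROCEDURAL = "procedural"  # how things are done (distilled workflows)

@dataclass
class MemoryRecord:
    tenant_id: str             # every record is tagged to its owning tenant
    type: MemoryType
    content: str               # natural-language statement of the memory
    confidence: float = 1.0    # lowered over time when not reconfirmed
    # creation time, used for decay and retention policies
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Tagging each record with its type matters downstream: retrieval, decay, and invalidation policies differ per type, and a flat undifferentiated memory blob makes those policies impossible to apply.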

Why Stateless Agents Fail at Enterprise Tasks

In consumer contexts, the failure is obvious — users find it frustrating to repeat themselves. In enterprise contexts, the failure is structural.

Enterprise tasks are multi-session. A recruiting workflow unfolds over days or weeks: JD refinement, sourcing criteria calibration, batch screening with iterative feedback, final evaluation, scheduling. An agent that starts fresh each session can't participate in a workflow that spans sessions.

Calibration is cumulative. The recruiter's feedback — this score was too high, this criterion doesn't capture what we care about — is valuable training signal. An agent that can't remember calibration feedback can't improve. Every batch starts from scratch.

Organizational context is implicit. Much of what makes an evaluation correct for a specific company lives in the minds of people who've been doing this work for years. It surfaces in offhand comments, corrections, and preferences during sessions. An agent that can't accumulate this implicit context is structurally limited.


The Retrieval-Augmented Memory Pattern

The architecture that works at scale: a persistent memory store that the agent retrieves from at session start and writes to at session end.

The memory store has two components:

A vector database that holds embedded representations of past interactions, learned facts, and accumulated context. When a new session starts, the agent retrieves the most relevant memories based on current context — role, organization, task type — and loads them into working memory as a structured context block.

A structured metadata store that holds explicit, typed facts: user preferences, calibration settings, organizational parameters. These are retrieved by direct lookup rather than embedding similarity. If there's a stored preference for a specific rubric format, you don't need similarity search to find it.

The retrieval step happens before the session's first inference call. The write step happens at session end (or at defined checkpoints): new facts, corrections, and preference signals are extracted and written to the appropriate store.
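Put together, the lifecycle looks roughly like this. The store is a toy stand-in that scores by token overlap where a real system would use embedding similarity against a vector DB, and the `agent` callable is assumed to return both its output and any durable facts it extracted:

```python
class ToyMemoryStore:
    """In-memory stand-in for a vector DB; scores memories by token overlap."""

    def __init__(self) -> None:
        self._records: list[tuple[str, str]] = []  # (tenant_id, text)

    def write(self, tenant_id: str, text: str) -> None:
        self._records.append((tenant_id, text))

    def retrieve(self, tenant_id: str, query: str, k: int = 5) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(
            ((len(q & set(text.lower().split())), text)
             for tid, text in self._records if tid == tenant_id),
            reverse=True,
        )
        return [text for score, text in scored[:k] if score > 0]

def run_session(store: ToyMemoryStore, tenant_id: str, task: str, agent) -> str:
    # Retrieval happens before the session's first inference call
    context = "\n".join(f"- {m}" for m in store.retrieve(tenant_id, task))
    output, new_facts = agent(context, task)
    # Write step at session end: persist what was learned
    for fact in new_facts:
        store.write(tenant_id, fact)
    return output
```

The structure is what matters: retrieve is the first thing a session does, write is the last, and everything in between runs with the retrieved context loaded into working memory.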

The Memory Corruption Problem

The hardest unsolved problem in agent memory: how do you invalidate stale memories?

Facts change. A recruiter's preference might shift after hiring manager feedback. A job description might evolve between posting rounds. An organizational calibration from six months ago might be outdated.

Stale memories that contradict current reality are worse than no memories. The agent confidently applies outdated context, and the resulting outputs are wrong in ways that are hard to detect.

Approaches that work, with tradeoffs:

TTL-based expiration. Memories have a defined time-to-live. After expiration, they're flagged as potentially stale and require confirmation. Fast to implement but crude — some memories should be permanent, some should expire quickly, and a uniform TTL doesn't capture this.

Confidence decay. Memories accumulate confidence scores based on recency and confirmation. Older, unconfirmed memories have lower confidence and are less aggressively applied. More nuanced than TTL, requires more infrastructure.

Explicit invalidation on contradiction. When a session produces information that directly contradicts a stored memory, the contradiction is detected, the old memory is flagged, and the agent surfaces the conflict for resolution. The most accurate approach but requires building contradiction detection — a non-trivial NLP problem.

We use a combination: confidence decay for general memories and explicit invalidation for high-consequence memories (calibration parameters, organizational evaluation criteria).


Privacy and Compliance: The Enterprise Non-Negotiable

For enterprise customers, the memory architecture is a compliance conversation before it's a product conversation.

Enterprise customers do not want their organizational data stored in a shared memory system accessible to other customers. Tenant isolation at the memory layer is table stakes:

  • Every memory record tagged to its owning tenant
  • Retrieval scoped to that tenant's data only
  • Separate namespace in the vector store per tenant
  • Memory operations gated behind tenant-scoped authentication
  • Configurable retention policies per tenant
  • Verified data purge on customer request

For SOC2 compliance, you also need to answer: who has access to the memory store? What is the retention policy? Can tenant data be completely purged on request? These are contract requirements, not engineering afterthoughts.

When a customer's security team asks "where does your AI system store information about our organization?" — the answer needs to be specific, accurate, and auditable. That answer has to be built into the architecture from the beginning.

Procedural Memory as a Competitive Advantage

Episodic and semantic memory make agents more contextually aware and better informed: the agent knows what happened and what is true about this customer's preferences. Procedural memory is what makes agents genuinely more capable.

Concretely: over time, an AI screening agent that accumulates procedural memories develops a distilled model of how a specific customer approaches hiring — what signals matter, what disqualifiers are common, where calibration shifts for different role types. That model improves the agent's default behavior in ways that don't require explicit teaching every session.

This is the architectural vision behind agent memory: not just persistence, but genuine organizational learning. The agent that has been serving a customer for two years is qualitatively better than the agent on day one — not because the model weights changed, but because the memory system has accumulated a rich, validated model of that customer's context.

// key takeaway

Getting to real cross-session learning requires all the pieces: retrieval-augmented memory, corruption management, tenant isolation, confidence decay. None of those are glamorous. But they're the foundation of agents that actually get better over time — which is what "intelligence" actually means in the context of a production AI system.