Reduce Token Cost for LLMs: AI Agent Memory with Valkey and Mem0
AI agent memory turns a stateless large language model (LLM) into an assistant that knows its users and delivers personalized, relevant responses. Implementing an agent memory layer can cut token cost by up to 90% and keep responses under 2 seconds (Mem0 benchmarks). Without memory, responses lose relevance, users repeat themselves, and token costs climb as raw history accumulates.
This post walks through how agent memory works and offers practical guidance: an implementation using Valkey and Mem0 (an open-source memory framework), when to use file-based context versus a store-backed memory layer, infrastructure requirements for the storage layer, and best practices.
Breaking Down AI Agent Interactions
A single turn in agentic applications consists of a message from the user and the agent response. Fulfilling that turn may require multiple steps: calling tools, querying databases, or reasoning through sub-tasks. State memory keeps track of task execution so the agent can continue from one step to the next. It records things like information collected, steps completed, and pending tasks. The state is often temporary, but in production systems it is commonly saved so work can resume after an interruption. For example, it ensures workflow continuity for tasks such as submitting an insurance claim or completing a multi-step checkout, so retries or users leaving mid-execution do not repeat the same work. It also supports multi-agent collaboration, where state acts as shared memory so one agent can work with another on the same workflow without each one having to reconstruct progress. State also supports human-in-the-loop flows, where the agent pauses for a human to provide input before continuing to the next step. You can use Valkey with frameworks such as LangGraph checkpoints for workflow-state persistence and resume primitives for this scope.
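The sketch below illustrates the idea with plain Valkey operations rather than the LangGraph checkpoint API: workflow state lives in a hash keyed by workflow ID, so a retry, a human-in-the-loop continuation, or another agent can resume from the last recorded step. It uses the valkey-py client (a redis-py-compatible fork); the key layout and field names are illustrative assumptions.

import json
import valkey  # valkey-py, a redis-py-compatible client

client = valkey.Valkey(host="localhost", port=6379)

def save_state(workflow_id: str, step: str, collected: dict) -> None:
    # Record the last completed step and the data gathered so far.
    client.hset(f"workflow:{workflow_id}", mapping={
        "current_step": step,
        "collected": json.dumps(collected),
    })

def resume_state(workflow_id: str):
    # A retry or a collaborating agent picks up from the recorded step
    # instead of repeating work that already completed.
    raw = client.hgetall(f"workflow:{workflow_id}")
    if not raw:
        return None
    return {
        "current_step": raw[b"current_step"].decode(),
        "collected": json.loads(raw[b"collected"]),
    }

save_state("claim_789", step="documents_collected", collected={"policy": "P-1042"})
print(resume_state("claim_789"))

A checkpointer integration such as LangGraph's handles this bookkeeping for you; the sketch only shows the shape of the state being persisted.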
Beyond a single turn, a session is the full sequence of messages in one conversation. Session memory captures what the agent needs from the current conversation to interpret the next turn correctly. It is typically managed through message history, trimming, or compaction. This supports use cases such as resolving implicit references, where the user says "I liked the second one better" and the agent retains enough of the prior conversation to map those references to the right items. It also supports session continuity across devices and conversation-scoped constraints, such as "Keep this under $200". You can use Valkey with frameworks such as the Strands Valkey Session Manager to persist conversation history and agent state across distributed environments. Under the hood, this is the same session-store pattern that applications have used Valkey for across web and mobile workloads, applied to conversation messages and agent state.
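As a concrete illustration of that session-store pattern (a sketch, not the Strands Session Manager API), conversation turns can be appended to a Valkey list keyed by session ID, trimmed to a recent window, and expired when the session goes idle. The key naming, 50-message window, and 24-hour idle timeout are assumptions to adjust per application.

import json
import valkey

client = valkey.Valkey(host="localhost", port=6379)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"session:{session_id}:messages"
    client.rpush(key, json.dumps({"role": role, "content": content}))
    client.ltrim(key, -50, -1)        # keep only the most recent 50 messages
    client.expire(key, 60 * 60 * 24)  # drop the session after 24h of inactivity

def load_history(session_id: str) -> list:
    key = f"session:{session_id}:messages"
    return [json.loads(m) for m in client.lrange(key, 0, -1)]

append_turn("session_456", "user", "Keep this under $200")
print(load_history("session_456"))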
When the user returns hours, days, or weeks later and starts a new conversation, the agent needs context that outlasts any single conversation. Long-term memory is persistent context the agent can reuse long after the current conversation ends. In practice, it takes three forms: semantic memory for facts and preferences, episodic memory for past interactions and outcomes, and procedural memory for instructions or learned rules about how the agent should behave. Long-term memory lets the agent carry context across sessions, such as preferred languages, shopping choices, or shipping details, so the user does not have to repeat them. You can use Valkey with frameworks such as Mem0 for memory operations including extraction, deduplication, consolidation, and scoped retrieval by user, agent, or session. Benchmarks from Mem0 show that memory can cut token costs by 90% and deliver sub-2-second responses by retrieving only the memories that matter. For a full setup and step-by-step implementation walkthrough, follow the Mem0 blog.
How to Add Memory to Agents
There are two broad ways to add memory to agents, and the choice hinges on who needs access to the memory, where it must be available, how long it should persist, and whether it must be queried dynamically. Memory is either file-based context loaded into the prompt at session start, or a store-backed memory layer that the agent reads from and writes to at runtime.
Stable, shared guidance that is known before a session begins belongs in file-based context and is typically used for coding conventions, build commands, architecture notes, review checklists, and other persistent project instructions. CLAUDE.md, Cursor Rules, and OpenAI AGENTS.md are examples of this pattern. Claude Code uses CLAUDE.md for persistent instructions read at session start, Cursor Rules provides persistent project, team, and user rules, and OpenAI Codex reads AGENTS.md before doing any work.
A store-backed memory layer is the better fit for dynamic, mutable memory learned during execution. For example, runtime-learned information like user preferences and task progress is tied to a specific scope and needs to remain available across the boundaries the agent operates in. A store-backed memory layer makes that practical by persisting runtime state, scoping it with identifiers, and supporting centralized access control, auditing, and managed deletion.
Long-running agents make persistent memory outside the prompt a hard requirement. Anthropic describes the problem directly: agents "work in discrete sessions, and each new session begins with no memory of what came before," and their workaround is an explicit progress file and git history that act as a durable memory store the next session can read (Anthropic, Effective harnesses for long-running agents). These systems operate across many context windows, and often multiple sessions or collaborating agents, so the active prompt alone cannot serve as the system's memory. This is where a store-backed memory layer like Valkey becomes useful, persisting runtime state outside the prompt and making it retrievable on demand across long-running workflows.
Memory Layers: Under the Hood
This section uses Mem0 as an example to show what happens under the hood when the application writes and retrieves memory with Valkey. Mem0 represents each extracted fact as a discrete unit it calls a "memory", and stores it in Valkey with attributes for the memory content, scope, timestamps, metadata, and embedding. The sections below walk through how the application creates an index, stores memories, and retrieves memories against this index.
Index Creation
Mem0 creates a Valkey Search index over those documents so the application can combine scope and attribute filtering with vector similarity search. The schema looks like this:
FT.CREATE {collection_name} ON HASH PREFIX 1 {prefix} SCHEMA
memory TEXT
embedding VECTOR HNSW 6 TYPE FLOAT32 DIM {dims} DISTANCE_METRIC COSINE
user_id TAG agent_id TAG run_id TAG
created_at NUMERIC updated_at NUMERIC
The memory attribute stores the extracted fact in natural language, and the embedding attribute stores its vector representation. Scope attributes user_id, agent_id, and run_id let the system narrow retrieval to the right user, agent, or session before running vector search. created_at and updated_at timestamps support filtering, updates, and lifecycle management.
Memory Storage
Writing memories is a single call with conversation messages and identity attributes, and Mem0 handles extraction, deduplication, embedding, and storage.
messages = [ {"role": "user", "content": "I prefer to work with Python"},
{"role": "assistant", "content": "Saved your language preference for Python."}]
memory.add(messages, user_id="user_123", agent_id="assistant", run_id="session_456")
Under the hood, the write path has three stages. First, the application sends the latest user message and assistant response to Mem0's memory-ingestion pipeline, which makes an LLM call to identify facts, preferences, or decisions worth storing. A single turn can yield multiple candidates, each handled independently in the next two stages. Second, for each candidate, Mem0 issues an FT.SEARCH against Valkey for similar existing memories and decides whether to add, update, delete, or ignore it.
FT.SEARCH {collection_name}
"@user_id:{user_123} @agent_id:{assistant} @run_id:{session_456} =>[KNN 5 @embedding $vec AS vector_score]"
PARAMS 2 vec <float32 bytes> DIALECT 2
Finally, for each candidate that resulted in an add or update, Mem0 writes the memory to Valkey with a single HSET, which also updates the index in place so the new or changed memory is immediately searchable.
HSET mem0:{memory_id}
memory_id {id} hash {content_hash} user_id user_123 agent_id assistant run_id session_456 metadata {...}
memory "I prefer to work with Python" embedding <float32 bytes>
created_at {ts} updated_at {ts}
Memory Retrieval
Memory retrieval starts when the application issues a natural-language query within a chosen scope.
results = memory.search(
    query="What deployment preferences has this user shared?",
    user_id="user_123", run_id="123", agent_id="shopping_assistant",
    limit=5,
)
Under the hood, the read path has three stages. First, Mem0 converts the query into a vector. Second, Mem0 executes a hybrid search on Valkey by combining scope filters such as user_id, agent_id, or run_id as tag filters along with the query embedding to return the most semantically similar memories within that scope.
FT.SEARCH {collection_name}
"@user_id:{user_123} @agent_id:{shopping_assistant} @run_id:{123} =>[KNN 5 @embedding $vec AS vector_score]"
PARAMS 2 vec <float32 bytes> DIALECT 2
Finally, the top memories are returned with scores and metadata so the application can inject them into the next prompt. This lets the system retrieve only the small working set relevant to the current step instead of replaying the full conversation history on every turn.
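The retrieved set is small enough to drop directly into the next prompt. The sketch below shows one way to do that with the results object from the search call above; it assumes the newer Mem0 return shape of {"results": [...]} with a "memory" field per item (older versions return a plain list), and the prompt template is illustrative.

def build_prompt(user_message: str, results) -> str:
    # Handle both return shapes: {"results": [...]} or a plain list of memories.
    items = results["results"] if isinstance(results, dict) else results
    memory_block = "\n".join(f"- {item['memory']}" for item in items)
    return (
        "Relevant user memories:\n"
        f"{memory_block}\n\n"
        f"User: {user_message}"
    )

prompt = build_prompt("Which region should I deploy to?", results)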
Storage Layer Considerations
This section focuses on long-term memory, which puts the sharpest pressure on the storage layer. Unlike retrieval-augmented generation (RAG), which reads from a mostly static corpus, long-term memory is a live, mutation-heavy workload. A single interaction can produce several candidates, each triggering its own scoped lookup and an add, update, delete, or no-op decision. In practice, that means multiple small reads, searches, and writes per turn. The storage layer has to support the four requirements below so the agent can continuously learn, update, and reuse context.
1. Scoped ownership and isolation
Agent memory must preserve who a memory belongs to and where it applies. In practice, that means scoping by identifiers such as user_id, agent_id, and run_id. These attributes are declared as TAG type in the Mem0 index schema, so the system can combine ownership filtering with semantic search in a single query. Alternatively, scoping can be represented using key and hash operations through namespaces such as ("memories", "{user_id}") or ("users", "{user_id}", "profile"), which provide fast retrieval and strict isolation, as seen in LangGraph's store. The choice is a tradeoff: TAG attributes keep ownership mutable and combine cleanly with vector search, while namespaced keys give stricter isolation and faster bulk scans but make re-scoping and cross-scope semantic search harder.
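For contrast with the TAG-based schema shown earlier, here is a minimal sketch of the namespaced-key approach; the key layout is an assumption for illustration, not LangGraph's actual store implementation. Isolation comes from the key prefix, and bulk operations become simple pattern scans.

import valkey

client = valkey.Valkey(host="localhost", port=6379)

# Strict isolation by key prefix: every memory for a user lives under one namespace.
client.hset("memories:user_123:mem_001", mapping={"memory": "Prefers Python"})

# Bulk inspection or cleanup is a pattern match over the namespace...
for key in client.scan_iter(match="memories:user_123:*"):
    print(key)

# ...but re-scoping means rewriting keys, and cross-scope semantic search
# needs a separate index rather than a single tag-filtered query.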
2. Frequent & fast mutation with read-after-write visibility
A single interaction can trigger multiple small memory operations, including similarity checks, updates, deletions, and inserts. The storage layer therefore needs to handle frequent small mutations at low latency, both during live memory formation and as stored memories are continuously revised over time. Slow writes force a bad tradeoff. Either the application waits for each write to commit before the next turn, which adds latency to every interaction, or it lets the next turn proceed against stale memory, which causes missed conflict resolution and duplicate memories. The risk is amplified in systems where multiple turns race to read and update the same state, such as collaborating agents or operational memory like checkpoints and cross-session work logs that require exact state.
Valkey's short update latency removes the tradeoff by making memory updates searchable in real time on the write path. In Valkey Search, the client that issued the mutation is blocked until the index updates complete, providing read-after-write visibility once the write returns. Multithreading strengthens this model under load: background worker threads process index updates, multiple parallel connections can saturate the indexing pipeline, and scaling to more vCPUs increases both ingest and query throughput.
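That behavior can be seen with a write followed immediately by a scoped search on the same connection. This is a sketch: the index name mem0_memories and the mem0: key prefix stand in for the {collection_name} and {prefix} placeholders from the schema section, and a tag-only filter keeps the example free of embeddings.

import valkey

client = valkey.Valkey(host="localhost", port=6379)

# Write a memory document; per the read-after-write model above, the call
# returns only after the index updates complete.
client.hset("mem0:demo_1", mapping={
    "memory": "Prefers blue-green deployments",
    "user_id": "user_123",
})

# The very next search on the same primary sees the committed document.
hits = client.execute_command(
    "FT.SEARCH", "mem0_memories", "@user_id:{user_123}", "DIALECT", "2"
)
print(hits[0])  # result count includes the document written above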
3. Fast selective retrieval under concurrent reads and writes
The purpose of agent memory is to avoid loading the full conversation history or user profile into the prompt on every turn. Instead, the application stores facts externally and retrieves only the subset that is relevant to the current task. That means the storage layer needs to support selective retrieval under tight latency budgets, often combining scope filters such as user, agent, or session with search and ranking over the memory corpus. Further, the storage layer must sustain high concurrent read and write traffic without letting foreground retrieval slow down, because memory reads still sit on the live execution path for many agent turns. To put numbers on it, Mem0's long-term memory architecture write-up suggests the operating point modern memory systems should target: up to 10K memories per user and sub-50 ms latency to retrieve 20 memories at the scale of millions of 1536-dimensional embeddings.
Valkey Search combines scope filters and vector retrieval in a single query and chooses the least expensive hybrid execution strategy based on filter selectivity. It runs on a multi-threaded, in-memory index that scales query and ingest throughput linearly with more vCPUs, so memory maintenance and live retrieval run concurrently without pushing reads off the inference path. Valkey Search reports single-digit millisecond latency and over 99% recall at billions of vectors.
4. Mature caching and data lifecycle controls
Memory systems should provide Time-to-Live (TTL) controls that automatically expire memories to keep them fresh, and precise deletion controls for conflict resolution, privacy, and compliance. Valkey provides mature data lifecycle controls such as per-document expiration with EXPIRE and TTL, removal of expiration with PERSIST, and support for non-blocking deletion with UNLINK. Valkey also supports hash-field expiration and introspection with commands such as HEXPIRE and HTTL, making it easier to apply different retention windows within a single memory document. Together, these primitives let teams mix short-lived session memory, persistent long-term memory, and precise cleanup without building a separate lifecycle subsystem in the application.
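A short sketch of those lifecycle primitives is below; the key names are illustrative, and HEXPIRE requires a Valkey version with hash-field expiration.

import valkey

client = valkey.Valkey(host="localhost", port=6379)

# Expire a whole session-scoped memory document after 24 hours.
client.expire("mem0:session_456:mem_001", 60 * 60 * 24)

# Promote it to long-term memory by removing the expiration.
client.persist("mem0:session_456:mem_001")

# Apply a shorter retention window to a single field inside the document.
client.execute_command(
    "HEXPIRE", "mem0:user_123:mem_002", 60 * 60, "FIELDS", 1, "metadata"
)

# Delete precisely, without blocking the server on memory reclamation.
client.unlink("mem0:user_123:mem_003")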
Open Source Storage Options: A Side-by-Side
The table below uses the following scale to compare three popular vector search solutions. Best fit means a native, first-class primitive where the requirement is the default behavior. Strong fit means production-ready but may require configuration or carry known tradeoffs. Supported means achievable but with workarounds, application-level implementation, or tradeoffs that developers must actively manage.
| Requirement | Valkey Search | Postgres + pgvector | OpenSearch |
|---|---|---|---|
| Selective scoped retrieval on the live turn | Best fit: one query can combine scope filters with vector retrieval. Valkey Search automatically chooses the least expensive hybrid-query execution strategy based on filter selectivity. (Valkey Search query syntax) | Supported, filtering is applied after the index scan with ANN indexes, so selective filters can reduce scoped recall unless you raise search effort or use iterative scans. (pgvector filtering) | Strong fit: OpenSearch provides efficient filtered k-NN, but behavior still depends on filtering mode and engine, so recall/latency tradeoffs are more configuration-sensitive. (OpenSearch k-NN filtering) |
| Resume, handoff, long-running work | Best fit: writes are acknowledged only after index updates complete, so the next search on the same primary sees the committed state, providing read after write by default with no refresh interval or per-request flag. (Valkey Search read-after-write) | Best fit: after the writer commits, READ COMMITTED ensures each subsequent statement sees rows committed before the query began, providing read-after-commit. (PostgreSQL Read Committed) | Strong fit: read-after-write is available via refresh=true or wait_for, but the default is false, so writes are not searchable until the next refresh interval. (OpenSearch refresh parameter) |
| High-throughput concurrent reads and writes | Strong fit: Valkey Search is multi-threaded for indexing and query work, and Valkey 8 improves I/O threading for high-QPS workloads. (Valkey 8 I/O threading) | Strong fit: Postgres supports non-blocking row-level concurrency, but high concurrency often requires external connection pooling. pgvector HNSW ingestion also adds vector-index maintenance overhead. (AWS pgvector HNSW blog) | Supported: OpenSearch supports bulk indexing for high-throughput writes, but under concurrent workloads, faster search visibility comes with a cost: refresh=true adds indexing overhead, while wait_for adds write latency. (OpenSearch Bulk API) |
| Expiry, deletion, retention | Best fit: built-in TTL, async delete, and attribute-level expiry make session expiry and targeted cleanup straightforward. (Valkey HEXPIRE) | Supported: PostgreSQL does not provide a built-in document TTL or EXPIRE-style primitive; lifecycle management is typically implemented with scheduled jobs such as pg_cron, while precise row-level cleanup is possible with DELETE. (pg_cron) | Supported: OpenSearch can auto-delete entire indexes on a schedule but has no built-in way to expire individual documents on a timer, so per-memory TTLs need to be implemented in the application. (OpenSearch ISM) |
Production Patterns
Anonymous-to-Authenticated Migration: Many applications start with an anonymous session. When the user logs in, the agent needs to migrate session memories from the anonymous identifier to the authenticated user ID. In Valkey, this is a straightforward per-memory update, and because indexed attributes are refreshed on each write, the migrated memories become searchable under the new user ID immediately after the update completes:
# After user authenticates, migrate anonymous memories
anonymous_memories = memory.get_all(user_id="anon_session_abc")
for mem in anonymous_memories:
    memory.update(mem["id"], data={"user_id": "authenticated_user_123"})
This pattern depends on two Valkey properties established earlier: low-latency writes, so the migration completes in the time of a single user action, and read-after-write visibility on the Search index, so the next retrieval sees the memories under the new user_id.
Concurrency and Consistency: In distributed agent systems, memory bugs often come from overlapping writes or stale reads. Two agent turns may try to update the same memory at nearly the same time, causing lost updates, or one agent may read older state while another has already written a correction. Common mitigations are to serialize memory mutations per user or scope with a short-lived lock and, for high-impact attributes, version writes so downstream agents can require a minimum memory version before acting. Valkey provides the primitives for both patterns through SET NX PX for lightweight locks and INCR for atomic version counters, while the coordination policy lives in the application.
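A sketch of both primitives is below. The lock key, 5-second timeout, and version key names are assumptions, and the retry/backoff policy stays in the application; a production lock would also release atomically (for example, via a Lua script) rather than with the get-then-delete shown here.

import uuid
import valkey

client = valkey.Valkey(host="localhost", port=6379)

def with_memory_lock(user_id: str, mutate) -> bool:
    """Serialize memory mutations for one user with a short-lived lock."""
    token = str(uuid.uuid4())
    lock_key = f"memlock:{user_id}"
    # SET NX PX: acquire only if free, auto-release after 5 seconds.
    if not client.set(lock_key, token, nx=True, px=5000):
        return False  # another turn holds the lock; caller retries or backs off
    try:
        mutate()
        # Bump the version so downstream agents can require a minimum version.
        client.incr(f"memversion:{user_id}")
        return True
    finally:
        # Release only if we still own the lock (token unchanged).
        if client.get(lock_key) == token.encode():
            client.delete(lock_key)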
Observability & Auditing: Mem0 memory events and history help explain memory-driven LLM behavior, including why a fact was added, updated, deleted, or later retrieved, while Valkey provides storage- and index-level signals. If you need a Valkey-side audit stream for memory mutations, enable notify-keyspace-events and subscribe to keyspace or keyevent Pub/Sub channels; use MONITOR only for short-lived debugging. Latency regressions break turn responsiveness, so watch memory_search_latency_p99 and memory_write_latency_p99. Memory drift breaks personalization, so watch memory counts per user or scope, mutation rates by outcome (add, update, delete, noop), and retrieval score distributions. Storage-layer issues break everything, so watch Valkey's FT.INFO for index state and INFO for server state.
Memory Index Health: For indexes using the hierarchical navigable small world (HNSW) algorithm, frequent deletes or overwrites can leave the index less efficient over time, increasing memory usage and degrading recall. In practice, monitor index health with FT.INFO and plan periodic rebuilds during low-traffic windows.
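A minimal health-check sketch is below; mem0_memories stands in for {collection_name}, and the exact statistics returned by FT.INFO vary by version, so inspect the keys your build exposes.

import valkey

client = valkey.Valkey(host="localhost", port=6379)

# FT.INFO returns a flat list of name/value pairs describing the index.
raw = client.execute_command("FT.INFO", "mem0_memories")
info = {raw[i]: raw[i + 1] for i in range(0, len(raw), 2)}

print(sorted(info.keys()))    # inspect the stats available in this build
print(info.get(b"num_docs"))  # document count, if exposed under this name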
Conclusion
Agent memory turns stateless LLM calls into persistent, context-aware interactions. The core mechanics are consistent across frameworks: extract facts from conversations, embed for semantic retrieval, and scope access by user, agent, and session. The storage layer underneath determines how these operations perform at scale, particularly under the fan-out, write visibility, and lifecycle management patterns that distinguish agent memory from single-query RAG. This post walked through how memory works with Mem0 and Valkey, covering the schema design, write and read pipelines, scoped retrieval patterns, and lifecycle controls. The same architectural patterns apply regardless of which memory framework or storage backend you choose.
Start with the configuration above, point it at your Valkey cluster, and iterate on your scoping and retention policies as your agent's memory needs evolve; the Mem0 blog linked earlier walks through the full setup step by step.