Mem0: The Paper, Explained
A beginner-friendly guide to building AI agents with persistent long-term memory. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
AI assistants like ChatGPT and Claude have a memory problem: they forget everything between conversations. Tell your assistant you're vegetarian on Monday, and by Wednesday it might suggest a steak restaurant. This paper introduces Mem0, a system that gives AI agents persistent memory across sessions.
The paper solves three problems:
- Context windows aren't enough — Even models with 128K or 200K token windows eventually overflow. More importantly, critical information (like dietary preferences) gets buried in thousands of tokens of unrelated conversation.
- Existing approaches are inefficient — You could dump the entire conversation history into the model (slow, expensive, degrades attention), or use RAG to retrieve relevant chunks (misses nuance, retrieves too much or too little).
- Memory needs structure — The paper introduces both a base system (natural language memories) and a graph-based variant that captures relationships between entities, enabling complex reasoning about who, what, when, and how things connect.
Background Concepts
Context Windows
Every language model has a context window: the maximum amount of text it can process at once, measured in tokens (roughly word-parts). GPT-4 supports 128K tokens, Claude supports 200K, and Gemini supports up to 10M. But bigger windows don't solve the memory problem; they just delay it. Once a conversation exceeds the window, older information is lost. Even within the window, the model's ability to use distant information degrades: research shows that language models pay less attention to information in the middle of long contexts (the "lost in the middle" effect), so critical facts buried among thousands of tokens of unrelated discussion may be effectively ignored.
Retrieval-Augmented Generation (RAG)
RAG (Retrieval-Augmented Generation) is a common approach: split the conversation history into chunks, convert each chunk to a numerical representation (an embedding), store the chunks in a database, and at query time retrieve the most similar chunks and feed them to the model as context. But RAG has limitations for memory:
- It retrieves raw text chunks, not distilled facts — lots of irrelevant context comes along
- Chunk boundaries are arbitrary and can split important information
- It can't update or consolidate information — if a user corrects something, both the old and new information persist
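The third limitation is easy to see in a toy sketch (keyword lookup stands in for embedding search here, and the stored strings are illustrative):

```python
# Naive RAG memory is append-only: a correction does not replace the
# original fact, so both versions come back at retrieval time.
chunks = []

def store(text):
    chunks.append(text)

def retrieve(keyword):
    # Keyword match stands in for embedding similarity here.
    return [c for c in chunks if keyword in c]

store("User is vegetarian")
store("Correction: user is no longer vegetarian, now pescatarian")

# Both the outdated fact and its correction are retrieved together:
print(retrieve("vegetarian"))
```

With no consolidation step, the model receives contradictory context and must guess which fact is current. Mem0's UPDATE/DELETE operations (described below) exist precisely to avoid this.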
Vector Embeddings
Embeddings are how Mem0 finds relevant memories. An embedding represents text as a list of numbers (a vector) that captures its meaning: texts with similar meanings produce similar vectors, so related content can be found by computing the mathematical distance between vectors. When a new question comes in, it too is converted to a vector, and the system retrieves the memories with the most similar vectors. Both RAG and Mem0 rely on this mechanism.
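A minimal sketch of similarity search, using tiny hand-picked 3-dimensional vectors in place of real learned embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Similarity = dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy memory store: text paired with a hand-made "embedding" vector.
memories = {
    "User is vegetarian":         [0.9, 0.1, 0.0],
    "User lives in Berlin":       [0.0, 0.8, 0.3],
    "User dislikes long flights": [0.1, 0.2, 0.9],
}

def most_similar(query_vec, store):
    # Return the memory whose embedding is closest to the query vector.
    return max(store, key=lambda text: cosine_similarity(query_vec, store[text]))

# A query embedding pointing in the "food preference" direction:
query = [0.85, 0.15, 0.05]
print(most_similar(query, memories))  # → "User is vegetarian"
```

In a real deployment the vectors would come from an embedding model and the search would run in a vector database, but the ranking principle is exactly this.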
Knowledge Graphs
A knowledge graph stores information as a network of entities (nodes) connected by labeled relationships (edges). Instead of storing "Alice lives in San Francisco" as a flat text string, it stores (Alice) —[lives_in]→ (San Francisco). This structure makes it easy to answer questions that require following chains of relationships, enabling reasoning across connected facts.
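The idea can be sketched with triplets and one-hop traversal (the facts below are illustrative):

```python
# A knowledge graph as a set of (subject, relation, object) triplets.
graph = {
    ("Alice", "lives_in", "San Francisco"),
    ("San Francisco", "located_in", "California"),
    ("Alice", "prefers", "vegetarian food"),
}

def neighbors(entity):
    # All facts that mention the entity, in either position.
    return {t for t in graph if entity in (t[0], t[2])}

def follow(entity, relation):
    # One-hop traversal: where does this relation lead from the entity?
    return {o for (s, r, o) in graph if s == entity and r == relation}

# Chain two hops: which state does Alice live in?
city = follow("Alice", "lives_in").pop()
print(follow(city, "located_in"))  # → {'California'}
```

Answering "which state?" required combining two facts, which is exactly the kind of chained lookup that is awkward over flat text but trivial over a graph.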
How Mem0 Works
Mem0 processes conversations incrementally — one message pair at a time. Each time a user sends a message and gets a response, Mem0 runs a two-phase pipeline: extraction then update.
Phase 1: Extraction
When a new message pair arrives (user message + assistant response), the system builds a comprehensive prompt containing:
- The new message pair
- A conversation summary that captures the overall context (updated asynchronously in the background)
- The last 10 messages for recent context
This prompt is sent to an LLM, which extracts a set of salient memories — the important facts worth remembering from this exchange.
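The prompt assembly can be sketched as follows; the exact wording and field layout are illustrative, not taken from the paper:

```python
def build_extraction_prompt(summary, recent_messages, new_pair):
    # Combine the three context sources described above into one prompt.
    recent = "\n".join(recent_messages[-10:])  # last 10 messages only
    return (
        "Conversation summary:\n" + summary + "\n\n"
        "Recent messages:\n" + recent + "\n\n"
        "New exchange:\n" + "\n".join(new_pair) + "\n\n"
        "List the salient facts worth remembering as short statements."
    )

prompt = build_extraction_prompt(
    summary="User is planning a trip to Japan.",
    recent_messages=["user: I land in Tokyo on the 3rd.",
                     "assistant: Noted, arriving on the 3rd."],
    new_pair=["user: I'm vegetarian, by the way.",
              "assistant: I'll keep that in mind for restaurants."],
)
# The prompt is then sent to an LLM, which would return memories such as
# "User is vegetarian" and "User arrives in Tokyo on the 3rd".
```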
Phase 2: Update
Each extracted memory is then compared against existing memories using semantic similarity: a measure of how close two pieces of text are in meaning, not just word overlap, computed by comparing their vector embeddings. "I'm vegetarian" and "I don't eat meat" have high semantic similarity even though they share few words. The system retrieves the 10 most similar existing memories and asks the LLM to choose one of four operations:
ADD
No matching memory exists. Create a new one.
UPDATE
An existing memory covers the same topic but the new info adds detail. Merge them.
DELETE
New information contradicts an existing memory. Remove the outdated one.
NOOP
The information already exists or isn't worth storing. Do nothing.
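The four operations above can be sketched as a dispatch over a list of memory strings. The LLM's choice is passed in directly here; in Mem0 an LLM makes that choice after seeing the candidate fact alongside the 10 most similar existing memories:

```python
def apply(store, op, new_fact=None, target=None):
    if op == "ADD":
        store.append(new_fact)                 # no match: create new memory
    elif op == "UPDATE":
        store[store.index(target)] = new_fact  # merge detail into existing memory
    elif op == "DELETE":
        store.remove(target)                   # contradicted: drop outdated memory
    elif op == "NOOP":
        pass                                   # already known: do nothing
    return store

memories = ["User is vegetarian"]
apply(memories, "ADD", "User lives in Berlin")
apply(memories, "UPDATE", "User is vegetarian and allergic to nuts",
      target="User is vegetarian")
apply(memories, "NOOP", "User lives in Berlin")
# memories is now:
# ["User is vegetarian and allergic to nuts", "User lives in Berlin"]
```

The key design point is that the store stays small and deduplicated: every incoming fact either refines, replaces, or skips past what is already there.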
Graph Memory (Mem0g)
The base Mem0 stores memories as natural language text. The graph variant adds a structured layer on top: a directed, labeled graph stored in Neo4j, a graph database that keeps data as nodes and relationships rather than tables and rows, optimized for traversing chains of connected facts ("Alice lives in SF, SF is in California, California has warm weather"). "Directed and labeled" means every connection has both a direction and a relationship type: (Alice) —[prefers]→ (vegetarian food) is directed (Alice does the preferring) and labeled ("prefers").
How It Builds the Graph
As messages arrive, an LLM first extracts the entities mentioned (people, places, preferences, events) and then derives labeled relationship triplets between them, such as (Alice) —[visited]→ (Kyoto). Each new triplet is checked against the existing graph so that contradicted relationships can be marked obsolete rather than duplicated, keeping the graph consistent as the conversation evolves.
Dual Retrieval
When answering a question, Mem0g uses two complementary retrieval strategies:
- Entity-centric: Identify entities in the query, find their nodes in the graph, and explore all connected relationships to build a relevant subgraph
- Semantic triplet: Encode the entire query as an embedding and match it against all relationship triplets, returning the most similar ones
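The two strategies can be sketched over the toy triplet set below; word overlap stands in for embedding similarity, and the facts are illustrative:

```python
graph = {
    ("Alice", "visited", "Kyoto"),
    ("Alice", "prefers", "vegetarian food"),
    ("Bob", "lives_in", "Kyoto"),
}

def entity_centric(query_entities):
    # Strategy 1: pull every triplet touching an entity named in the query.
    return {t for t in graph if t[0] in query_entities or t[2] in query_entities}

def semantic_triplet(query, top_k=2):
    # Strategy 2: rank whole triplets against the query.
    # Word overlap stands in for embedding similarity here.
    qwords = set(query.lower().split())
    def overlap(t):
        return len(qwords & set(" ".join(t).lower().split()))
    return sorted(graph, key=overlap, reverse=True)[:top_k]

query = "What food does Alice prefer?"
results = entity_centric({"Alice"}) | set(semantic_triplet(query))
```

The union of the two result sets forms the subgraph handed to the model: entity-centric retrieval guarantees coverage of named entities, while triplet matching catches relevant facts the entity pass might rank poorly.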
Evaluation Setup
The LOCOMO Benchmark
The paper evaluates on LOCOMO, a benchmark for long-term conversational memory. It contains 10 extended multi-session conversations between two people discussing daily life (~600 messages and ~26,000 tokens each), with ~200 questions per conversation. The questions test four types of memory:
- Single-hop — Finding one specific fact ("What is Alice's favorite food?")
- Multi-hop — Combining facts across sessions ("Where did Alice go after visiting the restaurant she mentioned last week?")
- Temporal — Time-dependent reasoning ("What did Alice do before moving to New York?")
- Open-domain — General knowledge integration ("What kind of activities does Alice enjoy?")
Metrics
Beyond traditional text-overlap metrics (F1, BLEU), the paper uses LLM-as-a-Judge (called "J") as the primary quality metric: a separate, capable LLM evaluates whether each generated answer is factually correct against the ground truth, rather than just checking whether it shares words with the reference. A judge can recognize that "I don't eat meat" correctly answers "Is Alice vegetarian?" even though the words differ. Each evaluation is run 10 times and averaged to account for randomness.
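A judge prompt might look like the sketch below; the wording is illustrative, not the paper's actual prompt:

```python
def build_judge_prompt(question, reference, candidate):
    # Ask a separate LLM to grade on facts, not word overlap.
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer the question correctly, judged on "
        "factual content rather than word overlap? Reply YES or NO."
    )

prompt = build_judge_prompt(
    question="Is Alice vegetarian?",
    reference="Yes, Alice is vegetarian.",
    candidate="Alice doesn't eat meat.",
)
# A capable judge LLM should answer YES here even though the candidate
# shares almost no words with the reference, which is exactly what
# word-overlap metrics like F1 and BLEU would penalize.
```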
Results
| Method | Single-Hop (J) | Multi-Hop (J) | Open-Domain (J) | Temporal (J) |
|---|---|---|---|---|
| MemGPT | — | — | — | — |
| A-Mem | 39.8 | 18.9 | 54.1 | 49.9 |
| LangMem | 62.2 | 47.9 | 71.1 | 23.4 |
| Zep | 61.7 | 41.4 | 76.6 | 49.3 |
| OpenAI Memory | 63.8 | 42.9 | 62.3 | 21.7 |
| Best RAG | Overall J ≈ 61 (per-category scores not broken out here) | | | |
| Full Context | Overall J = 72.9 (highest accuracy, but slowest) | | | |
| Mem0 | 67.1 | 51.2 | 72.9 | 55.5 |
| Mem0g | 65.7 | 47.2 | 75.7 | 58.1 |
Key takeaways:
- Mem0 wins on single-hop and multi-hop — Its natural language memories are ideal for direct fact retrieval and synthesizing across sessions
- Mem0g wins on temporal reasoning — The graph structure captures event sequences and chronological relationships that flat text misses
- Zep leads on open-domain by a narrow margin, but its memory graph consumes 600K+ tokens (vs Mem0's 7K) — 85x more storage
- OpenAI's memory struggles with time — It frequently omits timestamps from extracted memories, causing poor temporal reasoning (J = 21.7)
- Full context is most accurate overall (J = 72.9) but impractical — it processes all 26,000 tokens for every query
Latency & Efficiency
This is where Mem0 really shines for production use:
| Method | Avg Tokens | Search p95 (s) | Total p95 (s) | Overall J |
|---|---|---|---|---|
| Full Context | 26,031 | — | 17.1 | 72.9 |
| Zep | 3,911 | 0.78 | 2.93 | 66.0 |
| LangMem | 127 | 59.8 | 60.4 | 58.1 |
| Best RAG (k=2, 256-token chunks) | 512 | 0.70 | 1.91 | 61.0 |
| Mem0 | 1,764 | 0.20 | 1.44 | 66.9 |
| Mem0g | 3,616 | 0.66 | 2.59 | 68.4 |
Compared to full context processing, Mem0 delivers:
- 91% lower p95 latency (1.44s vs 17.1s)
- 93% fewer tokens (1,764 vs 26,031)
- Comparable accuracy (66.9 vs 72.9 overall J)
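These headline reductions follow directly from the latency table; a quick check of the arithmetic:

```python
# Figures taken from the latency table above.
full_latency, mem0_latency = 17.1, 1.44   # p95 seconds
full_tokens, mem0_tokens = 26_031, 1_764  # average tokens per query

latency_cut = 1 - mem0_latency / full_latency  # ≈ 0.916, the ~91% claim
token_cut = 1 - mem0_tokens / full_tokens      # ≈ 0.932, the ~93% claim

print(f"p95 latency reduction: {latency_cut:.1%}")
print(f"token reduction:       {token_cut:.1%}")
```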
Why This Paper Matters
Memory is arguably the missing piece in making AI agents truly useful over time. Today's language models are stateless by default — every conversation starts from scratch. This paper matters because it provides a practical, production-ready solution to that problem, not just a research prototype.
For AI Agent Builders
Mem0 demonstrates that you don't need to choose between quality and cost. The naive approach (dump everything into the context window) works but becomes prohibitively expensive as conversations grow. RAG is cheaper but loses nuance. Mem0 shows that a thin memory layer — extracting just the salient facts and maintaining them with ADD/UPDATE/DELETE operations — achieves near-full-context quality at a fraction of the cost. This makes long-running AI agents economically viable for production use cases like healthcare, tutoring, and customer support.
For the Research Community
The paper provides one of the most comprehensive benchmarks of memory approaches to date, comparing against six different baseline categories on a standardized benchmark. Two findings stand out:
- Graph memory helps for temporal reasoning but not everywhere — Mem0g excels when questions require tracking event sequences, but base Mem0 with natural language memories is actually better for single-hop and multi-hop retrieval. This suggests that the optimal memory representation depends on the type of reasoning required.
- Existing solutions have surprising blind spots — OpenAI's memory struggles badly with temporal questions (J = 21.7) because it drops timestamps. Zep uses 85x more tokens than Mem0 for similar quality. These findings highlight how much room for improvement exists in production memory systems.
The Bigger Picture
As AI agents move from single-turn assistants to long-term collaborators — personal tutors that track learning progress over months, healthcare companions that remember medication changes, coding assistants that learn your codebase and preferences — persistent memory becomes a core infrastructure requirement, not a nice-to-have. This paper provides both the architecture and the evidence that it works.