ReasoningBank: The Paper, Explained
A beginner-friendly guide to how AI agents can learn strategies from their own successes and failures, and use them to keep getting better. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
When you put an LLM agent (an AI system built on top of a large language model that can take actions in an environment, such as clicking links, running code, or calling APIs, over multiple steps to achieve a goal) to work on a long stream of real tasks — answering customer questions, filing bugs, shopping online — something strange happens:
It never gets better. Task 100 is approached exactly like task 1. Every mistake the agent made yesterday is fair game to repeat today. Every clever trick it discovered last week is forgotten by next Monday.
This paper asks: what if we gave the agent a proper notebook? Not a transcript of everything it ever did, but a curated set of lessons learned — short, reusable reasoning strategies — pulled from both its wins and its failures. The authors call this notebook ReasoningBank.
The paper tackles three problems at once:
- Agents don't learn from their own history. Existing memory systems either dump raw click-by-click recordings (too long and noisy to reuse) or save only successful workflows (throwing away the hard-won lessons from failures).
- Test-time scaling doesn't know about memory. When people throw more compute at a task at inference time — running the agent five times instead of once — each run is independent. The signal from comparing those runs goes unused.
- The two together should be more than the sum of their parts. Good memory should make scaled exploration smarter; scaled exploration should produce better raw material for memory.
Background Concepts
The paper assumes you already understand how modern agents work and what "test-time scaling" means. Let's unpack both.
What is an LLM agent?
An LLM agent is a language model put inside a loop: observe the environment → think → take an action → observe the result → think again. At each step (one LLM call) the model sees the current state of the world (a webpage, a code file, a terminal), writes out its reasoning, and chooses an action (click, type, run a shell command). The action changes the world, the new state comes back, and the loop continues until the task is done.
The complete log of thoughts and actions for one task is called a trajectory: every thought the agent wrote, every action it took, every response it got back. A single trajectory can easily run 20+ steps and thousands of tokens for a real web task.
What are existing agent memory systems?
People have tried to give agents memory before. The two dominant flavors:
Trajectory memory
Store entire past trajectories verbatim. When a new task comes in, retrieve the most similar past one and include it in the prompt as a demonstration.
<think> Account page loaded. User wants first purchase date. Scroll down...
<action> click('1530')
<think> Orders sorted newest first. Need oldest...
<action> click('1614')
Problem: trajectories are long, noisy, and tied to specific page IDs that won't match the next task.
Workflow memory
Abstract past trajectories into reusable "workflows" — short procedural recipes. But only from tasks that succeeded.
1. click(section_or_tab_id)
2. send_msg_to_user(extracted_info)
Problem: throws away failures. Never learns "don't do X" — only "do Y."
ReasoningBank keeps the workflow idea (abstract, not verbatim) but adds two crucial things: memory items are strategies/reasoning hints, not step-by-step procedures, and failures are a first-class source of memory.
What is test-time scaling?
Test-time scaling means throwing more compute at a problem when you're using the model, not when you're training it. Instead of training a bigger model, you ask the existing model to try the same problem five times and pick the best answer (best-of-N), to think for longer, or to critique and revise its own answer.
It's been a hot area since 2024 — it's how reasoning models like o1 and Gemini Thinking get extra performance. But most work on test-time scaling focuses on single-shot problems (math, coding). This paper asks: what's the right way to do it for agents, where each "attempt" is a long multi-step trajectory?
Two common test-time scaling flavors
Parallel scaling (best-of-N): run the model N times independently on the same input. Pick the best output, or vote.
Sequential scaling (self-refinement): run the model once, then feed its output back in and ask it to improve. Repeat.
Both are "spend more compute per query." Neither, on its own, shares anything across queries. That's where memory comes in.
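The control flow of both flavors fits in a few lines. Below is a toy sketch with stand-in functions — `run_agent` and its fake quality score are invented here purely to make the loops concrete, and the "critique" step is a placeholder for a real LLM revision call:

```python
def run_agent(task: str, attempt: int) -> dict:
    # Toy rollout: a fake answer plus a fake quality score that varies
    # by attempt number. Stands in for a full stochastic LLM agent run.
    return {"answer": f"candidate-{attempt}", "quality": (attempt * 7) % 10}

def best_of_n(task: str, n: int) -> dict:
    # Parallel scaling: n independent runs on the same input, keep the best.
    candidates = [run_agent(task, attempt=i) for i in range(n)]
    return max(candidates, key=lambda c: c["quality"])

def self_refine(task: str, k: int) -> dict:
    # Sequential scaling: one run, then k critique-and-revise passes.
    result = run_agent(task, attempt=0)
    for _ in range(k):
        # Placeholder for an LLM critique-and-revise step.
        result = {"answer": result["answer"] + "+rev", "quality": result["quality"] + 1}
    return result
```

Note that neither loop writes anything down between tasks — each call to `best_of_n` or `self_refine` starts from scratch, which is exactly the gap memory fills.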
LLM-as-a-Judge
A core ingredient of ReasoningBank is LLM-as-a-Judge: asking a language model to evaluate an output. After the agent finishes a task, another LLM reads the trajectory and labels it success or failure ("Given this task and this trajectory, did the agent succeed?"). The judge doesn't know the ground-truth answer; it just reads the transcript and makes its best call. This is how the system gets a supervision signal in the wild, where no answer key exists.
Embedding-based retrieval
To find "memories relevant to the current task," ReasoningBank converts each memory item's title and description into an embedding: a vector of numbers that captures the meaning of a piece of text, such that similar texts have similar vectors. It then embeds the new task query and grabs the top-k memories whose vectors are most similar to it. This is the same trick that powers semantic search.
The benchmarks
The paper evaluates on three environments that simulate real agent work:
WebArena
A suite of real websites (Shopping, Admin, Gitlab, Reddit) wired up for agents to use. Tasks like "find my first order" or "close issue #42." Measures task success rate.
Mind2Web
A broader web benchmark testing generalization across websites and domains the agent has never seen during memory building.
SWE-Bench-Verified
Real GitHub issues in real Python repos. The agent has to read the codebase, find the bug, and write a patch that passes the project's test suite.
How ReasoningBank Works
The whole system is a closed loop with four parts: the memory bank itself, and three operations that move information in and out of it.
The memory schema
Every memory item has three fields:
- Title — a short handle like "Prioritize user account sections for personal data." One line. Used for retrieval.
- Description — a single sentence summarizing when this strategy applies. Also used for retrieval.
- Content — the actual advice: a short chunk of reasoning or decision rules the agent should apply. Injected into the prompt when retrieved.
The items are meant to be read by both humans and the agent. A human reviewing them should be able to say "yes, that's a reasonable heuristic." A model reading them should be able to act on them.
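For concreteness, here is a minimal sketch of the schema in Python. The class and method names are mine, not the paper's; the three fields and their roles are from the paper:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One ReasoningBank entry: title + description + content."""
    title: str        # one-line handle, used for retrieval
    description: str  # one sentence on when this applies, also used for retrieval
    content: str      # the actual advice, injected into the prompt when retrieved

    def retrieval_text(self) -> str:
        # Only the title and description participate in similarity search.
        return f"{self.title}. {self.description}"

    def prompt_text(self) -> str:
        # What gets prepended to the agent's instructions on retrieval.
        return f"### {self.title}\n{self.content}"
```

Splitting retrieval text from prompt text matters: the short title + description keeps embeddings focused, while the longer content only costs tokens when the item is actually retrieved.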
Retrieval: match by meaning, not keywords
When a new task arrives, the system embeds the task query and compares it against the stored embeddings of every memory item's title + description. The top-k closest items (typically k = 3-5) are prepended to the agent's system instruction.
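A toy version of that retrieval step, using a bag-of-words count vector in place of a real embedding model (all names here are illustrative, and a production system would use a neural embedder and a vector index):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call a neural embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, bank: list, k: int = 3) -> list:
    # Rank memory items by similarity between the task query and each
    # item's title + description, then keep the top k.
    q = embed(query)
    ranked = sorted(
        bank,
        key=lambda m: cosine(q, embed(m["title"] + " " + m["description"])),
        reverse=True,
    )
    return ranked[:k]
```

The shape is the important part: embed once per memory item at write time, embed once per query at read time, compare, inject the winners.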
Extraction: different prompts for wins and losses
After the agent finishes, the LLM-as-a-Judge labels the trajectory success or failure, with no ground-truth supervision (the paper measures this judge at about 73% accuracy on WebArena-Shopping). Then:
- Successful trajectories are fed to a "distill what worked" prompt that extracts the reasoning patterns that led to success.
- Failed trajectories are fed to a different prompt that extracts preventative lessons — "when you see pattern X, don't do Y."
Each trajectory can produce multiple memory items. A 20-step web task that solved a shopping query might yield three separate lessons: one about navigation, one about filtering, one about cross-referencing results.
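In code, the routing logic is tiny — the real work lives in the two prompts. A sketch, where the prompt wording and function names are mine and `judge`/`extractor` stand in for LLM calls:

```python
SUCCESS_PROMPT = "Distill the reasoning patterns that made this trajectory succeed."
FAILURE_PROMPT = "Extract preventative lessons: what should be avoided next time?"

def extract_memories(trajectory: str, judge, extractor) -> list:
    """Judge the trajectory, then route it to the matching extraction prompt.

    `judge` and `extractor` are stand-ins for LLM-backed calls; any
    callables with these shapes work. One trajectory may yield several
    memory items.
    """
    verdict = judge(trajectory)  # "success" or "failure"; no ground truth consulted
    prompt = SUCCESS_PROMPT if verdict == "success" else FAILURE_PROMPT
    return extractor(prompt, trajectory)
```

The asymmetry is the point: failures get their own prompt instead of being filtered out, which is what lets the bank accumulate "don't do X" lessons.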
Consolidation: just add it to the pile
The paper deliberately keeps this simple — new items are appended to the bank with no clever deduplication or rewriting. Part of the argument is that even this naive setup works well; fancier consolidation is future work.
Why doesn't the bank explode in size?
After running all 684 WebArena tasks, the bank typically holds a few thousand memory items — each one short. The bottleneck is retrieval quality, not storage: you only ever inject the top few items into the prompt, so even a large bank costs little at inference time.
The authors note that smarter consolidation (merging duplicates, pruning stale items) is an obvious extension but was kept out of this paper to keep the contribution focused.
Robust to a noisy judge
A natural worry: if the LLM-as-a-Judge mislabels trajectories, bad lessons could get written into memory. The authors simulate judges of varying accuracy and find that ReasoningBank's performance stays nearly flat from 100% down to about 70% judge accuracy. In the actual experiments, the judge is about 72.7% accurate — noisy but workable.
MaTTS: Memory-Aware Test-Time Scaling
ReasoningBank is already a win on its own. But the authors push further: what if, instead of tackling each task in one shot, the agent tackles it multiple times and uses the contrast between those attempts to write better memory?
They call this Memory-aware Test-Time Scaling (MaTTS), and it comes in two flavors.
Parallel scaling: run k trajectories, then self-contrast
Take the task and run the agent k independent times (each run guided by retrieved memory). You'll typically get a mix of successes and failures. Now ask an LLM to look at all k trajectories together and extract the strategies that consistently appear in the successes but not in the failures. That's self-contrast: identifying patterns by comparing multiple attempts at the same task, a move borrowed from contrastive learning, where signal comes from comparing positive and negative examples.
This is qualitatively different from standard "best-of-N" scaling. Best-of-N picks one trajectory and throws the rest away. Self-contrast uses all k trajectories to produce a stronger single piece of memory that can help future tasks.
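A sketch of the parallel loop, with plain callables standing in for the LLM-backed pieces (all names here are illustrative):

```python
def matts_parallel(task, run_agent, judge, contrast, k=5):
    # k independent, memory-guided rollouts on the same task.
    trajectories = [run_agent(task, attempt=i) for i in range(k)]
    # Partition by the judge's verdict (no ground truth involved).
    wins = [t for t in trajectories if judge(t) == "success"]
    losses = [t for t in trajectories if judge(t) != "success"]
    # Self-contrast: one call sees every attempt at once and extracts
    # strategies that show up in the wins but not in the losses.
    return contrast(task, wins, losses)
```

Compare this with `best_of_n`-style selection: here nothing is thrown away, and the output is memory for future tasks rather than a single winning answer.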
Sequential scaling: refine one trajectory k times
Take the task, run it once, then critique the result and revise. Do that k times in sequence. This is classic self-refinement, originally proposed by Madaan et al. (2023) for single-turn tasks; MaTTS applies it to multi-step agent trajectories. MaTTS keeps not just the final version but all the intermediate critiques and revisions, which contain useful reasoning signals that don't survive into the final trajectory.
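The sequential variant is the same idea turned sideways: one rollout, then k revision rounds, keeping every intermediate version for memory extraction. A sketch with stand-in callables (names are mine):

```python
def matts_sequential(task, run_agent, critique, k=3):
    # One initial rollout, then k critique-and-revise rounds.
    trajectory = run_agent(task)
    history = [trajectory]
    for _ in range(k):
        trajectory = critique(task, trajectory)  # revised trajectory
        history.append(trajectory)
    # Unlike plain self-refinement, every intermediate version is kept:
    # the critiques themselves feed memory extraction.
    return trajectory, history
```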
The synergy: memory and scaling amplify each other
The numbers: on WebArena-Shopping with Gemini-2.5-flash and k=5 parallel trajectories, Pass@1 (pick one of the k trajectories at random and ask whether it succeeded, measuring average-case quality) rises from 49.7% (ReasoningBank alone) to 53.0% (with MaTTS). Best-of-5 (the fraction of tasks with at least one successful trajectory among the 5, measuring the ceiling scaling could reach with a perfect way to pick the winner) rises from 49.7% to 55.1%. For comparison, the memory-free baseline goes from 39.0% to just 42.2% under the same scaling.
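The two metrics are easy to pin down precisely. Given a boolean success flag for each of the k trajectories per task, they reduce to:

```python
def pass_at_1(results):
    # Average success over all individual trajectories: average-case quality.
    flat = [ok for task_runs in results for ok in task_runs]
    return sum(flat) / len(flat)

def best_of_k(results):
    # Fraction of tasks with at least one success: the ceiling scaling
    # could reach given a perfect way to pick the winner.
    return sum(any(task_runs) for task_runs in results) / len(results)
```

Best-of-k is always at least pass@1; the gap between them is the headroom that a better trajectory-selection method could capture.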
Emergent Behaviors
Because memory items accumulate and get re-retrieved as the agent keeps working, the same "strategy" can get rewritten over time. The authors find that these strategies don't just stay the same — they mature.
The progression they observe looks like this:
- Procedural. Early memory items are literal: "click Next Page, Page X, or Load More links." Concrete action rules.
- Self-reflection. Next stage: the agent learns to catch its own mistakes. "Before clicking, re-check the element's current identifier."
- Adaptive check. It starts combining tools: "Before scanning, leverage any available search or filter functionality; ensure completeness."
- Compositional. Eventually: "Regularly cross-reference the current view with the task requirements. If data doesn't align, reassess available options such as search filters." This is task-agnostic high-level reasoning.
This is reminiscent of what happens in reinforcement learning (a training paradigm where a model learns by trying actions, seeing what happens, and adjusting its weights to favor actions that lead to reward) as policies mature from reactive to strategic. Here, though, it happens purely at test time, with no gradient updates: the model's weights stay frozen, and all the learning happens in the external memory bank. The agent is literally rewriting its own strategy sheet as it gains experience.
Failures are where the compositional strategies come from
The authors run an ablation where they strip failed trajectories out of memory construction. Performance drops from 49.7 to 46.5 on WebArena-Shopping. More strikingly, the competitor baselines (Synapse, AWM) either stall or regress when you try to add failure data to them — they were designed for success-only extraction and can't absorb the new signal. ReasoningBank's extraction prompts were built from the start to turn failures into preventative lessons, so the added data actually helps.
Experiments
Setup
The paper tests three model backbones: Gemini-2.5-flash, Gemini-2.5-pro, and Claude-3.7-sonnet. Agents use ReAct-style prompting (alternating explicit "Thought:" and "Action:" outputs at every step, which keeps the reasoning visible and easy to parse) with default decoding settings.
Baselines compared against:
- No Memory — bare agent, no memory module at all.
- Synapse — trajectory-based memory. Stores raw past trajectories and retrieves similar ones as demonstrations.
- AWM (Agent Workflow Memory) — workflow-based memory. Abstracts successful trajectories into reusable procedures.
Metrics: Success Rate (SR, higher is better) and average Steps (lower is better).
Results
WebArena overall
ReasoningBank beats every baseline on every model on every subset. A sampling on Gemini-2.5-flash:
| Method | Shopping SR | Admin SR | Gitlab SR | Reddit SR | Overall SR | Avg Steps |
|---|---|---|---|---|---|---|
| No Memory | 39.0 | 44.5 | 33.9 | 55.7 | 40.5 | 9.7 |
| Synapse | 40.6 | 45.1 | 35.6 | 59.4 | 42.1 | 9.2 |
| AWM | 44.4 | 46.7 | 37.2 | 62.3 | 44.1 | 9.0 |
| ReasoningBank | 49.7 | 51.1 | 40.6 | 67.0 | 48.8 | 8.3 |
| + MaTTS | 53.0 | 53.8 | 42.8 | 70.8 | 51.8 | 7.9 |
ReasoningBank alone lifts overall success rate from 40.5 → 48.8 (a relative gain of ~20%) and cuts steps from 9.7 → 8.3. Adding MaTTS pushes success to 51.8. These gains hold across Gemini-2.5-pro and Claude-3.7-sonnet as well, with smaller but consistent improvements.
Generalization: cross-domain tasks
On Mind2Web's cross-domain split — where test websites are entirely different from training — ReasoningBank shows its biggest relative gains. AWM's workflow memory, which was designed around narrow procedural patterns, sometimes hurts in this setting because the procedures don't transfer. Strategy-level memory transfers better.
Software engineering
On SWE-Bench-Verified (real GitHub issues), ReasoningBank lifts resolution rate from 34.2% to 38.8% on Gemini-2.5-flash and from 54.0% to 57.4% on Gemini-2.5-pro. Average steps drop from 30.3 to 27.5 on flash — real money saved on long-horizon code tasks.
Efficiency: fewer steps, especially on the tasks you solve
A desirable memory system cuts down exploration on problems the agent is going to solve, not just truncates doomed trajectories. The paper shows ReasoningBank does the former: the step reduction is much larger on successful trajectories (up to 2.1 fewer steps, 27% relative) than on failures (0.2-1.4 fewer). The agent isn't giving up earlier — it's finding the right path faster.
Why This Paper Matters
For builders and practitioners
If you're deploying an agent in production — a customer support agent, a web automation agent, an ops-assist agent — ReasoningBank is the rare "free" win: no fine-tuning (no GPUs, no training data, no weight updates), no new infrastructure, just a prompt-and-retrieval layer through which your agent gets better over time. The authors report 20% relative success-rate gains and 16% fewer steps, which for an LLM-call-bottlenecked product translates directly into better outcomes and lower inference cost. The memory bank is also human-readable, so engineers can audit and prune what's in it.
The robustness to an imperfect judge is important practically — you don't need a gold-standard evaluator, just a good-enough LLM reviewing its own work.
For the research community
Two findings are load-bearing. First: failures are useful training signal at test time, not just at train time. The dominant paradigm has been to filter for successful demonstrations; ReasoningBank shows that with the right extraction prompts, a failed trajectory teaches you things a successful one can't.
Second: memory and test-time scaling are complements, not substitutes. Before this paper, they were studied in parallel silos. The synergy curve — where weak memory hurts scaling and strong memory amplifies it — suggests the two should be co-designed going forward. The authors frame "memory-driven experience scaling" as a new scaling axis alongside model size and test-time compute.
The bigger picture
This is a paper about where additional capability comes from once you've frozen the weights. For years the answer was "train a bigger model" or "give it more training data." More recently it's been "let it think longer at test time." ReasoningBank points at a third lever: let the agent accumulate a library of lessons from its own experience, and reuse them. As agents move from one-shot tools to long-running systems handling thousands of tasks, the ability to self-improve without retraining stops being a nice-to-have and starts becoming the thing that separates an agent you can deploy from one you can't.
It also suggests a concrete path toward what survey papers have been calling "self-evolving agents." The emergent progression from procedural to compositional strategies — without any gradient updates — is a small proof-of-concept that lifelong-learning agents don't necessarily require lifelong training.
Glossary
LLM agent — A language model wrapped in a loop: observe → think → act → repeat. Can use tools, click buttons, run commands.
Trajectory — The full log of one task: every thought, action, and observation, start to finish.
ReasoningBank — A growing collection of structured memory items (title + description + content) distilled from past successes and failures.
Memory item — One entry in the bank: a short reusable reasoning strategy or heuristic, not a raw trajectory.
Test-time scaling — Spending more compute at inference time (not training time) to get better results. Examples: best-of-N, self-refinement.
MaTTS — Memory-aware Test-Time Scaling. TTS combined with ReasoningBank: scaled exploration produces contrastive signal that feeds back into memory.
Parallel scaling — Run the agent k independent times on the same task. In MaTTS, all k trajectories feed self-contrast.
Sequential scaling — Run the agent once, then refine its output k times. In MaTTS, intermediate critiques also feed memory.
Self-contrast — Identifying strategies by comparing multiple attempts at the same task: what successful runs did differently from failed ones.
Self-refinement — A technique where the model critiques its own output and produces an improved version. Repeated iteratively.
LLM-as-a-Judge — Using a language model to evaluate outputs; here, to label trajectories success or failure without ground truth.
Embedding — A numerical fingerprint of a piece of text that enables semantic similarity search. Similar meanings → similar vectors.
ReAct — A prompting style for agents: alternate "Thought:" and "Action:" outputs at every step. Makes reasoning visible.
Pass@1 / Best-of-N — Two evaluation metrics: Pass@1 picks one trajectory at random; Best-of-N picks the best of N trajectories.
Success Rate — The fraction of tasks where the agent completes the goal. The primary effectiveness metric.
WebArena / Mind2Web / SWE-Bench-Verified — Three agent benchmarks: WebArena is a self-hosted suite of real websites, Mind2Web tests cross-domain web generalization, SWE-Bench-Verified is real GitHub issues.
Synapse — A prior memory baseline that stores raw past trajectories and retrieves similar ones.
AWM (Agent Workflow Memory) — A prior memory baseline that abstracts successful trajectories into reusable workflows/procedures.