ReasoningBank: The Paper, Explained
A beginner-friendly guide to how AI agents can learn strategies from their own successes and failures, and use them to keep getting better. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
When you put an LLM agent (an AI system built on top of a large language model that can take actions in an environment, such as clicking links, running code, or calling APIs, over multiple steps to achieve a goal) to work on a long stream of real tasks — answering customer questions, filing bugs, shopping online — something strange happens:
It never gets better. Task 100 is approached exactly like task 1. Every mistake the agent made yesterday is fair game to repeat today. Every clever trick it discovered last week is forgotten by next Monday.
This paper asks: what if we gave the agent a proper notebook? Not a transcript of everything it ever did, but a curated set of lessons learned — short, reusable reasoning strategies — pulled from both its wins and its failures. The authors call this notebook ReasoningBank.
The paper tackles three problems at once:
- Agents don't learn from their own history. Existing memory systems either dump raw click-by-click recordings (too long and noisy to reuse) or save only successful workflows (throwing away the hard-won lessons from failures).
- Test-time scaling doesn't know about memory. When people throw more compute at a task at inference time — running the agent five times instead of once — each run is independent. The signal from comparing those runs goes unused.
- The two together should be more than the sum of their parts. Good memory should make scaled exploration smarter; scaled exploration should produce better raw material for memory.
Background Concepts
The paper assumes you already understand how modern agents work and what "test-time scaling" means. Let's unpack both.
What is an LLM agent?
An LLM agent is a language model put inside a loop: observe the environment → think → take an action → observe the result → think again. At each step (one LLM call) the model sees the current state of the world (a webpage, a code file, a terminal), writes out its reasoning, and chooses an action (click, type, run a shell command). The action changes the world, the new state comes back, and the loop continues until the task is done.
The complete log of thoughts and actions for one task is called a trajectory: every thought the agent wrote, every action it took, every response it got back. A single trajectory can easily run 20+ steps and thousands of tokens for a real web task.
What are existing agent memory systems?
People have tried to give agents memory before. The two dominant flavors:
Trajectory memory
Store entire past trajectories verbatim. When a new task comes in, retrieve the most similar past one and include it in the prompt as a demonstration.
<think> Account page loaded. User wants first purchase date. Scroll down...
<action> click('1530')
<think> Orders sorted newest first. Need oldest...
<action> click('1614')
Problem: trajectories are long, noisy, and tied to specific page IDs that won't match the next task.
Workflow memory
Abstract past trajectories into reusable "workflows" — short procedural recipes. But only from tasks that succeeded.
1. click(section_or_tab_id)
2. send_msg_to_user(extracted_info)
Problem: throws away failures. Never learns "don't do X" — only "do Y."
ReasoningBank keeps the workflow idea (abstract, not verbatim) but adds two crucial things: memory items are strategies/reasoning hints, not step-by-step procedures, and failures are a first-class source of memory.
What is test-time scaling?
Test-time scaling means throwing more compute at a problem when you're using the model, not when you're training it. Instead of training a bigger model, you ask the existing model to try the same problem five times and pick the best answer (best-of-N), to think for longer, or to critique and revise its own answer.
It's been a hot area since 2024 — it's how reasoning models like o1 and Gemini Thinking get extra performance. But most work on test-time scaling focuses on single-shot problems (math, coding). This paper asks: what's the right way to do it for agents, where each "attempt" is a long multi-step trajectory?
Two common test-time scaling flavors
Parallel scaling (best-of-N): run the model N times independently on the same input. Pick the best output, or vote.
Sequential scaling (self-refinement): run the model once, then feed its output back in and ask it to improve. Repeat.
Both are "spend more compute per query." Neither, on its own, shares anything across queries. That's where memory comes in.
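The control flow of both flavors fits in a few lines. Below is a toy sketch with stand-in functions — `run_agent` and its fake quality score are invented here purely to make the loops concrete, and the "critique" step is a placeholder for a real LLM revision call:

```python
def run_agent(task: str, attempt: int) -> dict:
    # Toy rollout: a fake answer plus a fake quality score that varies
    # by attempt number. Stands in for a full stochastic LLM agent run.
    return {"answer": f"candidate-{attempt}", "quality": (attempt * 7) % 10}

def best_of_n(task: str, n: int) -> dict:
    # Parallel scaling: n independent runs on the same input, keep the best.
    candidates = [run_agent(task, attempt=i) for i in range(n)]
    return max(candidates, key=lambda c: c["quality"])

def self_refine(task: str, k: int) -> dict:
    # Sequential scaling: one run, then k critique-and-revise passes.
    result = run_agent(task, attempt=0)
    for _ in range(k):
        # Placeholder for an LLM critique-and-revise step.
        result = {"answer": result["answer"] + "+rev", "quality": result["quality"] + 1}
    return result
```

Note that neither loop writes anything down between tasks — each call to `best_of_n` or `self_refine` starts from scratch, which is exactly the gap memory fills.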
LLM-as-a-Judge
A core ingredient of ReasoningBank is LLM-as-a-Judge: asking a language model to evaluate an output. After the agent finishes a task, another LLM reads the trajectory and labels it success or failure ("Given this task and this trajectory, did the agent succeed?"). The judge doesn't know the ground-truth answer; it just reads the transcript and makes its best call. This is how the system gets a supervision signal in the wild, where no answer key exists.
Embedding-based retrieval
To find "memories relevant to the current task," ReasoningBank converts each memory item's title and description into an embedding: a vector of numbers that captures the meaning of a piece of text, such that similar texts have similar vectors. It then embeds the new task query and grabs the top-k memories whose vectors are most similar to it. This is the same trick that powers semantic search.
The benchmarks
The paper evaluates on three environments that simulate real agent work:
WebArena
A suite of real websites (Shopping, Admin, Gitlab, Reddit) wired up for agents to use. Tasks like "find my first order" or "close issue #42." Measures task success rate.
Mind2Web
A broader web benchmark testing generalization across websites and domains the agent has never seen during memory building.
SWE-Bench-Verified
Real GitHub issues in real Python repos. The agent has to read the codebase, find the bug, and write a patch that passes the project's test suite.
How ReasoningBank Works
The whole system is a closed loop with four parts: the memory bank itself, and three operations that move information in and out of it.
The memory schema
Every memory item has three fields:
- Title — a short handle like "Prioritize user account sections for personal data." One line. Used for retrieval.
- Description — a single sentence summarizing when this strategy applies. Also used for retrieval.
- Content — the actual advice: a short chunk of reasoning or decision rules the agent should apply. Injected into the prompt when retrieved.
The items are meant to be read by both humans and the agent. A human reviewing them should be able to say "yes, that's a reasonable heuristic." A model reading them should be able to act on them.
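For concreteness, here is a minimal sketch of the schema in Python. The class and method names are mine, not the paper's; the three fields and their roles are from the paper:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One ReasoningBank entry: title + description + content."""
    title: str        # one-line handle, used for retrieval
    description: str  # one sentence on when this applies, also used for retrieval
    content: str      # the actual advice, injected into the prompt when retrieved

    def retrieval_text(self) -> str:
        # Only the title and description participate in similarity search.
        return f"{self.title}. {self.description}"

    def prompt_text(self) -> str:
        # What gets prepended to the agent's instructions on retrieval.
        return f"### {self.title}\n{self.content}"
```

Splitting retrieval text from prompt text matters: the short title + description keeps embeddings focused, while the longer content only costs tokens when the item is actually retrieved.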
Retrieval: match by meaning, not keywords
When a new task arrives, the system embeds the task query and compares it against the stored embeddings of every memory item's title + description. The top-k closest items (typically k = 3-5) are prepended to the agent's system instruction.
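A toy version of that retrieval step, using a bag-of-words count vector in place of a real embedding model (all names here are illustrative, and a production system would use a neural embedder and a vector index):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call a neural embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, bank: list, k: int = 3) -> list:
    # Rank memory items by similarity between the task query and each
    # item's title + description, then keep the top k.
    q = embed(query)
    ranked = sorted(
        bank,
        key=lambda m: cosine(q, embed(m["title"] + " " + m["description"])),
        reverse=True,
    )
    return ranked[:k]
```

The shape is the important part: embed once per memory item at write time, embed once per query at read time, compare, inject the winners.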
Extraction: different prompts for wins and losses
After the agent finishes, the LLM-as-a-Judge labels the trajectory success or failure, with no ground-truth supervision (the paper measures this judge at about 73% accuracy on WebArena-Shopping). Then:
- Successful trajectories are fed to a "distill what worked" prompt that extracts the reasoning patterns that led to success.
- Failed trajectories are fed to a different prompt that extracts preventative lessons — "when you see pattern X, don't do Y."
Each trajectory can produce multiple memory items. A 20-step web task that solved a shopping query might yield three separate lessons: one about navigation, one about filtering, one about cross-referencing results.
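In code, the routing logic is tiny — the real work lives in the two prompts. A sketch, where the prompt wording and function names are mine and `judge`/`extractor` stand in for LLM calls:

```python
SUCCESS_PROMPT = "Distill the reasoning patterns that made this trajectory succeed."
FAILURE_PROMPT = "Extract preventative lessons: what should be avoided next time?"

def extract_memories(trajectory: str, judge, extractor) -> list:
    """Judge the trajectory, then route it to the matching extraction prompt.

    `judge` and `extractor` are stand-ins for LLM-backed calls; any
    callables with these shapes work. One trajectory may yield several
    memory items.
    """
    verdict = judge(trajectory)  # "success" or "failure"; no ground truth consulted
    prompt = SUCCESS_PROMPT if verdict == "success" else FAILURE_PROMPT
    return extractor(prompt, trajectory)
```

The asymmetry is the point: failures get their own prompt instead of being filtered out, which is what lets the bank accumulate "don't do X" lessons.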
Consolidation: just add it to the pile
The paper deliberately keeps this simple — new items are appended to the bank with no clever deduplication or rewriting. Part of the argument is that even this naive setup works well; fancier consolidation is future work.
Why doesn't the bank explode in size?
After running all 684 WebArena tasks, the bank typically holds a few thousand memory items — each one short. The bottleneck is retrieval quality, not storage: you only ever inject the top few items into the prompt, so even a large bank costs little at inference time.
The authors note that smarter consolidation (merging duplicates, pruning stale items) is an obvious extension but was kept out of this paper to keep the contribution focused.
Robust to a noisy judge
A natural worry: if the LLM-as-a-Judge mislabels trajectories, bad lessons could get written into memory. The authors simulate judges of varying accuracy and find that ReasoningBank's performance stays nearly flat from 100% down to about 70% judge accuracy. In the actual experiments, the judge is about 72.7% accurate — noisy but workable.
MaTTS: Memory-Aware Test-Time Scaling
ReasoningBank is already a win on its own. But the authors push further: what if, instead of tackling each task in one shot, the agent tackles it multiple times and uses the contrast between those attempts to write better memory?
They call this Memory-aware Test-Time Scaling (MaTTS), and it comes in two flavors.
Parallel scaling: run k trajectories, then self-contrast
Take the task and run the agent k independent times (each run guided by retrieved memory). You'll typically get a mix of successes and failures. Now ask an LLM to look at all k trajectories together and extract the strategies that consistently appear in the successes but not in the failures. That's self-contrast: identifying patterns by comparing multiple attempts at the same task, a move borrowed from contrastive learning, where signal comes from comparing positive and negative examples.
This is qualitatively different from standard "best-of-N" scaling. Best-of-N picks one trajectory and throws the rest away. Self-contrast uses all k trajectories to produce a stronger single piece of memory that can help future tasks.
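A sketch of the parallel loop, with plain callables standing in for the LLM-backed pieces (all names here are illustrative):

```python
def matts_parallel(task, run_agent, judge, contrast, k=5):
    # k independent, memory-guided rollouts on the same task.
    trajectories = [run_agent(task, attempt=i) for i in range(k)]
    # Partition by the judge's verdict (no ground truth involved).
    wins = [t for t in trajectories if judge(t) == "success"]
    losses = [t for t in trajectories if judge(t) != "success"]
    # Self-contrast: one call sees every attempt at once and extracts
    # strategies that show up in the wins but not in the losses.
    return contrast(task, wins, losses)
```

Compare this with `best_of_n`-style selection: here nothing is thrown away, and the output is memory for future tasks rather than a single winning answer.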
Sequential scaling: refine one trajectory k times
Take the task, run it once, then critique the result and revise. Do that k times in sequence. This is classic self-refinement, originally proposed by Madaan et al. (2023) for single-turn tasks; MaTTS applies it to multi-step agent trajectories. MaTTS keeps not just the final version but all the intermediate critiques and revisions, which contain useful reasoning signals that don't survive into the final trajectory.
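The sequential variant is the same idea turned sideways: one rollout, then k revision rounds, keeping every intermediate version for memory extraction. A sketch with stand-in callables (names are mine):

```python
def matts_sequential(task, run_agent, critique, k=3):
    # One initial rollout, then k critique-and-revise rounds.
    trajectory = run_agent(task)
    history = [trajectory]
    for _ in range(k):
        trajectory = critique(task, trajectory)  # revised trajectory
        history.append(trajectory)
    # Unlike plain self-refinement, every intermediate version is kept:
    # the critiques themselves feed memory extraction.
    return trajectory, history
```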
The synergy: memory and scaling amplify each other
The numbers: on WebArena-Shopping with Gemini-2.5-flash and k=5 parallel trajectories, Pass@1 (pick one of the k trajectories at random and ask whether it succeeded, measuring average-case quality) rises from 49.7% (ReasoningBank alone) to 53.0% (with MaTTS). Best-of-5 (the fraction of tasks with at least one successful trajectory among the 5, measuring the ceiling scaling could reach with a perfect way to pick the winner) rises from 49.7% to 55.1%. For comparison, the memory-free baseline goes from 39.0% to just 42.2% under the same scaling.
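The two metrics are easy to pin down precisely. Given a boolean success flag for each of the k trajectories per task, they reduce to:

```python
def pass_at_1(results):
    # Average success over all individual trajectories: average-case quality.
    flat = [ok for task_runs in results for ok in task_runs]
    return sum(flat) / len(flat)

def best_of_k(results):
    # Fraction of tasks with at least one success: the ceiling scaling
    # could reach given a perfect way to pick the winner.
    return sum(any(task_runs) for task_runs in results) / len(results)
```

Best-of-k is always at least pass@1; the gap between them is the headroom that a better trajectory-selection method could capture.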
Emergent Behaviors
Because memory items accumulate and get re-retrieved as the agent keeps working, the same "strategy" can get rewritten over time. The authors find that these strategies don't just stay the same — they mature.
The progression they observe looks like this:
- Procedural. Early memory items are literal: "click Next Page, Page X, or Load More links." Concrete action rules.
- Self-reflection. Next stage: the agent learns to catch its own mistakes. "Before clicking, re-check the element's current identifier."
- Adaptive check. It starts combining tools: "Before scanning, leverage any available search or filter functionality; ensure completeness."
- Compositional. Eventually: "Regularly cross-reference the current view with the task requirements. If data doesn't align, reassess available options such as search filters." This is task-agnostic high-level reasoning.
This is reminiscent of what happens in reinforcement learning (a training paradigm where a model learns by trying actions, seeing what happens, and adjusting its weights to favor actions that lead to reward) as policies mature from reactive to strategic. Here, though, it happens purely at test time, with no gradient updates: the model's weights stay frozen, and all the learning happens in the external memory bank. The agent is literally rewriting its own strategy sheet as it gains experience.
Failures are where the compositional strategies come from
The authors run an ablation where they strip failed trajectories out of memory construction. Performance drops from 49.7 to 46.5 on WebArena-Shopping. More strikingly, the competitor baselines (Synapse, AWM) either stall or regress when you try to add failure data to them — they were designed for success-only extraction and can't absorb the new signal. ReasoningBank's extraction prompts were built from the start to turn failures into preventative lessons, so the added data actually helps.
Experiments
Setup
The paper tests three model backbones: Gemini-2.5-flash, Gemini-2.5-pro, and Claude-3.7-sonnet. Agents use ReAct-style prompting (alternating explicit "Thought:" and "Action:" outputs at every step, which keeps the reasoning visible and easy to parse) with default decoding settings.
Baselines compared against:
- No Memory — bare agent, no memory module at all.
- Synapse — trajectory-based memory. Stores raw past trajectories and retrieves similar ones as demonstrations.
- AWM (Agent Workflow Memory) — workflow-based memory. Abstracts successful trajectories into reusable procedures.
Metrics: Success Rate (SR, higher is better) and average Steps (lower is better).
Results
WebArena overall
ReasoningBank beats every baseline on every model on every subset. A sampling on Gemini-2.5-flash:
| Method | Shopping SR | Admin SR | Gitlab SR | Reddit SR | Overall SR | Avg Steps |
|---|---|---|---|---|---|---|
| No Memory | 39.0 | 44.5 | 33.9 | 55.7 | 40.5 | 9.7 |
| Synapse | 40.6 | 45.1 | 35.6 | 59.4 | 42.1 | 9.2 |
| AWM | 44.4 | 46.7 | 37.2 | 62.3 | 44.1 | 9.0 |
| ReasoningBank | 49.7 | 51.1 | 40.6 | 67.0 | 48.8 | 8.3 |
| + MaTTS | 53.0 | 53.8 | 42.8 | 70.8 | 51.8 | 7.9 |
ReasoningBank alone lifts overall success rate from 40.5 → 48.8 (a relative gain of ~20%) and cuts steps from 9.7 → 8.3. Adding MaTTS pushes success to 51.8. These gains hold across Gemini-2.5-pro and Claude-3.7-sonnet as well, with smaller but consistent improvements.
Generalization: cross-domain tasks
On Mind2Web's cross-domain split — where test websites are entirely different from training — ReasoningBank shows its biggest relative gains. AWM's workflow memory, which was designed around narrow procedural patterns, sometimes hurts in this setting because the procedures don't transfer. Strategy-level memory transfers better.
Software engineering
On SWE-Bench-Verified (real GitHub issues), ReasoningBank lifts resolution rate from 34.2% to 38.8% on Gemini-2.5-flash and from 54.0% to 57.4% on Gemini-2.5-pro. Average steps drop from 30.3 to 27.5 on flash — real money saved on long-horizon code tasks.
Efficiency: fewer steps, especially on the tasks you solve
A desirable memory system cuts down exploration on problems the agent is going to solve, not just truncates doomed trajectories. The paper shows ReasoningBank does the former: the step reduction is much larger on successful trajectories (up to 2.1 fewer steps, 27% relative) than on failures (0.2-1.4 fewer). The agent isn't giving up earlier — it's finding the right path faster.
Why This Paper Matters
For builders and practitioners
If you're deploying an agent in production — a customer support agent, a web automation agent, an ops-assist agent — ReasoningBank is the rare "free" win: no fine-tuning (no GPUs, no training data, no weight updates), no new infrastructure, just a prompt-and-retrieval layer through which your agent gets better over time. The authors report 20% relative success-rate gains and 16% fewer steps, which for an LLM-call-bottlenecked product translates directly into better outcomes and lower inference cost. The memory bank is also human-readable, so engineers can audit and prune what's in it.
The robustness to an imperfect judge is important practically — you don't need a gold-standard evaluator, just a good-enough LLM reviewing its own work.
For the research community
Two findings are load-bearing. First: failures are useful training signal at test time, not just at train time. The dominant paradigm has been to filter for successful demonstrations; ReasoningBank shows that with the right extraction prompts, a failed trajectory teaches you things a successful one can't.
Second: memory and test-time scaling are complements, not substitutes. Before this paper, they were studied in parallel silos. The synergy curve — where weak memory hurts scaling and strong memory amplifies it — suggests the two should be co-designed going forward. The authors frame "memory-driven experience scaling" as a new scaling axis alongside model size and test-time compute.
The bigger picture
This is a paper about where additional capability comes from once you've frozen the weights. For years the answer was "train a bigger model" or "give it more training data." More recently it's been "let it think longer at test time." ReasoningBank points at a third lever: let the agent accumulate a library of lessons from its own experience, and reuse them. As agents move from one-shot tools to long-running systems handling thousands of tasks, the ability to self-improve without retraining stops being a nice-to-have and starts becoming the thing that separates an agent you can deploy from one you can't.
It also suggests a concrete path toward what survey papers have been calling "self-evolving agents." The emergent progression from procedural to compositional strategies — without any gradient updates — is a small proof-of-concept that lifelong-learning agents don't necessarily require lifelong training.
Glossary
LLM agent — A language model wrapped in a loop: observe → think → act → repeat. Can use tools, click buttons, run commands.
Trajectory — The full log of one task: every thought, action, and observation, start to finish.
ReasoningBank — A growing collection of structured memory items (title + description + content) distilled from past successes and failures.
Memory item — One entry in the bank: a short reusable reasoning strategy or heuristic, not a raw trajectory.
Test-time scaling — Spending more compute at inference time (not training time) to get better results. Examples: best-of-N, self-refinement.
MaTTS — Memory-aware Test-Time Scaling. TTS combined with ReasoningBank: scaled exploration produces contrastive signal that feeds back into memory.
Parallel scaling — Run the agent k independent times on the same task. In MaTTS, all k trajectories feed self-contrast.
Sequential scaling — Run the agent once, then refine its output k times. In MaTTS, intermediate critiques also feed memory.
Self-contrast — Identifying strategies by comparing multiple attempts at the same task: what successful runs did differently from failed ones.
Self-refinement — A technique where the model critiques its own output and produces an improved version. Repeated iteratively.
LLM-as-a-Judge — Using a language model to evaluate outputs; here, to label trajectories success or failure without ground truth.
Embedding — A numerical fingerprint of a piece of text that enables semantic similarity search. Similar meanings → similar vectors.
ReAct — A prompting style for agents: alternate "Thought:" and "Action:" outputs at every step. Makes reasoning visible.
Pass@1 / Best-of-N — Two evaluation metrics: Pass@1 picks one trajectory at random; Best-of-N picks the best of N trajectories.
Success Rate — The fraction of tasks where the agent completes the goal. The primary effectiveness metric.
WebArena / Mind2Web / SWE-Bench-Verified — Three agent benchmarks: WebArena is a self-hosted suite of real websites, Mind2Web tests cross-domain web generalization, SWE-Bench-Verified is real GitHub issues.
Synapse — A prior memory baseline that stores raw past trajectories and retrieves similar ones.
AWM (Agent Workflow Memory) — A prior memory baseline that abstracts successful trajectories into reusable workflows/procedures.