RADAR: The Paper, Explained
A beginner-friendly guide to how Meta automatically reviews and ships low-risk code changes — so human reviewers aren't drowned by the flood of AI-generated code. Every technical term is defined. Every concept is grounded in analogy.
The Big Picture
Before any code change goes live at a big software company, another engineer usually reads it first. This is code review (the practice of having one or more engineers read a proposed code change before it goes into the shared codebase): a second pair of eyes that catches bugs, keeps quality high, and makes sure someone other than the author understands the change. It works well, as long as there are enough reviewers to keep up.
That assumption just broke. Over the past year at Meta, agentic AI coding tools (AI systems that don't just suggest code but autonomously carry out multi-step coding tasks, such as writing, testing, and refactoring code, then submitting it for review like a human engineer would) caused the amount of code per change to jump 105.9% and the number of changes each developer produces per month to rise 51%, with more than 80% of that growth coming from AI. Humans can't read faster to compensate, so the share of changes reviewed within 24 hours started falling, and review queues piled up into the thousands.
RADAR is Meta's answer. Instead of asking humans to review everything, it identifies the changes that are low-risk and routine — and reviews and ships those automatically, reserving human attention for the changes that actually need judgment. The paper tackles several problems at once:
- Code is being created faster than it can be reviewed. AI agents produce a flood of changes, but human reviewer capacity is fixed. Something has to absorb the overflow without lowering the bar on safety.
- Not every change deserves the same scrutiny. A mechanical rename or a dead-code deletion is very different from a change to payment logic — but a traditional review queue treats them the same, so trivial changes compete with dangerous ones for a reviewer's time.
- Automating review is risky if done bluntly. Approve too much and you ship bugs; approve too little and you haven't helped. The hard part is finding a dial you can turn safely, and proving it stays safe at massive scale.
Background Concepts
This paper assumes you know how code ships at a large company. Let's build that up from scratch — the vocabulary here is from software engineering, not deep learning.
Diffs, landing, and the review workflow
At Meta, a single proposed code change is called a diff, the equivalent of a "pull request" or "changeset" elsewhere. It bundles the edited lines, a description, and a test plan. An engineer writes a diff, describes it, and requests review. Reviewers leave comments; the author revises. When everyone's satisfied, the diff lands: it gets merged into the shared codebase and ships. (The opposite outcome is abandoning the diff.) All of this runs through Phabricator, Meta's code-review and continuous-integration tool, which records each diff's metadata, every reviewer action, timestamps, and the diff's current state. That record is the raw data this paper studies.
Reverts and Production Incidents: how we measure "did it go wrong?"
Two signals tell you a landed change caused trouble. A revert is when a landed diff gets undone because something broke; a high revert rate signals that changes are landing before they're truly ready. A Production Incident (PI) is more serious: a logged safety or reliability event in the live product attributable to a code change, such as an outage, a serious bug, or a regression. PIs are the outcome RADAR is most careful to avoid. Together these are the paper's guardrail metrics, the numbers that must not get worse when you automate. RADAR's headline safety claim is that its changes are reverted 1/13 as often and cause PIs 1/50 as often as non-RADAR changes.
Codemods and RACER: where the bot-written code comes from
A lot of code changes are mechanical and repetitive. A codemod is a scripted, automated transformation applied across the codebase, like "rename this function everywhere" or "migrate every caller to the new API" (an API, or Application Programming Interface, is the defined set of functions and rules one piece of software exposes for others to call). A codemod can be fully deterministic (a fixed rule) or LLM-generated. RACER (Risk-Aware Code Editing and Refactoring) is Meta's GenAI tool that delegates bigger but still well-defined jobs, such as removing dead code, reducing complexity, fixing lint, and migrating frameworks, to an AI agent. The agent generates diffs from pre-written instruction templates called runbooks. Each runbook encodes the instructions, constraints, and context for one specific type of code change, and has its own track record and safety settings. These bot-authored diffs are exactly the high-volume, low-judgment changes RADAR is designed to absorb.
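To make "rename this function everywhere" concrete, here is a minimal sketch of a deterministic codemod. It is a toy under stated assumptions: the names `fetch_user` and `load_user` are made up for illustration, and a production codemod tool would typically transform a parsed syntax tree rather than match text with a regular expression.

```python
import re


def rename_function(source: str, old: str, new: str) -> str:
    """Rename every whole-word occurrence of `old` to `new`.

    Toy illustration only: a regex cannot tell code from strings or
    comments, which is why real codemods usually work on syntax trees.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function is used as the replacement so `new` is inserted literally.
    return pattern.sub(lambda _match: new, source)


# Hypothetical input: one call site to rename, one similar name to leave alone.
before = "user = fetch_user(uid)\nfetch_user_cache.clear()\n"
after = rename_function(before, "fetch_user", "load_user")
print(after)
```

The word-boundary match is what keeps `fetch_user_cache` untouched: the rule is fixed, so the same input always produces the same output, which is what makes this kind of change low-judgment.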
The two AI components: a risk score and an AI reviewer
RADAR leans on two pieces of machine learning, and it's worth separating them clearly because they do different jobs:
Diff Risk Score (DRS)
A machine-learning model (a program that learns patterns from past data) that looks at a diff's metadata (who, what files, how big, historical patterns) and outputs a single number: how likely this change is to cause a production incident. It was trained on Meta's history of diffs and which ones caused incidents, so it can predict risk for a new diff. It doesn't read the code's meaning; it predicts risk from the shape of the change.
Automated Code Review (ACR)
An LLM-based reviewer that actually reads the changed code and classifies it against known "safe" and "risky" patterns, then makes an accept-or-reject call. An LLM (Large Language Model) is the kind of AI, like GPT or Llama, trained on huge amounts of text and code. It supplies the semantic understanding that a metadata score can't.
What kinds of patterns does the AI reviewer treat as "safe" vs "risky"?
Safe signals (can be auto-accepted): refactoring with no behavior change, dead-code removal, defensive checks, added logging, pure formatting, documentation/comment updates, import cleanup, added tests, and static resource updates.
Risk signals (instant disqualification): high estimated review effort (a complexity score of 4+), large structural changes, identified bugs or logic errors, performance risks, and security vulnerabilities like exposed secrets, SQL injection, or authentication bypasses. (SQL injection is a classic attack where malicious input is crafted so a program accidentally runs it as a database command, letting an attacker read or destroy data. It is a canonical "never ship this without a human looking" red flag.)
To auto-accept, the AI reviewer must be highly confident — at least 8 out of 10 — and every change in the diff must fall into a safe category. Any single risk signal kills the auto-accept and sends the diff to a human.
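The auto-accept rule just described can be sketched as a small predicate. This is an illustration, not the paper's implementation: the `AcrVerdict` structure, the category labels, and the function name are all hypothetical. Only the 8-out-of-10 confidence bar, the "every change must be safe" condition, and the "any risk signal disqualifies" condition come from the text.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the safe categories listed in the text.
SAFE_CATEGORIES = {
    "refactor_no_behavior_change", "dead_code_removal", "defensive_check",
    "added_logging", "formatting", "docs_or_comments", "import_cleanup",
    "added_tests", "static_resource_update",
}
MIN_CONFIDENCE = 8  # out of 10, as stated in the text


@dataclass
class AcrVerdict:
    """Hypothetical shape of the AI reviewer's output for one diff."""
    confidence: int                      # 0 to 10
    change_categories: list[str]         # one label per change in the diff
    risk_signals: list[str] = field(default_factory=list)


def can_auto_accept(verdict: AcrVerdict) -> bool:
    # Any single risk signal sends the diff to a human.
    if verdict.risk_signals:
        return False
    # The reviewer must be highly confident.
    if verdict.confidence < MIN_CONFIDENCE:
        return False
    # Every change in the diff must fall into a safe category.
    return bool(verdict.change_categories) and all(
        category in SAFE_CATEGORIES for category in verdict.change_categories
    )
```

Note that the conditions are combined with "and": a diff that is 90% formatting and 10% something unclassified still goes to a human.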
How RADAR Works
RADAR is deliberately not a single clever model. It's a funnel: a sequence of independent safety layers, where a diff must clear every one to be auto-approved. If it fails any layer, it falls out of the funnel and goes to a human — the default-safe outcome. This design is what lets Meta roll it out gradually and tune it without ever removing the safety net.
The three core checks, in order, are static rules, the Diff Risk Score, and the AI reviewer.
Notice the layers are complementary. Static rules are cheap but blind to meaning. The risk score reads patterns but not intent. The AI reviewer reads intent but is the most expensive. Stacking them means a cheap layer can reject a diff before an expensive layer ever runs — and a smart layer can rescue a diff the dumb layer would wrongly fear (a huge-but-trivial mechanical refactor) or catch a subtle bug the pattern-matchers miss.
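A minimal sketch of the "must clear every layer" rule, ordered cheap to expensive so that a failing diff exits before the costly check runs. The `Diff` fields, the layer functions, and the 25th-percentile default are hypothetical stand-ins; the real layers are far richer than a single boolean each.

```python
from typing import Callable, NamedTuple


class Diff(NamedTuple):
    """Hypothetical, pre-computed inputs for each layer."""
    touches_blocked_path: bool   # input to the static rules
    risk_percentile: float       # 0 to 100, lower is safer
    acr_accepts: bool            # the AI reviewer's accept/reject call


def static_rules(diff: Diff) -> bool:
    return not diff.touches_blocked_path


def risk_score(diff: Diff, threshold: float = 25.0) -> bool:
    # Illustrative threshold; the actual value is configurable.
    return diff.risk_percentile <= threshold


def ai_reviewer(diff: Diff) -> bool:
    return diff.acr_accepts


# Ordered cheapest first, most expensive last.
LAYERS: list[tuple[str, Callable[[Diff], bool]]] = [
    ("static_rules", static_rules),
    ("risk_score", risk_score),
    ("ai_reviewer", ai_reviewer),
]


def run_funnel(diff: Diff) -> str:
    for name, check in LAYERS:
        if not check(diff):
            # Default-safe outcome: fall out of the funnel to a human.
            return f"human_review:{name}"
    return "auto_approved"


print(run_funnel(Diff(False, 10.0, True)))   # auto_approved
print(run_funnel(Diff(False, 80.0, True)))   # human_review:risk_score
```

Because the loop returns on the first failure, the AI reviewer is never invoked for a diff that the static rules or the risk score already rejected.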
Why land after a delay instead of instantly?
Even when RADAR approves a bot diff, it doesn't merge it the very same second. Approved diffs sit in a landing queue for a configurable window (e.g. 24 hours) during which a human can still step in and reject. It's a final safety valve: automation moves things forward by default, but a human always retains the ability to pull the cord before it's irreversible.
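The delay can be pictured as a three-way decision evaluated whenever the queue is checked. This is a sketch under assumptions: `QueuedDiff` and `landing_decision` are hypothetical names, and 24 hours is just the example window mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Configurable window; 24 hours is the example given in the text.
LANDING_DELAY = timedelta(hours=24)


@dataclass
class QueuedDiff:
    """Hypothetical record for a diff waiting in the landing queue."""
    approved_at: datetime
    human_rejected: bool = False


def landing_decision(diff: QueuedDiff, now: datetime) -> str:
    # A human veto always wins, whether or not the window has elapsed.
    if diff.human_rejected:
        return "rejected"
    if now - diff.approved_at >= LANDING_DELAY:
        return "land"
    return "wait"


approved = datetime(2024, 1, 1, 12, 0)
print(landing_decision(QueuedDiff(approved), approved + timedelta(hours=1)))   # wait
print(landing_decision(QueuedDiff(approved), approved + timedelta(hours=24)))  # land
```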
Who Qualifies: the Eligibility Model
Before a diff ever reaches the three-layer funnel, RADAR first asks: is this diff even allowed to be considered for automation? This is the eligibility model, and the authors single it out as a key contribution. The trick is that different sources of code get different rules, because they carry different risk.
The first split is human vs. bot. Bot diffs are then split further by how they were generated.
The four gates a RACER runbook must pass to be eligible
- Risk-history heuristics: over a 60-day lookback, the runbook must have zero production incidents, a low revert rate, a low human-rejection rate, and enough landed diffs to be statistically meaningful.
- Per-runbook daily limits: caps on how many diffs a single runbook can auto-land per day (default conservative; trusted runbooks can be raised up to 2,000/day) so no one source floods the commit queue.
- Per-runbook risk thresholds: trusted ("allowlisted") runbooks get the relaxed P50 risk threshold; everyone else uses the stricter P20.
- Denylist: runbooks that have caused incidents or touch sensitive areas are permanently blocked. Anything with "test" in its name is also excluded, to avoid silently automating changes to test infrastructure.
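The four gates above can be sketched as one eligibility check plus a threshold lookup. Several numbers here are assumptions made only so the example runs: the text says "low" revert and rejection rates, "enough" landed diffs, and a "conservative" default cap without giving values, so `MAX_REVERT_RATE`, `MAX_REJECTION_RATE`, `MIN_LANDED_DIFFS`, and `DEFAULT_DAILY_CAP` are hypothetical. The zero-incident rule, the 2,000/day ceiling, the P50/P20 split, and the "test" name exclusion come from the text.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the text gives no numbers for these.
MAX_REVERT_RATE = 0.01
MAX_REJECTION_RATE = 0.05
MIN_LANDED_DIFFS = 100
DEFAULT_DAILY_CAP = 50
# Stated in the text: trusted runbooks can be raised up to 2,000/day.
MAX_DAILY_CAP = 2000


@dataclass
class RunbookStats:
    """Hypothetical per-runbook record; history fields cover the 60-day lookback."""
    name: str
    production_incidents: int
    revert_rate: float
    human_rejection_rate: float
    landed_diffs: int
    landed_today: int
    allowlisted: bool = False
    denylisted: bool = False
    daily_cap: int = DEFAULT_DAILY_CAP


def risk_threshold_percentile(runbook: RunbookStats) -> int:
    # Allowlisted runbooks get the relaxed P50; everyone else gets P20.
    return 50 if runbook.allowlisted else 20


def is_eligible(runbook: RunbookStats) -> bool:
    # Gate: denylist, including anything with "test" in its name.
    if runbook.denylisted or "test" in runbook.name.lower():
        return False
    # Gate: risk-history heuristics over the lookback window.
    history_ok = (
        runbook.production_incidents == 0
        and runbook.revert_rate <= MAX_REVERT_RATE
        and runbook.human_rejection_rate <= MAX_REJECTION_RATE
        and runbook.landed_diffs >= MIN_LANDED_DIFFS
    )
    if not history_ok:
        return False
    # Gate: per-runbook daily limit.
    return runbook.landed_today < min(runbook.daily_cap, MAX_DAILY_CAP)
```

The risk threshold is returned separately rather than checked here because it applies to each individual diff, while the other three gates apply to the runbook as a source.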
Why organizations can set their own rules
Eligibility thresholds aren't global. Each org configures its own risk appetite through a policy config controlling its risk thresholds, whether deferred review is on, and which automation sources are allowed. One org might effectively bypass the risk-score gate for bot diffs and rely on the AI reviewer alone; another keeps stricter defaults. This is how one system serves a whole company of teams with very different tolerances.
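A per-org policy could look something like the sketch below. The field names, org names, and values are hypothetical; the text only says that a policy controls risk thresholds, whether deferred review is on, and which automation sources are allowed. Setting the bot threshold to the 100th percentile is one way to model an org that "effectively bypasses" the risk-score gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgPolicy:
    """Hypothetical per-org policy config."""
    bot_risk_percentile: int          # PX cutoff applied to bot diffs
    deferred_review: bool             # ship now, human review after landing
    allowed_sources: frozenset[str]   # automation sources this org accepts


# Two made-up orgs with different risk appetites.
POLICIES = {
    "org_relaxed": OrgPolicy(100, True, frozenset({"codemod", "racer"})),
    "org_strict": OrgPolicy(20, False, frozenset({"codemod"})),
}


def passes_policy(org: str, source: str, diff_percentile: float) -> bool:
    policy = POLICIES[org]
    if source not in policy.allowed_sources:
        return False
    return diff_percentile <= policy.bot_risk_percentile


# The same diff gets different answers under different org policies.
print(passes_policy("org_relaxed", "racer", 70.0))  # True
print(passes_policy("org_strict", "racer", 70.0))   # False
```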
The Two Pipelines
Eligibility routes a diff into one of two pipelines depending on whether a bot or a human wrote it. They share the same building blocks but answer slightly different questions.
Bot diffs: the ACE pipeline
For bot-authored diffs, RADAR applies a policy called ACE (AI Commit Eligibility): if a diff clears all three funnel layers (static checks, risk score, and the AI reviewer), it can land with no human review at all. This is the aggressive case, and it's allowed precisely because bot diffs come from vetted, monitored sources with daily volume caps and the same delay-before-landing safety valve.
Human diffs: Verification, then Approval
Human diffs get a more cautious two-step treatment, and the human author always stays in control — they can ship with RADAR, wait for a human, or send the diff back to "needs review" at any time.
The distinction is subtle but important. Verification says "this looks safe enough to ship now and have a human glance at it later." Approval says "this is safe enough that no human needs to look at all." The two-step design lets RADAR be helpful (ship now, review later) for a broad set of diffs while reserving full automation for an even safer subset.
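The two steps combine into three possible outcomes for a human-authored diff. The sketch below uses hypothetical names (`Outcome`, `human_diff_outcome`, `author_opts_in`) and treats each step as an already-computed boolean; it only illustrates how the steps nest.

```python
from enum import Enum


class Outcome(Enum):
    HUMAN_REVIEW_FIRST = "wait for a human reviewer"
    SHIP_NOW_REVIEW_LATER = "ship now, human review deferred"
    SHIP_NO_REVIEW = "ship, no human review needed"


def human_diff_outcome(
    verified: bool, approved: bool, author_opts_in: bool = True
) -> Outcome:
    # The author stays in control and can always choose normal review.
    if not author_opts_in or not verified:
        return Outcome.HUMAN_REVIEW_FIRST
    # Approval is the stricter step and only matters once verified.
    if approved:
        return Outcome.SHIP_NO_REVIEW
    return Outcome.SHIP_NOW_REVIEW_LATER


print(human_diff_outcome(verified=True, approved=False).value)
```

Approval without verification is never reached: the stricter outcome is a subset of the looser one.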
Tuning the Risk Dial
The single most tunable knob in RADAR is the Diff Risk Score threshold. It's expressed as a percentile cutoff, written PX: P5 means only the safest 5% of diffs qualify, P50 means the safest 50% do. A lower threshold is more conservative (fewer diffs automated, less risk); a higher one widens the net.
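A percentile threshold can be sketched as a rank against historical scores. Everything here is illustrative: the synthetic history of scores 1 through 100 and the function names are assumptions, and how the real system computes its percentiles is not described in this guide.

```python
def percentile_rank(score: float, historical_scores: list[float]) -> float:
    """Percent of historical diffs whose risk score is at or below `score`."""
    at_or_below = sum(1 for s in historical_scores if s <= score)
    return 100.0 * at_or_below / len(historical_scores)


def qualifies(
    score: float, historical_scores: list[float], threshold_percentile: float
) -> bool:
    # P5 admits only the safest 5%; P50 admits the safest 50%.
    return percentile_rank(score, historical_scores) <= threshold_percentile


# Synthetic history: 100 diffs with risk scores 1, 2, ..., 100.
history = list(range(1, 101))

print(qualifies(30, history, 25))  # False
print(qualifies(30, history, 50))  # True
```

The same diff (score 30) is rejected at P25 and accepted at P50, which is exactly the effect of turning the dial.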
The key experiment in the paper (their second research question) relaxed this threshold from P25 to P50. The worry with widening any safety net is that risk grows with it. But RADAR's result was the opposite of alarming: the auto-approve rate climbed to 60.31% — a big jump in how much work the system absorbs — while the revert and incident rates didn't degrade.
Results
The paper is organized around three research questions: can it work at scale (feasibility), can you tune it safely (calibration), and does it actually save time (impact)?
Feasibility: it runs at real scale, safely
| Safety guardrail | RADAR vs. non-RADAR diffs |
|---|---|
| Revert rate | 1⁄13 (about 92% lower) |
| Production Incident (PI) rate | 1⁄50 (about 98% lower) |
Those safety gaps are statistically significant, and a telling detail: when domain experts manually reviewed the production incidents, none were judged to be ones a human reviewer would have caught. In other words, the rare failures weren't failures of automation replacing a human's catch — they were the kind of thing that would have slipped past a person too.
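The "about 92% lower" and "about 98% lower" figures in the table follow directly from the ratios. A quick check of the arithmetic (the function name is just for illustration):

```python
def percent_lower(ratio: float) -> float:
    """Convert 'happens `ratio` times as often' into 'percent lower'."""
    return (1.0 - ratio) * 100.0


print(round(percent_lower(1 / 13), 1))  # 92.3
print(round(percent_lower(1 / 50), 1))  # 98.0
```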
Why is RADAR's revert/incident rate so much lower — isn't that suspicious?
It's not magic, and the authors are careful to call it an association, not proof of causation. RADAR only automates diffs that are already low-risk by construction — that's the whole point of the funnel. So you'd expect them to revert and break less than the general population of diffs, which includes every gnarly, high-stakes change humans handle. The result confirms the funnel is selecting what it's supposed to, rather than claiming automation makes any given change safer.
Impact: it removes the wait, not just the work
The third question is whether automation actually relieves the bottleneck. Compared to human-reviewed diffs, RADAR cut the median time to close (the end-to-end time from when a diff is published to when it's finally closed, whether landed or abandoned) by over 330% and the median diff review wall time (the real-world clock time a diff spends waiting for and undergoing review, as opposed to active work time) by 35%. The median is the middle value, so half of diffs close faster and half slower.
The wall-time number is the meaningful one: it's the time a diff sits waiting for a human to get to it. By handling eligible diffs automatically, RADAR stops low-risk changes from clogging the queue and competing with high-risk changes for a reviewer's attention — which is the bottleneck the whole system was built to relieve.
What does "over 330% reduction" actually mean — and the honest caveats
A reduction over 100% is a slightly loose way of saying the human-reviewed baseline took several times longer; read it as "many times faster to close," not a literal percentage of a single quantity. The authors are upfront about the limits: this is an observational comparison, not a randomized experiment, so it shows association rather than airtight causation. Time-to-close and wall-time also measure responsiveness, not whether defects were prevented — those are tracked separately by the revert and incident rates.
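A short piece of arithmetic shows why a literal reduction cannot exceed 100%. Write $T_h$ for the human-reviewed median time to close and $T_r$ for RADAR's. The second line is only one possible reading of the figure, offered as an assumption for illustration; this guide does not say which formula produced "330%".

```latex
\[
\text{reduction relative to the baseline}
  = \frac{T_h - T_r}{T_h}
  = 1 - \frac{T_r}{T_h} < 1
  \qquad \text{whenever } T_r > 0 .
\]
\[
\text{If instead the gap is measured against the smaller value: }
\frac{T_h - T_r}{T_r} = 3.3
\;\Longrightarrow\;
T_h = 4.3\, T_r .
\]
```

Under that assumed reading, the human-reviewed baseline would take a little over four times as long to close, which matches the "many times faster" interpretation above.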
Why This Paper Matters
For builders and practitioners
If your team is feeling the same squeeze — AI assistants generating more code than anyone can review — RADAR is a concrete, copyable blueprint. The takeaways are practical: stack cheap-to-expensive checks so you fail fast; make trust earned and revocable per source rather than granted by category; always keep a default-safe fallback (route to a human) and a delay-before-landing valve; and expose a single tunable dial (the risk threshold) so each team can pick its own risk appetite. Crucially, it shows you can deploy this incrementally — start at the safest 5%, watch the guardrails, and widen as confidence grows — rather than betting the company on a big-bang rollout.
For the research community
Two findings are load-bearing. First, this inverts how risk models are used: prior work surfaced risk scores to inform human reviewers or route diffs to better ones. RADAR uses the same kind of score to take action — to decide which diffs need no human at all. That's a shift from risk-as-information to risk-as-automation. Second, it's a rare large-scale operational record (535K+ diffs, real production incidents, real revert rates) rather than a benchmark study, which is exactly the kind of evidence the field lacks as AI-generated code floods in. The honest caveats — it's observational, Meta-specific, and the metrics are proxies — are stated plainly, which makes the numbers more trustworthy, not less.
The bigger picture
For decades, the bottleneck in shipping software was writing the code. Generative AI is dissolving that bottleneck and quietly relocating it downstream — to review, testing, and the human accountability that gates production. RADAR is an early, concrete acknowledgment that as machines write more of the code, the systems that review code have to become machines too, or the whole pipeline stalls. The interesting open question the paper raises but doesn't resolve: code review was never only about catching bugs — it's also how engineers learn the codebase and spread knowledge. As more review gets automated, that human knowledge-transfer shrinks. The authors find the trade-off is currently favorable, but flag that it may shift as automation expands — a tension the whole industry will be navigating for years.
Glossary
Diff: Meta's name for one self-contained proposed code change (a "pull request" elsewhere): edited lines, a description, and a test plan.
Landing: Merging a diff into the shared codebase so it ships. The opposite is abandoning the diff.
Code review: Having another engineer read a change before it ships. It catches bugs, enforces standards, and spreads knowledge.
Phabricator: Meta's code-review and continuous-integration tool. Tracks every diff's metadata, actions, timestamps, and state.
Agentic AI: AI that autonomously carries out multi-step coding tasks (writing, testing, refactoring), then submits diffs like a human.
Revert: Undoing a landed change because it caused a problem. A guardrail metric.
Production Incident (PI): A logged outage or serious failure in the live product attributable to a change. The most serious negative outcome.
Codemod: A scripted code transformation applied across the codebase. Deterministic (a fixed rule) or LLM-generated.
RACER: Meta's GenAI tool that delegates well-defined coding tasks to an AI agent, which generates diffs from runbooks.
Runbook: A pre-configured RACER prompt for one type of change. Each has its own track record and safety settings.
RADAR: Risk Aware Diff Auto Review, the layered funnel that auto-reviews and lands low-risk diffs.
Diff Risk Score (DRS): An ML model predicting how likely a diff is to cause an incident, from metadata. Expressed as a percentile threshold.
Automated Code Review (ACR): An LLM that reads the actual code change and classifies it as safe or risky, making accept/reject calls.
LLM: Large Language Model. AI trained on huge text/code corpora; here used to read and understand code changes.
Funnel: A sequence of safety filters a diff must all pass to auto-qualify. Each layer removes more candidates.
ACE: AI Commit Eligibility, the policy letting bot diffs land with no human review when they pass all safety checks.
Verification: Step for human diffs: pass and you may ship now with a human review deferred to after landing.
Approval: Stricter step that waives the deferred review entirely. No human review is needed at all.
Risk threshold (PX): The risk-score cutoff. P5 = safest 5% qualify; P50 = safest 50%. Lower is more conservative.
SOX code: Code governing financial reporting (Sarbanes-Oxley). Legally requires human review; RADAR never automates it.
Time to close: End-to-end time from publishing a diff to closing it (landed or abandoned).
Diff review wall time: The clock time a diff spends waiting for and undergoing review, i.e. the queue time.