RADAR: The Paper, Explained

A beginner-friendly guide to how Meta automatically reviews and ships low-risk code changes — so human reviewers aren't drowned by the flood of AI-generated code. Every technical term is defined. Every concept is grounded in analogy.

Paper by Adams, Banga, Bansal et al. (Meta, 2026) • Explainer published June 2026

AI agents now write most of the new code, so the volume of changes is exploding, but the number of humans available to review it stays flat. The gap between the two is a growing backlog.

The Big Picture

Before any code change goes live at a big software company, another engineer usually reads it first. This is code review, the practice of having one or more engineers read a proposed change before it goes into the shared codebase: a second pair of eyes that catches bugs, enforces standards, spreads knowledge, and creates accountability, because someone other than the author has to understand the change. It works well, as long as there are enough reviewers to keep up.

That assumption just broke. Over the past year at Meta, agentic AI coding tools caused the amount of code per change to jump 105.9% and the number of changes each developer produces per month to rise 51%, with more than 80% of that growth coming from AI. (Agentic AI means AI systems that don't just suggest code but autonomously carry out multi-step coding tasks: writing, testing, and refactoring code on their own, then submitting it for review like a human engineer would.) Humans can't read faster to compensate, so the share of changes reviewed within 24 hours started falling, and review queues piled up into the thousands.

RADAR is Meta's answer. Instead of asking humans to review everything, it identifies the changes that are low-risk and routine — and reviews and ships those automatically, reserving human attention for the changes that actually need judgment. The paper tackles several problems at once:

Code is being created faster than it can be reviewed. AI agents produce a flood of changes, but human reviewer capacity is fixed. Something has to absorb the overflow without lowering the bar on safety.
Not every change deserves the same scrutiny. A mechanical rename or a dead-code deletion is very different from a change to payment logic — but a traditional review queue treats them the same, so trivial changes compete with dangerous ones for a reviewer's time.
Automating review is risky if done bluntly. Approve too much and you ship bugs; approve too little and you haven't helped. The hard part is finding a dial you can turn safely, and proving it stays safe at massive scale.

Don't try to replace code review. Instead, build a conservative, layered funnel that only lets a change through if it passes every safety check — static rules, a machine-learned risk score, and an AI reviewer that actually reads the code. Tune how aggressive the funnel is per team, watch the safety numbers, and expand gradually. The result: 535K+ changes auto-reviewed with a revert rate 1/13 and an incident rate 1/50 that of normal changes — while cutting review wait time by 35%.

Background Concepts

This paper assumes you know how code ships at a large company. Let's build that up from scratch — the vocabulary here is from software engineering, not deep learning.

Diffs, landing, and the review workflow

At Meta, a single self-contained proposed code change is called a diff, the equivalent of a "pull request" on GitHub (also called a "changeset" elsewhere). It bundles the edited lines, a description, and a test plan. An engineer writes a diff, describes it, and requests review. Reviewers leave comments; the author revises. When everyone's satisfied, the diff lands: it gets merged into the shared codebase and ships as part of the product. (The opposite outcome is abandoning the diff.) All of this runs through Phabricator, Meta's code-review and continuous-integration tool, which tracks each diff's metadata, the actions reviewers take, timestamps, and the diff's current state. That record is the raw data this paper studies.

Think of a diff as a draft article submitted to a newspaper editor. The editor (reviewer) marks it up, the writer revises, and once approved it gets "published" (landed). RADAR is like an automated copy-desk that can clear the obviously-clean drafts — a typo fix, a reformatting — so the human editors can focus on the investigative pieces that need real judgment.

Reverts and Production Incidents: how we measure "did it go wrong?"

Two signals tell you a landed change caused trouble. A revert is when a landed diff gets undone because something broke; a high revert rate signals that changes are landing before they're truly ready. A Production Incident (PI) is more serious: a logged safety or reliability event in the live product (an outage, a serious bug, a regression) that is attributable to a code change. PIs are the most serious negative outcome and the thing RADAR is most careful to avoid. These are the paper's guardrail metrics, the numbers that must not get worse when you automate. RADAR's headline safety claim is that its changes are reverted 1/13 as often and cause PIs 1/50 as often as non-RADAR changes.

Codemods and RACER: where the bot-written code comes from

A lot of code changes are mechanical and repetitive. A codemod is a scripted, automated transformation applied across the codebase, like "rename this function everywhere" or "migrate every caller to the new API." (An API, or Application Programming Interface, is the defined set of functions and rules one piece of software exposes for others to call; an API migration mechanically updates every place that calls an old interface to use a new one.) A codemod can be fully deterministic (a fixed rule) or LLM-generated. RACER (Risk-Aware Code Editing and Refactoring) is Meta's GenAI tool that delegates bigger but still well-defined jobs to an AI agent: removing dead code, reducing complexity, fixing lint, migrating frameworks. The agent generates diffs from pre-written instruction templates called runbooks. Each runbook is a pre-configured prompt that encodes the instructions, constraints, and context for one specific type of code change, and each has its own track record and safety settings. These bot-authored diffs, a primary input to RADAR, are exactly the high-volume, low-judgment changes RADAR is designed to absorb.

The two AI components: a risk score and an AI reviewer

RADAR leans on two pieces of machine learning, and it's worth separating them clearly because they do different jobs:

Diff Risk Score (DRS)

A machine-learning model (a program that learns patterns from past data) that looks at a diff's metadata (who, what files, how big, historical patterns) and outputs a single number: how likely this change is to cause a production incident. DRS was trained on Meta's history of diffs and which ones caused incidents, so it can predict risk for a new diff. It doesn't read the code's meaning; it predicts risk from the shape of the change.

Automated Code Review (ACR)

A reviewer built on an LLM (Large Language Model: the kind of AI, like GPT or Llama, trained on huge amounts of text and code). It actually reads the changed code to understand what the change does, beyond surface metadata, classifies it against known "safe" and "risky" patterns, and then makes an accept-or-reject call. It supplies the semantic understanding that a metadata score can't.

What kinds of patterns does the AI reviewer treat as "safe" vs "risky"?

Safe signals (can be auto-accepted): refactoring with no behavior change, dead-code removal, defensive checks, added logging, pure formatting, documentation/comment updates, import cleanup, added tests, and static resource updates.

Risk signals (instant disqualification): high estimated review effort (a complexity score of 4+), large structural changes, identified bugs or logic errors, performance risks, and security vulnerabilities like exposed secrets, SQL injection, or authentication bypasses. (SQL injection is a classic attack where malicious input is crafted so a program accidentally runs it as a database command, letting an attacker read or destroy data. It is a canonical "never ship this without a human looking" red flag.)

To auto-accept, the AI reviewer must be highly confident — at least 8 out of 10 — and every change in the diff must fall into a safe category. Any single risk signal kills the auto-accept and sends the diff to a human.
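The accept rule above can be sketched as a small decision function. This is a minimal illustration, not the paper's implementation: the `ReviewFindings` structure, the category identifiers, and the choice to route an empty category list to a human are all assumptions made for the example. Only the 8-out-of-10 confidence floor, the complexity cutoff of 4, and the "any risk signal disqualifies" rule come from the text.

```python
from dataclasses import dataclass, field

# Category names paraphrase the explainer's "safe signals"; identifiers are illustrative.
SAFE_CATEGORIES = {
    "refactor_no_behavior_change", "dead_code_removal", "defensive_check",
    "added_logging", "formatting", "docs_or_comments", "import_cleanup",
    "added_tests", "static_resource_update",
}

MIN_CONFIDENCE = 8   # "at least 8 out of 10"
MAX_COMPLEXITY = 3   # a complexity score of 4+ counts as a risk signal


@dataclass
class ReviewFindings:
    """Hypothetical summary of what the AI reviewer reported for one diff."""
    change_categories: list[str]
    confidence: int                      # 0-10
    complexity: int                      # estimated review effort
    risk_signals: list[str] = field(default_factory=list)


def acr_decision(f: ReviewFindings) -> str:
    """Return "auto_accept" or "route_to_human"."""
    if f.risk_signals:                   # any single risk signal disqualifies
        return "route_to_human"
    if f.complexity > MAX_COMPLEXITY:
        return "route_to_human"
    if f.confidence < MIN_CONFIDENCE:
        return "route_to_human"
    if not f.change_categories:          # assumption: nothing classified means no auto-accept
        return "route_to_human"
    if not all(c in SAFE_CATEGORIES for c in f.change_categories):
        return "route_to_human"
    return "auto_accept"


print(acr_decision(ReviewFindings(["formatting", "import_cleanup"], confidence=9, complexity=2)))
```

Every branch except the last returns the same default-safe outcome, which mirrors the rule that a diff must earn auto-acceptance rather than earn rejection.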

How RADAR Works

RADAR is deliberately not a single clever model. It's a funnel: a sequence of independent safety layers, where a diff must clear every one to be auto-approved. If it fails any layer, it falls out of the funnel and goes to a human — the default-safe outcome. This design is what lets Meta roll it out gradually and tune it without ever removing the safety net.

Each diff must pass every layer in turn: static rules, then the risk score, then the AI reviewer. Anything that trips a single check is routed to a human. Only what survives all of them auto-lands.

The three core checks, in order, are:

Input: a new diff that has already passed eligibility.

1 · Static heuristics

Cheap, hard rules on metadata: the diff must not touch open-source code, SOX-scoped code, or anything requiring extra reviews, and must come from an approved automation source.

2 · Diff Risk Score gate

The ML risk score must fall below a configurable threshold (e.g. only the safest 5%, 20%, or 50% of diffs qualify).

3 · RADAR Review Agent (the AI reviewer)

An LLM reads the actual code change, verifies no judgment-requiring business logic was touched, and confirms every change is a known-safe pattern. Any risk signal → rejected to a human.

Output: a diff that passes all three lands automatically (after a delay window).

Notice the layers are complementary. Static rules are cheap but blind to meaning. The risk score reads patterns but not intent. The AI reviewer reads intent but is the most expensive. Stacking them means a cheap layer can reject a diff before an expensive layer ever runs — and a smart layer can rescue a diff the dumb layer would wrongly fear (a huge-but-trivial mechanical refactor) or catch a subtle bug the pattern-matchers miss.
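The cheapest-first ordering can be sketched as a list of checks that short-circuits on the first failure. This is a rough illustration under stated assumptions: the `Diff` fields, the function names, and the P20 value passed to the gate are hypothetical, and the LLM reviewer is reduced to a precomputed boolean so the example stays self-contained.

```python
from typing import Callable, NamedTuple


class Diff(NamedTuple):
    """Hypothetical, minimal view of a diff for this sketch."""
    touches_restricted_code: bool   # open-source, SOX-scoped, or extra-review code
    from_approved_source: bool
    risk_percentile: float          # 0-100, lower is safer
    reviewer_says_safe: bool        # stand-in for the LLM reviewer's verdict


def static_heuristics(d: Diff) -> bool:
    return d.from_approved_source and not d.touches_restricted_code


def risk_gate(threshold: float) -> Callable[[Diff], bool]:
    # The score must fall below the configured percentile threshold.
    return lambda d: d.risk_percentile < threshold


def review_agent(d: Diff) -> bool:
    return d.reviewer_says_safe


def run_funnel(d: Diff, layers: list[tuple[str, Callable[[Diff], bool]]]) -> str:
    """Run layers in order; the first failure routes the diff to a human."""
    for name, check in layers:
        if not check(d):
            return f"human_review (failed {name})"
    return "auto_approve"


# Ordered cheapest to most expensive, so the LLM only runs on survivors.
LAYERS = [
    ("static_heuristics", static_heuristics),
    ("risk_gate", risk_gate(20.0)),
    ("review_agent", review_agent),
]

print(run_funnel(Diff(False, True, 10.0, True), LAYERS))
print(run_funnel(Diff(False, True, 35.0, True), LAYERS))
```

Because each layer is just a named predicate, adding, removing, or retuning a layer changes the list and nothing else, and the fallback to human review never goes away.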

Why land after a delay instead of instantly?

Even when RADAR approves a bot diff, it doesn't merge it the very same second. Approved diffs sit in a landing queue for a configurable window (e.g. 24 hours) during which a human can still step in and reject. It's a final safety valve: automation moves things forward by default, but a human always retains the ability to pull the cord before it's irreversible.
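A delay window like this can be sketched with a timestamp and a veto flag. The `QueuedDiff` type and `ready_to_land` function are hypothetical names for illustration; the 24-hour default is the example value from the text, not a confirmed production setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedDiff:
    """Hypothetical entry in the landing queue."""
    diff_id: str
    approved_at: datetime
    vetoed: bool = False        # set when a human steps in during the window


def ready_to_land(q: QueuedDiff, now: datetime,
                  window: timedelta = timedelta(hours=24)) -> bool:
    """A diff lands only if nobody vetoed it and the full window has elapsed."""
    if q.vetoed:
        return False
    return now - q.approved_at >= window


approved = datetime(2026, 1, 1, 9, 0)
queued = QueuedDiff("D1", approved)
print(ready_to_land(queued, approved + timedelta(hours=23)))  # still inside the window
print(ready_to_land(queued, approved + timedelta(hours=24)))  # window elapsed, no veto
```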

Who Qualifies: the Eligibility Model

Before a diff ever reaches the three-layer funnel, RADAR first asks: is this diff even allowed to be considered for automation? This is the eligibility model, and the authors single it out as a key contribution. The trick is that different sources of code get different rules, because they carry different risk.

Every diff is first sorted by who wrote it and how. Each source type then flows down its own path, with stricter or looser rules to match its risk profile.

The first split is human vs. bot. Bot diffs are then split further by how they were generated:

Deterministic codemods → Blanket AutoAccept: A fully-specified, no-AI transformation (like a mechanical API migration). Because the recipe itself was vetted once, every diff it produces is trusted, so these bypass per-diff AI review entirely. The least restrictive path.

AI-generated codemods → per-diff review: When an LLM generates the transformation, each diff can vary, so each one must individually pass the full funnel (risk score + AI review). Variability earns per-diff scrutiny.

RACER runbooks → per-runbook eligibility, then the funnel: The most granular path. Each runbook must earn eligibility before its diffs even enter the funnel (see below).

Human-authored diffs → Verification + Approval: Eligible only if the author qualifies (role, tenure, on-call ownership) and the diff clears scope and content rules. Then it enters a two-step human-diff pipeline.

The per-runbook granularity is RADAR's most distinctive idea: two RACER runbooks might produce identical-looking transformations, but if one has a history of getting reverted, it's blocked while the other sails through. Trust is earned per source, from its own track record — not granted to a category wholesale.
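The routing described above amounts to a dispatch on the diff's source. The sketch below is illustrative only: the `Source` enum, the `route` function, and the returned path labels are invented names, and the two boolean inputs stand in for the runbook-eligibility and author-qualification checks, which are more involved in practice.

```python
from enum import Enum


class Source(Enum):
    DETERMINISTIC_CODEMOD = "deterministic_codemod"
    AI_CODEMOD = "ai_codemod"
    RACER_RUNBOOK = "racer_runbook"
    HUMAN = "human"


def route(source: Source, *, runbook_eligible: bool = False,
          author_qualifies: bool = False) -> str:
    """Pick the path a diff takes, based only on where it came from."""
    if source is Source.DETERMINISTIC_CODEMOD:
        return "blanket_auto_accept"          # recipe vetted once, no per-diff AI review
    if source is Source.AI_CODEMOD:
        return "funnel"                       # every diff individually passes the funnel
    if source is Source.RACER_RUNBOOK:
        # Trust is earned per runbook: an ineligible runbook never reaches the funnel.
        return "funnel" if runbook_eligible else "human_review"
    if source is Source.HUMAN:
        return "verification_then_approval" if author_qualifies else "human_review"
    raise ValueError(f"unknown source: {source!r}")


print(route(Source.RACER_RUNBOOK, runbook_eligible=True))
print(route(Source.RACER_RUNBOOK, runbook_eligible=False))
```

The last two calls show the point of the per-runbook design: the same source category yields different paths depending on the individual runbook's standing.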

The four gates a RACER runbook must pass to be eligible

Risk-history heuristics: over a 60-day lookback, the runbook must have zero production incidents, a low revert rate, a low human-rejection rate, and enough landed diffs to be statistically meaningful.
Per-runbook daily limits: caps on how many diffs a single runbook can auto-land per day (default conservative; trusted runbooks can be raised up to 2,000/day) so no one source floods the commit queue.
Per-runbook risk thresholds: trusted ("allowlisted") runbooks get the relaxed P50 risk threshold; everyone else uses the stricter P20.
Denylist: runbooks that have caused incidents or touch sensitive areas are permanently blocked. Anything with "test" in its name is also excluded, to avoid silently automating changes to test infrastructure.
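The four gates can be sketched as one function that returns either the risk threshold a runbook should use or the reason it was blocked. Treat the numbers with care: the text only says "low" revert and rejection rates, "enough" landed diffs, and a "conservative" default limit, so `MAX_REVERT_RATE`, `MAX_REJECTION_RATE`, `MIN_LANDED_DIFFS`, and `DEFAULT_DAILY_LIMIT` are placeholder assumptions. The 2,000/day ceiling, the zero-incident rule, the P50/P20 split, and the "test" naming rule come from the list above.

```python
from dataclasses import dataclass

# Placeholder cutoffs; the source does not give these values.
MAX_REVERT_RATE = 0.01
MAX_REJECTION_RATE = 0.05
MIN_LANDED_DIFFS = 50
DEFAULT_DAILY_LIMIT = 100
MAX_DAILY_LIMIT = 2000      # stated ceiling for trusted runbooks


@dataclass
class RunbookStats:
    """Hypothetical 60-day track record for one RACER runbook."""
    name: str
    incidents: int          # production incidents in the lookback
    revert_rate: float
    rejection_rate: float   # human rejections
    landed: int
    landed_today: int = 0


def runbook_gate(rb, denylist, allowlist, daily_limits=None):
    """Return (eligible, detail): the risk threshold on success, else the reason."""
    daily_limits = daily_limits or {}
    # Denylist, plus the rule excluding anything with "test" in its name.
    if rb.name in denylist or "test" in rb.name.lower():
        return False, "denylisted"
    # Risk history over the lookback window.
    if rb.incidents > 0:
        return False, "production incident in lookback"
    if rb.revert_rate > MAX_REVERT_RATE or rb.rejection_rate > MAX_REJECTION_RATE:
        return False, "revert or rejection rate too high"
    if rb.landed < MIN_LANDED_DIFFS:
        return False, "not enough landed diffs"
    # Per-runbook daily cap, never above the global ceiling.
    limit = min(daily_limits.get(rb.name, DEFAULT_DAILY_LIMIT), MAX_DAILY_LIMIT)
    if rb.landed_today >= limit:
        return False, "daily limit reached"
    # Allowlisted runbooks get the relaxed P50 threshold, everyone else P20.
    return True, ("P50" if rb.name in allowlist else "P20")


clean = RunbookStats("dead_code_cleanup", 0, 0.002, 0.01, 400)
print(runbook_gate(clean, denylist=set(), allowlist={"dead_code_cleanup"}))
```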

Why organizations can set their own rules

Eligibility thresholds aren't global. Each org configures its own risk appetite through a policy config controlling its risk thresholds, whether deferred review is on, and which automation sources are allowed. One org might effectively bypass the risk-score gate for bot diffs and rely on the AI reviewer alone; another keeps stricter defaults. This is how one system serves a whole company of teams with very different tolerances.
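One way to picture such a policy config is a frozen dataclass with company-wide defaults and per-org overrides. Everything here is hypothetical: the field names, the org names, and the default values are assumptions chosen to match the knobs the paragraph lists (risk thresholds, deferred review, allowed sources), not Meta's actual schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OrgPolicy:
    """Hypothetical per-org policy; thresholds are risk-score percentiles."""
    bot_risk_threshold: float = 20.0      # assumed default: safest 20% of bot diffs
    human_risk_threshold: float = 5.0     # assumed default: safest 5% of human diffs
    deferred_review_enabled: bool = True
    allowed_sources: frozenset[str] = frozenset({"deterministic_codemod"})


DEFAULT_POLICY = OrgPolicy()

# Overrides start from the defaults, so an org only states what it changes.
ORG_POLICIES = {
    # A threshold of 100 admits every bot diff, which effectively bypasses the
    # risk-score gate and leaves the AI reviewer as the deciding layer.
    "org_a": replace(
        DEFAULT_POLICY,
        bot_risk_threshold=100.0,
        allowed_sources=frozenset({"deterministic_codemod", "racer_runbook"}),
    ),
}


def policy_for(org: str) -> OrgPolicy:
    """Orgs without an explicit entry fall back to the stricter defaults."""
    return ORG_POLICIES.get(org, DEFAULT_POLICY)


print(policy_for("org_a").bot_risk_threshold)
print(policy_for("some_other_org").bot_risk_threshold)
```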

The Two Pipelines

Eligibility routes a diff into one of two pipelines depending on whether a bot or a human wrote it. They share the same building blocks but answer slightly different questions.

Bot diffs: the ACE pipeline

For bot-authored diffs, RADAR applies a policy called ACE (AI Commit Eligibility): if a diff from an automated source clears all three funnel layers (static checks, risk score, and the AI reviewer), it can land with no human review at all. This is the aggressive case, and it's allowed precisely because bot diffs come from vetted, monitored sources with daily volume caps and the same delay-before-landing safety valve.

Human diffs: Verification, then Approval

Human diffs get a more cautious two-step treatment, and the human author always stays in control — they can ship with RADAR, wait for a human, or send the diff back to "needs review" at any time.

1 · RADAR Verification: Checks eligibility and content, then runs the AI reviewer and the risk score (default: safest 5%). Pass → the author may ship now, with a human review deferred to after landing.

2 · RADAR Approval: Re-checks verified diffs against stricter criteria. Pass → the deferred review is waived entirely: no human review is needed at all.

The distinction is subtle but important. Verification says "this looks safe enough to ship now and have a human glance at it later." Approval says "this is safe enough that no human needs to look at all." The two-step design lets RADAR be helpful (ship now, review later) for a broad set of diffs while reserving full automation for an even safer subset.
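The two tiers can be written as a three-way outcome. The `Outcome` enum and `human_diff_outcome` function are illustrative names, and the two booleans stand in for the full Verification and Approval checks. Note that Approval only matters for diffs that were already verified, and that in the real system the author can still choose to wait for a human.

```python
from enum import Enum


class Outcome(Enum):
    HUMAN_REVIEW_FIRST = "human_review_first"
    SHIP_NOW_REVIEW_LATER = "ship_now_review_later"   # Verification passed
    SHIP_NOW_NO_REVIEW = "ship_now_no_review"         # Approval passed too


def human_diff_outcome(verified: bool, approved: bool) -> Outcome:
    """Most automated option available to the author of a human-written diff."""
    if not verified:
        return Outcome.HUMAN_REVIEW_FIRST      # Approval is never consulted
    if approved:
        return Outcome.SHIP_NOW_NO_REVIEW      # deferred review is waived
    return Outcome.SHIP_NOW_REVIEW_LATER       # ship, human reviews after landing


for verified, approved in [(False, False), (True, False), (True, True)]:
    print(verified, approved, human_diff_outcome(verified, approved).value)
```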

Verification is the express lane at airport security for trusted travelers — you go through faster, with a chance of a random secondary check later. Approval is being waved through entirely because you're a flight crew member with credentials: no secondary check at all. Same checkpoint, two tiers of trust.

Tuning the Risk Dial

The single most tunable knob in RADAR is the Diff Risk Score threshold. It's expressed as a percentile cutoff on the risk score, written PX: P5 means only the safest (lowest-risk) 5% of diffs qualify, P50 means the safest 50% do. A lower threshold is more conservative (fewer diffs automated, less risk); a higher one widens the net.
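To make the percentile idea concrete, here is a small sketch that turns a PX setting into a raw score cutoff using the nearest-rank method. The function names and the toy score list are assumptions for illustration; the source does not say how Meta computes the cutoff.

```python
import math


def percentile_cutoff(scores: list[float], pct: float) -> float:
    """Nearest-rank percentile: the score at or below which about pct% of diffs fall."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def qualifying(scores: list[float], pct: float) -> list[float]:
    """Scores that would pass a gate set at the given percentile."""
    cutoff = percentile_cutoff(scores, pct)
    return [s for s in scores if s <= cutoff]


# Toy history: 100 diffs with distinct risk scores 1..100 (lower is safer).
history = [float(s) for s in range(1, 101)]

print(len(qualifying(history, 5)))    # P5 admits the 5 safest diffs
print(len(qualifying(history, 50)))   # P50 admits the 50 safest diffs
```

Moving from P5 to P50 changes only the cutoff; the scores themselves and the rest of the funnel are untouched, which is what makes the threshold a convenient single dial.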

Diffs sorted from safest to riskiest. Sliding the threshold from P5 toward P50 lets more diffs auto-qualify, lifting the approve rate from 25% to 60.31%, while reverts and incidents stayed flat.

The key experiment in the paper (their second research question) relaxed this threshold from P25 to P50. The worry with widening any safety net is that risk grows with it. But RADAR's result was the opposite of alarming: the auto-approve rate climbed to 60.31% — a big jump in how much work the system absorbs — while the revert and incident rates didn't degrade.

The relationship between the risk threshold and safety is not linear. A wider approval envelope captured many more genuinely-low-risk diffs without a proportional increase in incidents. The practical lesson: start conservative, watch the guardrails, and relax the dial as your operational confidence grows.

Results

The paper is organized around three research questions: can it work at scale (feasibility), can you tune it safely (calibration), and does it actually save time (impact)?

Feasibility: it runs at real scale, safely

535K+ diffs auto-reviewed

331K+ diffs auto-landed

25K diffs reviewed per day (peak)

60.31% current auto-approve rate

Safety guardrail	RADAR vs. non-RADAR diffs
Revert rate	1⁄13 (about 92% lower)
Production Incident (PI) rate	1⁄50 (about 98% lower)
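The "about 92% lower" and "about 98% lower" figures follow directly from the ratios: a rate that is 1/N of the baseline is (1 - 1/N) lower. A quick check, with `percent_lower` as an illustrative helper name:

```python
from fractions import Fraction


def percent_lower(ratio: Fraction) -> float:
    """Convert "rate is `ratio` times the baseline" into "percent lower than baseline"."""
    return float((1 - ratio) * 100)


print(round(percent_lower(Fraction(1, 13)), 1))  # revert rate
print(round(percent_lower(Fraction(1, 50)), 1))  # production incident rate
```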

Those safety gaps are statistically significant, and a telling detail: when domain experts manually reviewed the production incidents, none were judged to be ones a human reviewer would have caught. In other words, the rare failures weren't failures of automation replacing a human's catch — they were the kind of thing that would have slipped past a person too.

Why is RADAR's revert/incident rate so much lower — isn't that suspicious?

It's not magic, and the authors are careful to call it an association, not proof of causation. RADAR only automates diffs that are already low-risk by construction — that's the whole point of the funnel. So you'd expect them to revert and break less than the general population of diffs, which includes every gnarly, high-stakes change humans handle. The result confirms the funnel is selecting what it's supposed to, rather than claiming automation makes any given change safer.

Impact: it removes the wait, not just the work

The third question is whether automation actually relieves the bottleneck. Compared to human-reviewed diffs, RADAR cut the median time to close by over 330% and the median diff review wall time by 35%. Time to close is the end-to-end time from when a diff is published to when it's finally closed (landed or abandoned); the median is the middle value, so half of diffs close faster and half slower. Diff review wall time is the real-world clock time a diff spends waiting for and undergoing review: the "sitting in the queue" time, as opposed to active work time.

The wall-time number is the meaningful one: it's the time a diff sits waiting for a human to get to it. By handling eligible diffs automatically, RADAR stops low-risk changes from clogging the queue and competing with high-risk changes for a reviewer's attention — which is the bottleneck the whole system was built to relieve.

What does "over 330% reduction" actually mean — and the honest caveats

A reduction over 100% is a slightly loose way of saying the human-reviewed baseline took several times longer; read it as "many times faster to close," not a literal percentage of a single quantity. The authors are upfront about the limits: this is an observational comparison, not a randomized experiment, so it shows association rather than airtight causation. Time-to-close and wall-time also measure responsiveness, not whether defects were prevented — those are tracked separately by the revert and incident rates.

Final Quiz

What is RADAR's core design, and why does it matter?

A single large model that decides whether to approve any diff.
A layered funnel where a diff must pass every safety check (static rules, risk score, AI reviewer) to be auto-approved, and failing any layer sends it to a human.
A tool that replaces all human code review at Meta.
A faster text editor for writing diffs.

What's the difference between the Diff Risk Score (DRS) and Automated Code Review (ACR)?

They're two names for the same model.
DRS reads the code's meaning; ACR only looks at metadata.
DRS is an ML model that predicts risk from a diff's metadata/shape; ACR is an LLM that actually reads the changed code and classifies it as safe or risky.
DRS reverts diffs; ACR lands them.

Why do two RACER runbooks with identical transformations get treated differently?

Because eligibility is earned per-runbook from its own track record: one with a history of reverts is blocked while a clean one proceeds.
Because the newer runbook is always trusted more.
Because they touch different programming languages.
They aren't: identical transformations always get identical treatment.

What does the percentile threshold (e.g. P5 vs P50) control?

How many human reviewers are assigned to each diff.
How conservative the risk gate is: P5 lets only the safest 5% of diffs qualify, P50 lets the safest 50% qualify.
The delay before a diff lands.
The confidence score the AI reviewer must hit.

What's the difference between RADAR Verification and RADAR Approval for human diffs?

Verification waives review entirely; Approval defers it.
Verification lets a diff ship now with a human review deferred to after landing; Approval applies stricter criteria to waive human review entirely.
They are the same step with different names.
Verification is for bots; Approval is for humans.

When Meta relaxed the risk threshold from P25 to P50, what happened?

The approve rate fell and incidents spiked.
Nothing changed at all.
The auto-approve rate rose to 60.31% while revert and incident rates stayed flat, showing risk and threshold aren't linearly related.
Human review was eliminated company-wide.

Why This Paper Matters

For builders and practitioners

If your team is feeling the same squeeze — AI assistants generating more code than anyone can review — RADAR is a concrete, copyable blueprint. The takeaways are practical: stack cheap-to-expensive checks so you fail fast; make trust earned and revocable per source rather than granted by category; always keep a default-safe fallback (route to a human) and a delay-before-landing valve; and expose a single tunable dial (the risk threshold) so each team can pick its own risk appetite. Crucially, it shows you can deploy this incrementally — start at the safest 5%, watch the guardrails, and widen as confidence grows — rather than betting the company on a big-bang rollout.

For the research community

Two findings are load-bearing. First, this inverts how risk models are used: prior work surfaced risk scores to inform human reviewers or route diffs to better ones. RADAR uses the same kind of score to take action — to decide which diffs need no human at all. That's a shift from risk-as-information to risk-as-automation. Second, it's a rare large-scale operational record (535K+ diffs, real production incidents, real revert rates) rather than a benchmark study, which is exactly the kind of evidence the field lacks as AI-generated code floods in. The honest caveats — it's observational, Meta-specific, and the metrics are proxies — are stated plainly, which makes the numbers more trustworthy, not less.

The bigger picture

For decades, the bottleneck in shipping software was writing the code. Generative AI is dissolving that bottleneck and quietly relocating it downstream — to review, testing, and the human accountability that gates production. RADAR is an early, concrete acknowledgment that as machines write more of the code, the systems that review code have to become machines too, or the whole pipeline stalls. The interesting open question the paper raises but doesn't resolve: code review was never only about catching bugs — it's also how engineers learn the codebase and spread knowledge. As more review gets automated, that human knowledge-transfer shrinks. The authors find the trade-off is currently favorable, but flag that it may shift as automation expands — a tension the whole industry will be navigating for years.

Glossary

Diff

Meta's name for one self-contained proposed code change (a "pull request" elsewhere): edited lines, a description, and a test plan.

Landing

Merging a diff into the shared codebase so it ships. The opposite is abandoning the diff.

Code review

Having another engineer read a change before it ships — catches bugs, enforces standards, spreads knowledge.

Phabricator

Meta's code-review and continuous-integration tool. Tracks every diff's metadata, actions, timestamps, and state.

Agentic AI

AI that autonomously carries out multi-step coding tasks — writing, testing, refactoring — then submits diffs like a human.

Revert

Undoing a landed change because it caused a problem. A guardrail metric.

Production Incident (PI)

A logged outage or serious failure in the live product attributable to a change. The most serious negative outcome.

Codemod

A scripted code transformation applied across the codebase. Deterministic (a fixed rule) or LLM-generated.

RACER

Meta's GenAI tool that delegates well-defined coding tasks to an AI agent, which generates diffs from runbooks.

Runbook

A pre-configured RACER prompt for one type of change. Each has its own track record and safety settings.

RADAR

Risk Aware Diff Auto Review — the layered funnel that auto-reviews and lands low-risk diffs.

Diff Risk Score (DRS)

An ML model predicting, from metadata, how likely a diff is to cause an incident. RADAR gates on it using a percentile threshold.

Automated Code Review (ACR)

An LLM that reads the actual code change and classifies it as safe or risky, making accept/reject calls.

LLM

Large Language Model — AI trained on huge text/code corpora; here used to read and understand code changes.

Funnel

A sequence of safety filters a diff must all pass to auto-qualify. Each layer removes more candidates.

ACE

AI Commit Eligibility — the policy letting bot diffs land with no human review when they pass all safety checks.

RADAR Verification

Step for human diffs: pass and you may ship now with a human review deferred to after landing.

RADAR Approval

Stricter step that waives the deferred review entirely — no human review needed at all.

Percentile threshold (PX)

The risk-score cutoff. P5 = safest 5% qualify; P50 = safest 50%. Lower is more conservative.

SOX-scoped code

Code governing financial reporting (Sarbanes-Oxley). Legally requires human review; RADAR never automates it.

Time to close

End-to-end time from publishing a diff to closing it (landed or abandoned).

Diff review wall time

The clock time a diff spends waiting for and undergoing review — the queue time.