Review the Spec, Not the Diff: How AI Changes the Way We Build Software

Something quietly broke in how we build software, and most teams have not adjusted to it yet. For decades the safeguard that kept code honest was review at the end: a person read the diff before it merged. AI did not make that safeguard better or worse. It made it impractical at the scale at which we now produce code. What is replacing it is a different shape for the whole lifecycle, and it is worth understanding before it arrives by default.

The shift fits in one line:

Review the spec heavily, the code lightly.

It sounds like a corner being cut. It is the opposite. It is what rigor looks like once a machine, not a person, is doing most of the typing.

One clarification before going further, because the word spec is overloaded. The spec this piece cares about is the technical one: the proposed approach and how the change will be built. It is not the product spec — the brief that decides what to build and why. Product requirements are the input to this step, not its subject. The document under review is the engineer's plan for satisfying them, and that technical plan is what gets read heavily before any code exists.

The thing that broke

Writing code used to be the slow and expensive part of the job. Review existed to catch the mistakes people made while doing that slow work, so it sat at the end, right before merge. It worked because code arrived at roughly the speed a person could read it. The producer and the reviewer ran at the same pace.

AI breaks that balance. Code now gets produced far faster than anyone can read it line by line. Writing got cheap and the volume went up, so reading all of it got more expensive, not cheaper. The reviewer can no longer keep up with the producer, and no amount of discipline closes a gap that widens every time generation gets faster.

        WHERE THE COST LIVES

  BEFORE — humans write the code
  ───────────────────────────────────────────────
  spec    ░░░
  build   ████████████   slow: writing is the bottleneck
  review  ██████████      feasible: code arrives at reading speed

  NOW — AI writes most of the code
  ───────────────────────────────────────────────
  spec    ██████████      invest here, up front
  build   ░░░             fast and cheap
  review  ████████████████  more code, arriving faster than
                            anyone can read it, so reviewing
                            every line by hand costs more than
                            before, not less.

This is the part people misread. Manual code review did not get cheaper. It got more expensive, and trying to read every line by hand is the thing that no longer fits. So the effort moves to where a small amount of human attention still changes the outcome: the decision about how the change will be built and structured, made before any code exists.

A wrong approach caught in a spec is a five-minute edit. The same approach caught after a pile of generated code already exists is a day of rework, or it never gets caught at all because the review queue is hopeless. The leverage moved to the front of the process because that is the last point where reviewing a little still covers a lot.

The shape the lifecycle is taking

When you follow that logic through, the lifecycle reorganizes itself. The center of gravity becomes the spec and its review. Everything after exists to make that early decision pay off.

   ┌─────────────────────────────────────────────────────────────┐
   │             THE EMERGING LIFECYCLE  (7 steps)                │
   └─────────────────────────────────────────────────────────────┘

   1. SPEC                A short TECHNICAL spec: the problem and
      │                   why in a line or two, then the proposed
      │                   approach — how the change will be built.
      │                   The engineering plan, not the product brief.
      ▼
   2. SPEC REVIEW   ◀══════ ★ THE MAIN QUALITY GATE ★
      │                   A peer reviews the technical spec BEFORE
      │                   any code. Align on the approach, kill wrong
      │                   ideas while they're cheap to change.
      ▼
   3. BUILD               Implement against the approved spec.
      │                   AI does most of the typing. A human
      │                   owns and understands every line.
      ▼
   4. ACCEPTANCE CHECK    (only if there's UX to try)
      │   ··· optional    Someone tries the feature and confirms
      │                   it does what the spec intended. A check
      │                   on outcome, not on code.
      ▼
   5. SHIP                Automated review runs on everything.
      │                   Human review only for high-stakes
      │                   changes ↓
      │                   ┌─────────────────────────────────┐
      │                   │ HIGH-STAKES?  → human review     │
      │                   │ required before merge            │
      │                   └─────────────────────────────────┘
      ▼
   6. VALIDATE            Confirm it works in staging, then in
      │                   production, before announcing it.
      ▼
   7. ANNOUNCE            Tell people once it is validated.

The steps matter less than the move underneath them: the heavy human gate shifts from the end of the process to the beginning, and from the code to the intent.
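
The branching in the diagram can be written down as a short function. This is a minimal sketch, not a prescribed implementation: the `Change` type, its two fields, and the step labels are names invented here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a change; both fields are illustrative."""
    has_ux: bool       # is there user-facing behavior someone can try?
    high_stakes: bool  # does it touch a high-blast-radius area?


def lifecycle_steps(change: Change) -> list[str]:
    """Return the ordered steps a change passes through."""
    # Steps 1-3 apply to every change; the spec review is the main gate.
    steps = ["spec", "spec review", "build"]
    if change.has_ux:
        # Step 4 is optional: only when there is UX to try.
        steps.append("acceptance check")
    # Step 5: automated review always runs, human review is conditional.
    steps.append("ship: automated review")
    if change.high_stakes:
        steps.append("ship: human review before merge")
    # Steps 6-7 close out every change.
    steps += ["validate", "announce"]
    return steps
```

Only two steps are conditional. The spec review is not one of them, which is the point of calling it the main quality gate.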

Two reviewers, not one

The reason this is safe and not reckless is that review does not disappear. It splits in two, by what each kind of reviewer is actually good at.

Machines review the code. Static analysis, security scanning, dependency checks, and tests run on every change, every time, with no exceptions. This is the only kind of review that scales with the volume AI produces, and it is the layer that catches the implementation-level defects a spec review can never see: injection, a missing access check, an unsafe dependency, a broken edge case. It is the floor under everything.
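
A minimal sketch of that floor, assuming each check is a function from a diff to a list of findings. The names `run_automated_review`, `may_merge`, and `no_eval` are hypothetical, and `no_eval` is a toy stand-in for a real static-analysis rule, not a real scanner.

```python
from typing import Callable, Dict, List

# Assumption: a check takes the diff text and returns its findings.
Check = Callable[[str], List[str]]


def run_automated_review(diff: str, checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check on the diff and return findings keyed by check name.

    Every check runs even if an earlier one already failed, so the author
    sees all findings at once. Checks with no findings are left out.
    """
    findings: Dict[str, List[str]] = {}
    for name, check in checks.items():
        result = check(diff)
        if result:
            findings[name] = result
    return findings


def may_merge(findings: Dict[str, List[str]]) -> bool:
    """No exceptions: any finding from any check blocks the merge."""
    return not findings


def no_eval(diff: str) -> List[str]:
    """Toy rule: flag added lines that call eval()."""
    return [
        line for line in diff.splitlines()
        if line.startswith("+") and "eval(" in line
    ]
```

The design choice worth noticing is that `may_merge` takes no override argument. A floor that can be waived under delivery pressure is not a floor.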

Humans review the intent, and the code only where a mistake is expensive. Human attention is the scarce, costly resource now, so you spend it where the blast radius is high, and let automation plus the spec review cover the rest.

        Every change → AUTOMATED REVIEW
        (static analysis, security scan, tests)
                  │   always, no exceptions
                  ▼
        ┌───────────────────────┐
        │ Is this a HIGH-STAKES  │
        │ change?                │
        └───────────┬───────────┘
            yes  ┌──┴──┐  no
                 ▼     ▼
   ┌─────────────────┐  ┌──────────────────────┐
   │ + HUMAN REVIEW  │  │ Ship independently    │
   │   before merge  │  │ (automated review     │
   │                 │  │ only)                 │
   └────────┬────────┘  └───────────┬──────────┘
            └──────────┬────────────┘
                       ▼
                   VALIDATE

   High-stakes is about blast radius, not feature size:
     • Money movement
     • Authentication and authorization
     • Personal or sensitive data
     • The shared services everything else depends on

The point is to weight the process by risk instead of applying one rule to all code. A change with a small blast radius should not pay the same tax as one that can move money or leak data. Most changes are not high-stakes, so most changes ship with the spec review behind them and automated review as the only check on the code.
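
One way to make the high-stakes question mechanical is to route by the paths a change touches. The sketch below assumes that approach. The directory patterns are hypothetical placeholders, and a real team would substitute its own layout.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical mapping from the four high-stakes areas to path patterns.
# Note that "*" in fnmatch also matches "/" and so covers nested paths.
HIGH_STAKES_PATTERNS: Dict[str, List[str]] = {
    "money movement": ["services/payments/*", "services/ledger/*"],
    "authentication and authorization": ["services/auth/*"],
    "personal or sensitive data": ["services/profiles/*"],
    "shared services": ["platform/*"],
}


def high_stakes_reasons(changed_paths: List[str]) -> List[str]:
    """Return the high-stakes areas that the changed paths fall into."""
    reasons = []
    for area, patterns in HIGH_STAKES_PATTERNS.items():
        if any(fnmatchcase(path, pattern)
               for path in changed_paths
               for pattern in patterns):
            reasons.append(area)
    return reasons


def required_review(changed_paths: List[str]) -> str:
    """Automated review always applies; human review is added by risk."""
    if high_stakes_reasons(changed_paths):
        return "automated + human review before merge"
    return "automated review only"
```

Path matching is only a proxy for blast radius. A one-line change outside these directories can still be high-stakes, so the author of the change should be able to opt in to human review even when the router says it is not required.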

What humans are actually for

Once machines own the mechanical checks, the human review that remains is not about style. It is about judgment that automation cannot supply.

   ┌────────────────────────────────────────────────┐
   │  WHAT A HUMAN REVIEW IS FOR                     │
   ├────────────────────────────────────────────────┤
   │  ✓ Is this solving the RIGHT problem?           │
   │  ✓ Is the approach SOUND?                       │
   │  ✓ Will it behave correctly FOR THE USER?       │
   │  ✓ Does the architecture & intent hold up?      │
   ├────────────────────────────────────────────────┤
   │  ✗ NOT: brace style, naming, line-by-line       │
   │     method critique (automation does this)      │
   └────────────────────────────────────────────────┘

These are questions a tool cannot answer and a model cannot be trusted to answer about its own work. Everything below them, the mechanical layer, belongs to automation.

Prevent, do not just catch

The mechanical layer has an even better home than automation: the model's own instructions. Conventions like naming, structure, and house style belong in the agent's standing instructions, the CLAUDE.md or its equivalent, so the AI writes to them in the first draft instead of getting them wrong and relying on a linter to flag them.

This reframes a whole category of feedback. A nitpick that keeps reappearing is not a problem with the diff. It is a gap in the instructions. Fix the instructions once and that class of mistake stops being generated at all, instead of being caught on every change for the life of the project. The cheapest defect is the one the model never writes.
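
Finding those gaps can itself be automated. The sketch below assumes the team keeps, for each change, the list of rule IDs that linters or reviewers flagged. The function name, the rule IDs in the test data, and the threshold of three changes are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def recurring_findings(findings_per_change: Iterable[Iterable[str]],
                       min_changes: int = 3) -> List[str]:
    """Return rule IDs flagged in at least `min_changes` distinct changes.

    Each inner iterable holds the rule IDs flagged on one change. A rule
    counts once per change no matter how often it fired there. Results are
    ordered most frequent first, ties broken alphabetically.
    """
    counts: Counter = Counter()
    for rule_ids in findings_per_change:
        counts.update(set(rule_ids))
    recurring = [rule for rule, n in counts.items() if n >= min_changes]
    return sorted(recurring, key=lambda rule: (-counts[rule], rule))
```

Each rule this returns is a candidate for a line in the standing instructions. After the instructions are updated, the same rule should drop out of the list. If it does not, the instruction was not specific enough.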

What does not change: ownership

A process that lets people ship without line-by-line human review only works if accountability is explicit. Three things have to stay true, and they are the same whether or not an AI is involved.

   • Keep the spec short. A spec that runs long does not get read properly, which defeats the purpose. State the problem, the approach, and the outcome, and stop.
   • Read your own spec before you share it. Do not pass along something you have not read yourself, AI-drafted or not. You own it.
   • Understand the code you ship. If a model wrote it, you still have to understand it. "The AI wrote it" is not an answer in production.

What these guard against is the real failure mode of AI-assisted work, which is not slow code but unowned code: specs nobody read and code nobody understands, shipped because a machine produced it. Autonomy and trust go up only because accountability goes up with them.
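
Of the three rules, only the first can be partly checked by a machine. The sketch below assumes a plain-text spec whose sections are introduced by a line reading "Problem", "Approach", or "Outcome", with an optional trailing colon. That layout and the 400-word limit are assumptions, not a format this piece prescribes.

```python
from typing import List

REQUIRED_SECTIONS = ("problem", "approach", "outcome")
MAX_WORDS = 400  # assumed limit; pick whatever "short" means for your team


def spec_problems(spec_text: str, max_words: int = MAX_WORDS) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    # Treat any line that is exactly a section name (optionally ending
    # in a colon) as that section's heading.
    headings = {line.strip().rstrip(":").lower() for line in spec_text.splitlines()}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    word_count = len(spec_text.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words (limit {max_words})")
    return problems
```

A check like this says nothing about whether the approach is sound, whether the author read the spec, or whether anyone understands the resulting code. Those stay with the people who own the change.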

Before and after

   ┌──────────────────────────┬──────────────────────────┐
   │         BEFORE           │           NOW            │
   ├──────────────────────────┼──────────────────────────┤
   │ Humans review CODE at    │ Humans agree on the       │
   │ the end                  │ APPROACH up front         │
   ├──────────────────────────┼──────────────────────────┤
   │ Often no agreement on    │ Approach aligned before   │
   │ the approach first       │ any code exists           │
   ├──────────────────────────┼──────────────────────────┤
   │ Rework when the          │ Ship independently once   │
   │ approach was wrong       │ the spec is approved      │
   ├──────────────────────────┼──────────────────────────┤
   │ Some changes shipped     │ Automation reviews all    │
   │ with no review at all    │ code; humans review only  │
   │                          │ the high-stakes changes   │
   └──────────────────────────┴──────────────────────────┘

The old model had a quiet failure. The heavy gate at the end was the one people skipped under delivery pressure, so teams often got the friction of review and unreviewed code at the same time. The new model trades one unreliable late gate for two dependable ones: automation on everything, and human judgment where the stakes are high.

Where this goes

Step back from any single team's process, and three shifts show up wherever AI does the coding.

The bottleneck moved from writing code to deciding what to write. When code is cheap and plentiful, reading all of it by hand stops being possible, so machines take the line-level review and humans concentrate on getting the intent right.

Controls get weighted by risk instead of applied evenly. That means heavy attention on the spec and on the few high-stakes areas, automation everywhere, and a light touch on the rest.

Prevention beats detection. The earlier you push quality, whether into the spec or into the model's instructions, the less you pay to catch the same mistake over and over downstream.

None of this lowers the bar. It moves the bar to where the work now happens. "Review the spec heavily, the code lightly" is not a relaxation of standards. It is a relocation of them, to the front of the process, where a small amount of human judgment still decides whether the next thousand lines were worth writing.