AI, ML & Dev Productivity

Writing Code Is Free; Judgment Isn't

June 2, 2026

Agents aren’t going to stop and ask, “Should we really build this?” They’ll happily fly through your backlog in parallel shouting “Absolutely!” Your only limit is your token budget. The scarce resource is now the judgment to know whether the work is worth doing and whether the answer you got back is the right one.

[Figure: The scarce resource has moved. Before agents, typing the code was scarce, deciding what to build was cheap, and the bottleneck was hands. Now, writing the code is cheap, judgment about it is scarce, and the bottleneck is token budget.]

The project

I created a JIRA ticket titled “Datadog triage agent” and assigned it to an engineer. The title, plus a recent stint on the on-call rotation, pointed them at a reasonable interpretation: our Datadog alerts post to Slack, so have an agent triage those notifications as they come in. Nothing wrong with that, and still something worth tackling.

After some shuffling, I took the project over. I wanted to look at it with fresh eyes, and challenge the assumption that our existing monitors were the right ones.

Let’s add better monitors!

My first idea was to raise the quality of what we alert on. Rather than rely on coarse HTTP 500-level errors, introduce a service-level objective (SLO) for our most critical Workflow Automator flows: creating a workflow, creating a workflow trigger, activating a workflow, running a workflow, and diagnosing a failing run. Add fine-grained error reasons so that humans and agents triaging alerts would know exactly where in the code to go.

I wrote the tickets and had our in-house coding agent implement both the coarse 500-level errors and the fine-grained instrumentation for the many ways each of these flows can fail.

This is getting brittle fast

When the agent and I started testing, we found we were missing a lot of failure scenarios. The agent wanted more code: now we need to go instrument this other service, and then this one over here. More code, more complexity, more maintenance. We were adding instrumentation code to the most sensitive areas of our system. It was brittle, and it would almost certainly drift as we kept changing those areas.

So I stepped back again. The agent fell back to the coarse 500-level API errors. An agent or engineer investigating an incident would have less to go on, but we would no longer have to keep fragile instrumentation current. That matters: this project is meant to reduce risk, and instrumentation in sensitive areas that is likely to go stale and become misleading increases it.

I put up a PR to switch to the 500-level monitors, and then hesitated. If we ship these, how do we know they improve our ability to detect customer-impacting incidents?

Are these even the right monitors?

To answer that, I went after all of them. We have over 4,200 production Datadog monitors. I had Codex and Claude each categorize every one: keep, leave, modify, or combine.

The results came back with very little agreement between the two agents.

Almost as an afterthought, I had also fed in the last six months of incident postmortems. The agreement there was much stronger, and the signal was much higher.

Auditing 4,200 monitors in the abstract is low signal. The postmortems are the primary sources, the actual evidence.

Start from the incidents

So I pulled every incident postmortem from the past year, 68 in total, and had Codex and Claude focus on them in fresh context windows. I asked them to categorize each incident by failure signature (e.g., server error, latency degradation, third-party outage), detection source, and ideal detection mechanism. Then I had a third agent reconcile the two outputs. Agreement was above 80%, and where they disagreed it was the kind of judgment call two engineers could reasonably split on.
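To make the agreement figure concrete, here is a minimal sketch of how the two sets of labels could be compared before handing the disagreements to a reconciler. The incident IDs and labels below are made up for illustration, and `agreement_rate` and `disagreements` are hypothetical helper names, not part of any tool mentioned here.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of shared incidents on which both labelers chose the same label."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("the two label sets have no incidents in common")
    matches = sum(labels_a[k] == labels_b[k] for k in shared)
    return matches / len(shared)


def disagreements(labels_a, labels_b):
    """Incidents labeled differently, as (incident_id, label_a, label_b) tuples."""
    shared = sorted(labels_a.keys() & labels_b.keys())
    return [(k, labels_a[k], labels_b[k]) for k in shared if labels_a[k] != labels_b[k]]


# Illustrative data only: these are not the real postmortem labels.
codex = {
    "INC-1": "server error",
    "INC-2": "latency degradation",
    "INC-3": "third-party outage",
    "INC-4": "data correctness",
    "INC-5": "server error",
}
claude = {
    "INC-1": "server error",
    "INC-2": "latency degradation",
    "INC-3": "third-party outage",
    "INC-4": "server error",
    "INC-5": "server error",
}

rate = agreement_rate(codex, claude)
to_reconcile = disagreements(codex, claude)
print(f"agreement: {rate:.0%}, to reconcile: {to_reconcile}")
```

Only the rows in `to_reconcile` need a third opinion, which keeps the reconciliation step small and makes each judgment call easy to review by hand.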

[Figure: Who caught the incident first (last 12 months, N=68): monitors 38%, customers/staff 62%. Pipeline: pull 68 postmortems from the past year, Codex and Claude each categorize them in a fresh context, and a third agent reconciles disagreements. Signature fields: failure class, detection source, ideal detector. Agreement above 80% across both agents.]

The result: Only 38% of our incidents over the past year were caught by monitors. Customers or internal employees caught the other 62%.

The output included a table of the detector we could have had and the number of incidents it would have covered. The top three came back as:

  1. Datadog Real User Monitoring (RUM) monitors
  2. Async pipeline health
  3. Third-party outage detection

The agent offered to draft a plan for the top three.

The right data, and the wrong conclusion

The table was missing one column: how many incidents in each class we already catch with a monitor today. I pointed this out. The agent told me how brilliant I was for catching it.

Third-party outage detection was third. Once we accounted for existing coverage, it dropped to second to last, because we already monitor third-party outages well. Had I taken the first ranking at face value, I would have spent days or weeks adding detection across our 80-plus vendors, only to catch one additional incident in the past year, instead of spending that time where it could catch ten times as many.

What the data actually said

The 38/62 split invalidated most of the path I had been on:

  1. Triage the alerts we have. Does nothing for the 62% we never alerted on.
  2. SLOs on critical flows. Touches a tiny slice of our surface area.
  3. 500-level errors. Would have caught one incident out of 68, because we already cover that case reasonably well.

The split told us how much we were missing, not where. Ranked by the gap between incidents caused and incidents our monitors caught:

| Class of cause | Incidents | Caught by monitors | Gap |
| --- | --- | --- | --- |
| Datadog RUM | 17 | 5 | 12 |
| Async pipeline health (EventBridge, SQS, background jobs) | 15 | 6 | 9 |
| Data correctness | 7 | 0 | 7 |

Data correctness was near the bottom when just considering incidents caused. Once we factored in incidents we already catch with monitors, it moved to third place.

[Figure: Detection gap by failure class. RUM (frontend): 17 incidents, 12 missed. Async pipeline: 15 incidents, 9 missed. Data correctness: 7 incidents, 7 missed. Third-party outage: 10 incidents, 1 missed (already covered). Rank by gap, not by volume. Among the four classes shown, third-party drops to last when existing coverage is included.]
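The difference between the two rankings is easy to see in a few lines. This is a minimal sketch using only the four classes with numbers given in this post; the full analysis covered more classes. The third-party "caught" count of 9 is derived from the figure (10 incidents, 1 missed), and `FailureClass` and `rank_by` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureClass:
    name: str
    incidents: int           # incidents attributed to this class in the past year
    caught_by_monitors: int  # of those, how many a monitor detected first

    @property
    def gap(self) -> int:
        """Incidents in this class that no monitor caught."""
        return self.incidents - self.caught_by_monitors


# Only the four classes quoted in the post, not the full table.
CLASSES = [
    FailureClass("Datadog RUM", 17, 5),
    FailureClass("Async pipeline health", 15, 6),
    FailureClass("Third-party outage", 10, 9),  # derived: 10 incidents, 1 missed
    FailureClass("Data correctness", 7, 0),
]


def rank_by(classes, key):
    """Class names sorted from largest to smallest value of `key`."""
    return [c.name for c in sorted(classes, key=key, reverse=True)]


by_volume = rank_by(CLASSES, key=lambda c: c.incidents)
by_gap = rank_by(CLASSES, key=lambda c: c.gap)

print("by incidents:", by_volume)
print("by gap:      ", by_gap)
```

Ranked by incidents, third-party outage sits above data correctness. Ranked by gap, the two swap places. The sort key is the whole decision, which is why it is worth asking which one an agent used.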

Agents haven’t replaced judgment

The agents did good work. They categorized 68 postmortems, reconciled their own disagreements, and boiled it down to a simple table with a reasonable plan. The table was a bit too simple, though.

The agent had enough context to catch it itself. It could have said, before ranking anything, “we should compare incidents caused against incidents we already catch.” It didn’t.

[Figure: Interrogate the answer. The agent hands you a ranked list (RUM monitors, async pipeline health, third-party outage) and asks, "Plan for the top three?" It looks reasonable and could cost weeks. Three questions before you act: (1) What dimension did you rank on? (2) Does that dimension map to impact? (3) What context is missing from the table?]

Interrogate agent responses. When they hand you a list, ask what dimension it ranked on and whether that dimension actually maps to impact. “Incidents caused” is not the same as “incidents we aren’t detecting,” and only the second one tells you where to invest. Ask what context is missing. The code to act on the answer is increasingly free. Deciding whether it is the right answer is not, and that is where our jobs are moving.
