In February 2025, AI created 0% of our merged pull requests. By November, agents were creating 15%. Then Opus 4.5 and GPT 5.2 hit a tipping point. We rolled out Claude Code and Codex enterprise plans to every engineer. Within three months, agents were writing 75% of merged PRs.
That shift changed everything about how we think about software engineering. Seemingly overnight, agents came to write the majority of our code. But our organizational and technical infrastructure hasn’t kept pace. The bottleneck moved.
It used to be: How do we write code faster?
Now it is: How do we prove the code works without a human reading every line?
This chapter is about the bet I’m taking for my team. Not “how to use AI more” but something harder: how to build the verification system that makes autonomous delivery possible. If you’re leading an engineering team where agents are writing significant code, this is the playbook.
The five levels of coding autonomy
A useful mental model comes from Dan Shapiro’s “Five Levels” framework. Level 0 is autocomplete. Level 5 is the factory: agents write the code, write the tests, review the changes, deploy, and monitor.
Most teams are at Level 3 or 4. Engineers use Claude Code daily. Agents write most of the code. But humans are still the primary reviewers, the primary deployers, the primary on-call responders. Every step between here and Level 5 involves the same question:
What capability must exist in the harness before we can transition this step to agent-driven?
That question - not “how do we use AI more?” - is the strategic frame for every technical bet.
Code implementation is already agent-heavy. But look at the rest: PR review, deploy, on-call, remediation - all human-gated. Each one transitions to agent-driven only when the harness provides sufficient verification. The harness is the unlock for everything downstream.
What is the harness
The harness is the combined system that proves agent output meets expectations without a human reading every line. It has four components:
Context - everything agents need to understand the system: architecture maps, runbooks, API docs, coding rules, domain knowledge. If it lives in a Slack thread or someone’s head, it doesn’t exist. If something matters, it lives in the repo.
Constraints - mechanical boundaries agents cannot cross: linters, type systems, permission gates, architectural rules. Not guidelines that agents might ignore, but hard walls enforced by tooling.
Verification - proving the output works: end-to-end scenarios, integration tests, holdout test suites that agents can’t trivially overfit, performance gates, satisfaction scoring for fuzzy outcomes.
Feedback loops - the system gets smarter over time: incidents become scenario tests, code review comments become linter rules, production monitors feed back into the harness. An incident that doesn’t become a test is a missed compounding opportunity.
The bottleneck is always the harness. As OpenAI’s team put it: “Our most difficult challenges now center on designing environments, feedback loops, and control systems.” Generation is solved. Verification is the hard problem.
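One way to picture the holdout suites mentioned under Verification is a deterministic split of scenario IDs, so that agents iterate against the visible set while the holdout set is only run by the harness. This is a minimal sketch, not a description of any particular team's setup: the function names, the 20% default, and the hash-bucket scheme are all assumptions made for illustration.

```python
import hashlib


def is_holdout(scenario_id: str, holdout_percent: int = 20) -> bool:
    """Deterministically assign a scenario to the holdout set by hashing its ID.

    Hashing (rather than random sampling) keeps the assignment stable across
    runs, so a scenario never drifts between the visible and holdout sets.
    """
    digest = hashlib.sha256(scenario_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < holdout_percent


def split_scenarios(scenario_ids, holdout_percent: int = 20):
    """Return (visible, holdout) lists; every scenario lands in exactly one."""
    visible, holdout = [], []
    for scenario_id in scenario_ids:
        if is_holdout(scenario_id, holdout_percent):
            holdout.append(scenario_id)
        else:
            visible.append(scenario_id)
    return visible, holdout
```

In this sketch the visible list would be what agents may run and read while working, and the holdout list would be executed only at the verification gate.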
Six principles for decision-making
Every technical investment, every tool choice, every process change gets evaluated against these principles. They’re ordered by priority.
1. Verification over generation
We can already generate code quickly. The constraint is proving it works and ensuring each feature justifies its ongoing maintenance cost. Every investment should be evaluated by how much it increases confidence in agent output, not how much faster agents write code.
2. What agents can’t see doesn’t exist
Slack threads, design decisions in someone’s head, tribal knowledge from three years ago - none of it exists to an agent. If something matters, it lives in the repo as documentation. Think of repo docs as onboarding material for your agent workforce.
3. Encode judgment into the harness
Human judgment about code quality, architecture, and correctness is expensive to apply at scale. When you identify a pattern, a constraint, or a quality standard, encode it: linter rules, test scenarios, architectural boundaries, performance SLOs. Apply it once mechanically instead of catching it in every PR review.
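As an illustration of encoding an architectural boundary mechanically, here is a minimal sketch of an import check built on Python's standard `ast` module. The layer names (`domain`, `web`, `database`) and the rule table are hypothetical; a real codebase would supply its own layering rules and would more likely express them in an existing linter.

```python
import ast

# Hypothetical layering rule: code in the "domain" layer must not import
# from the "web" or "database" packages.
FORBIDDEN_IMPORTS = {
    "domain": {"web", "database"},
}


def find_boundary_violations(source: str, layer: str) -> list[str]:
    """Return one message per absolute import that crosses a forbidden boundary."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in forbidden:
                violations.append(
                    f"line {node.lineno}: {layer} must not import {top_level}"
                )
    return violations
```

Run in CI against every changed file, a check like this fails the build instead of relying on a reviewer to notice the import.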
4. Behavior over code
Don’t rely on humans reading diffs as the primary quality gate. Rely on externally observable behavior proven by the harness - end-to-end scenarios that exercise the real system. Humans still intervene when risk is high or harness coverage is low, but the default is automated verification.
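The routing rule in this principle (automated verification by default, human review when risk is high or coverage is low) can be sketched as a small decision function. Everything here is an assumption for illustration: the risk labels, the 0.8 coverage threshold, and the three outcome names are not from any specific tool.

```python
from dataclasses import dataclass

# Assumed threshold: below this share of covered behavior, a human looks at the change.
MIN_SCENARIO_COVERAGE = 0.8


@dataclass(frozen=True)
class ChangeAssessment:
    risk: str                 # hypothetical labels: "low", "medium", "high"
    scenario_coverage: float  # fraction of touched behavior exercised by scenarios, 0.0 to 1.0
    harness_passed: bool      # did every harness check pass?


def route_change(assessment: ChangeAssessment) -> str:
    """Decide what happens to a change: "reject", "human_review", or "auto_merge"."""
    if not assessment.harness_passed:
        return "reject"
    if assessment.risk == "high" or assessment.scenario_coverage < MIN_SCENARIO_COVERAGE:
        return "human_review"
    return "auto_merge"
```

The design choice worth noting is the ordering: a failing harness rejects the change outright, and human attention is spent only on changes that passed but that the harness cannot yet vouch for.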
5. Close the loop
Every system that generates information - incidents, alerts, test failures, customer feedback - should feed back into the harness. An incident that doesn’t become a scenario test is a missed compounding opportunity. A code review comment that doesn’t become a linter rule will be repeated forever.
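A loop like this only closes if something checks that it closed. The sketch below, under assumed data shapes, lists incidents that have no scenario test linked to them; the `Incident` and `Scenario` records and the `covers_incident` field are hypothetical stand-ins for whatever an incident tracker and test suite actually expose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incident:
    incident_id: str
    summary: str


@dataclass(frozen=True)
class Scenario:
    name: str
    covers_incident: Optional[str] = None  # ID of the incident this scenario reproduces


def incidents_without_scenarios(incidents, scenarios):
    """Return IDs of incidents that no scenario claims to cover, in input order."""
    covered = {s.covers_incident for s in scenarios if s.covers_incident}
    return [i.incident_id for i in incidents if i.incident_id not in covered]
```

A scheduled job could fail, or open a task for an agent, whenever this list is non-empty.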
6. Legacy code is a specification
Your legacy monolith isn’t an obstacle to autonomous delivery. It’s a living specification. Given a comprehensive enough verification harness, agents could rewrite subsystems from scratch. The harness defines the contract. The implementation is fungible.
What changes, what stays the same
Some things we currently do exist to serve humans and become less relevant:
Code style debates become irrelevant when agents rewrite freely. Architectural boundaries matter. Formatting preferences don’t.
PR review as quality gate gives way to automated verification. Human review shifts to reviewing specs, scenarios, and harness coverage - not diffs.
Coordination overhead decreases when agents batch PRs, manage feature flags, stack changes, and work 24 hours a day.
Some things become more important:
Specification quality. The spec is the primary human artifact now. If the spec is wrong, the code will be wrong at speed. A vague spec with human engineers meant slow, messy code. A vague spec with agents means fast, confidently wrong code.
Scenario coverage. The breadth and realism of your end-to-end scenarios determine your confidence ceiling. This is now the most important technical investment a team can make.
Architectural constraints. Rigid boundaries, enforced mechanically, are what allow agents to move fast safely. The more you can encode into the system, the less you need humans in the loop.
Observability. Agents need logs, metrics, and traces - but programmatically accessible, not through dashboards. If your monitoring requires a human staring at Grafana, agents can’t use it for automated remediation.
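To make "programmatically accessible" concrete, here is a minimal sketch of a check an agent could run on a metrics query result instead of reading a dashboard. It assumes, purely for illustration, that metrics live in Prometheus and that the response body came from its instant-query HTTP endpoint with a vector result; the function name and the threshold are hypothetical.

```python
import json


def error_rate_exceeds(response_body: str, threshold: float) -> bool:
    """Return True if any series in a Prometheus instant-query response exceeds threshold.

    Assumes the documented response shape for a vector result:
    {"status": "success", "data": {"resultType": "vector",
     "result": [{"metric": {...}, "value": [<timestamp>, "<value as string>"]}]}}
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("metrics query did not succeed")
    samples = payload["data"]["result"]
    # Each sample's value is a [timestamp, string] pair; the number is at index 1.
    return any(float(sample["value"][1]) > threshold for sample in samples)
```

A remediation agent could call a function like this after a deploy and roll back when it returns True, with no human watching a graph.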
The risks worth tracking
Cognitive debt
The gap between system complexity and human understanding. As agents change code faster than humans can comprehend it, you accumulate a compounding comprehension gap. Mitigations: understand the system through specs, scenarios, and architecture docs, not code. The code is allowed to be opaque, just like compiled binaries. Keep the behavior layer current. Use agents to generate documentation - they can read the entire codebase each time.
Security and correctness
Agent-written code can introduce bugs, security vulnerabilities, performance degradation, and data loss. Mitigations: encode security rules that get picked up by every agent, create dedicated review agents with security-specific prompts, start with low-risk internal tools and expand to external-facing code as you gain confidence.
Over-speccing
When anyone in the organization can ask agents to add features, you risk a software explosion. Each new requirement is simple in isolation, but complexity compounds. Even with free code generation, every feature has an ongoing maintenance and verification cost. The highest-quality, fastest software is no software at all.
Token costs
Agent-heavy teams carry a significant AI budget across Claude Code, Codex, and other inference providers, and that budget grows with adoption. But with three frontier providers competing, prices should stay competitive. Open-source models lag frontier performance, but they cost a fraction as much and are good enough for many subtasks, such as compaction, validation, and code search.
The job description is changing
The engineer’s job shifts from writing and reviewing code to:
- Specifying intent - defining what should change, clearly enough that agents can converge without clarification
- Building verification - scenarios, holdouts, satisfaction scoring, performance gates
- Designing feedback loops - incident to scenario to prevention, automatically
- Supervising execution - orchestrating agents, escalating when harness coverage is low
We have decades of experience building software where humans write the code. Most of our engineering best practices, tools, and organizational structures assume that. Agents now write the majority of code, and that percentage will approach 100%.
The companies that build the best harnesses will ship faster and with higher quality. The harness is the product. Let’s build.
References
- The Five Levels: from Spicy Autocomplete to the Dark Factory - Dan Shapiro’s framework for coding autonomy levels
- Harness Engineering: leveraging Codex in an agent-first world - OpenAI’s take on building verification systems for agent output
- StrongDM Software Factory - StrongDM’s implementation of the software factory pattern
- How StrongDM’s AI team builds serious software without even looking at the code - Simon Willison’s deep dive into the factory approach
- The Future of Software Engineering - ThoughtWorks retreat key takeaways on where the industry is heading