I don't read most of the code my agents write

The first time I coded with an LLM, I copied files out of my editor into a ChatGPT tab and pasted the answer back by hand. Then Cursor’s tab completion showed up and felt like magic: the machine finished my line before I did. Then I learned to point at a file and say exactly what to change, and a good model would mostly do it. Now I describe the result I want and let the agent work out how to get there.

Notice what moved. At every step, the hard part moved further from the keyboard. From typing, to prompting, to giving context, to saying clearly what I want. The skill that mattered kept changing.

This post is a snapshot of my workflow today, June 2026. Not a manifesto, not a how-to guide. It’s a dated diary entry I can argue with in six months, when the models have changed again and half of this looks old.

TL;DR

Specs over code. My most valuable work is documents now: roadmaps, PRDs, issues. Intent has the biggest effect on the result. It is not most of the work.
I read the critical code and skim the rest. I sort the code by how important it is, then trust a fast, automated feedback loop for everything else. How fast that loop runs is the real speed limit.
No full autonomy. I keep product, scope, architecture, and patterns. The agent runs a plan I watched get built: it’s free to act on the execution, not on the decisions.
/ship-epic ships a whole epic on its own, but it’s the riskiest bet here, and I’m already pulling it back toward a queue that waits for a human.
The job I can’t hand off is keeping the slices coherent with each other: senior judgment that no single loop has. It’s why this is a dated diary, not a how-to guide.

One thing to make clear up front, because it sets the limits of everything that follows.

The most valuable thing I do is no longer writing code

Most of my energy now goes into documents. Roadmaps, PRDs, technical decisions, issues. Defining what gets built, how it should behave, where the architectural lines sit. The actual code is the part I think about least.

That sounds like a line from a productivity influencer, so let me be exact about what I am and am not claiming. I used to say something like “90% of the work is intent now.” That number was made up, and it also contradicted itself: if intent were 90% of the work, I wouldn’t need all the machinery I’m about to describe to catch everything intent misses. So I stopped saying it.

Here’s the honest version. Intent is the input with the biggest effect, not the bulk of the work. A clear spec changes the result more, per minute spent, than anything else I touch. It is necessary. It is far from enough.

I’m not alone in this. Addy Osmani has been calling the same thing “intent debt.” His point is that intent is the one input that still has to start with a human. When a Google engineering lead and I reach the same idea on our own, I’ll be honest: that part of my thesis isn’t original. It’s becoming the common view. The original part comes later, and it’s less comfortable.

The other half of the thesis is a belief: I don’t trust full autonomy, and the more important the system, the less of it I hand over. If an agent makes all the decisions, what is the engineer for? I want the agent to ask me things, to raise the decision that matters and wait, not to decide on its own and move on. I stay the supervisor who owns direction, architecture, and the patterns the code is meant to follow. Osmani calls the human “the director of the show.” That’s the job I’m keeping.

Before I write a spec, I argue with myself

Greenfield starts away from the machine. I brainstorm the idea myself, or with colleagues: what to build, what’s out of scope, what’s for a later phase, whether a feature’s complexity is worth the value it gives a user. I can’t stand high-complexity, low-value features, and that judgment is mine. I won’t hand it off.

Then I bring in the LLM, and not as a co-author. As a critic. Find the holes in my reasoning. Ask me the questions I didn’t think to ask but have to answer before anyone writes a line. The product decisions stay with me; the model’s job is to attack them, not make them.

The mechanics are dull. I brain-dump by voice into one long markdown file until I’ve got nothing left. I sketch screens and flows in FigJam. Then I ask the model to turn the dump into specific documents: a roadmap, a PRD, a technical-decisions file. Some of those technical calls I make up front without the model at all, because I know the context and don’t want to give it a free hand on decisions I already have opinions about.

I split the roadmap into self-contained epics: big features that ship on their own and each deliver real value. A recent one broke into seven epics that together made up the MVP.

One epic at a time, and I finish it before I touch the next

Each epic goes through the same routine. I lean on Matt Pocock’s skills here, with GitHub Issues as the tracker.

First I grill myself. Pocock’s grill-with-docs, scoped to a single epic, with the technical-decisions file and the PRD loaded into context so the output stays tied to the high-level calls I already made. One hard session, up to a hundred questions in a single pass. This is not the shallow version. Pocock built that skill to be relentless, and relentless is the point; the rule I set for myself is “don’t reopen the grill bit by bit across days,” not “ask fewer questions.” It’s one deep pass, not a string of shallow ones.

Then /to-prd turns the session into a structured PRD full of user stories and requirements, but only while the context window still has room to do it well. This is the document I read, every word. It’s the last reliable moment to confirm we’re building what I meant. Then /to-issues cuts the PRD into tasks, which is the closest thing left to a coding session.

The part people argue with is that I refuse to grill the next epic until this one is fully built. Grill, PRD, issues, build everything, review, refactor. Then the next epic.

The reason is drift. During the build, things change. Those changes feed back into the roadmap and the PRD, and I want the next epic’s grilling to sit on top of the final, real build of the last one, not a plan that’s already three commits out of date. So I never grill backward. When a later epic shows that an earlier PRD was wrong, I don’t reopen a grilling session. I treat it as a normal code change plus a PRD update to keep the document matching the code. The forward order exists so I rarely have to look back.

I’ll admit the weak spot, because a critic will find it anyway: this assumes the early epics are solid enough that late fixes are cheap edits. That’s true when the boundaries are clean. If epic one is a foundational data model that epic six breaks, the “cheap edit” gets expensive. I put the riskiest foundational work early for that reason, and I won’t pretend the risk is zero.

I don’t read most of the code, and that’s the part I’ll defend

Here’s the uncomfortable one.

The agent produces more than I can read, by a wide margin. So I made a deliberate trade: I read the code that’s critical and I skim the rest, while watching the architecture the whole time. Mind-to-architecture, not mind-to-code. I hold the system in my head at the level of modules, patterns, and boundaries, not every line, and I trust the PRD, the tests, and CI to cover the line-by-line detail.

Someone I respect thinks this is exactly where I’m wrong, and I’m going to quote him rather than hide him.

He’s telling me that the comprehension I’ve relabeled as a bottleneck is the actual work, and that I’ve left the real work behind.

He might be right. So let me say where he’s right. The failure I’m most afraid of is real: the edge-case bug, the security hole, the timing race that passes every green test and lives forever in code I never read. That is the classic way “trust the loop” gets you killed, and “the loop should have caught it” is not an answer. It’s the very thing in question. This is Osmani’s hard 30%: the part AI gets you to fast and then can’t finish.

But notice the shape of the disagreement, because it’s narrower than it looks. I’m not arguing that code doesn’t matter or that understanding it is optional. I’m arguing about which lines deserve line-by-line attention. Osmani reads more; I sort by how critical the code is and lean harder on the feedback loop for the rest. That’s a real argument with a real answer on each side, not “read or don’t read.” The honest version is: I made a conscious bet on the feedback loop for the non-critical majority, and I’ve named the weak point I’m betting around instead of pretending it isn’t there. If you take one thing from this post, take the sorting by importance, not “stop reading code,” which is the easy, distorted version of my own position and the one I’d attack if someone else wrote it.
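The sorting can get a mechanical first pass. Here is a minimal sketch, assuming criticality can be approximated by file path. The glob patterns and the `triage` helper are hypothetical, made up for illustration, and the real call on what counts as critical stays a judgment.

```python
from fnmatch import fnmatch

# Hypothetical patterns: which paths count as critical is a per-project
# judgment call. These are illustrative only.
CRITICAL_PATTERNS = ["src/auth/*", "src/billing/*", "migrations/*"]


def triage(changed_files, critical_patterns=CRITICAL_PATTERNS):
    """Split changed files into a 'read' pile and a 'skim' pile."""
    piles = {"read": [], "skim": []}
    for path in changed_files:
        critical = any(fnmatch(path, pattern) for pattern in critical_patterns)
        piles["read" if critical else "skim"].append(path)
    return piles


piles = triage(
    ["src/auth/session.py", "src/ui/button.py", "migrations/0007_users.sql"]
)
print(piles)
```

The "read" pile gets line-by-line attention. The "skim" pile is what the bet on the feedback loop covers.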

The rate of feedback is your speed limit

That bet only works because of what sits under it. The rule I keep coming back to:

For an agent to run without me watching over it, it needs a tight, fully automated loop: linting, formatting, type-checking, unit tests, integration and e2e tests, wired so the agent gets the result itself, without me in the middle. I wire it through git hooks and harness hooks: pre-commit, pre-push, all of it. The agent commits, the loop runs, the agent sees what broke and fixes it before it ever reaches me.

This is the dull infrastructure that makes everything above sound less reckless than it is. The faster and more trustworthy that loop, the more freedom I can give the agent. The looser it is, the more I have to read by hand. The loop is not a nice-to-have bolted on at the end. It’s the thing that sets how much autonomy is safe to grant in the first place.
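To make the shape of that loop concrete, here is a minimal sketch of a check runner that a git hook could call. It is an assumption-laden illustration, not my actual hook: the `run_checks` helper is hypothetical, and the demo commands are stand-ins for a project’s own linter, type checker, and test runner.

```python
import subprocess
import sys


def run_checks(checks):
    """Run (name, command) pairs in order and stop at the first failure.

    Returns (ok, report). The report is the text the agent would read.
    """
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            report = f"{name} failed:\n{result.stdout}{result.stderr}"
            return False, report
    return True, "all checks passed"


# Stand-in commands so the sketch runs anywhere. A real hook would list the
# project's own linter, formatter, type checker, and test runner here.
demo_checks = [
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("types", [sys.executable, "-c",
               "import sys; print('type error in foo.py'); sys.exit(1)"]),
    ("tests", [sys.executable, "-c", "print('never reached')"]),
]

ok, report = run_checks(demo_checks)
print(ok)
print(report)
```

A pre-commit hook that exits non-zero when `ok` is false blocks the commit, and the report is what the agent reads to fix the failure without a human in the middle.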

The skill that ships an epic while I’m away from the keyboard

Now the apparent contradiction. I just told you I don’t trust full autonomy. I also have a skill called /ship-epic that takes a whole epic and ships it end to end, on its own, for hours.

A sharp reader catches that, so let me not bury it.

What it does: a thin coordinator that never writes code itself. For each child issue, in order:

Sync main, then spawn a fresh subagent to build the issue with TDD on its own branch.
Open a PR and wait for CI. If it’s red, spawn a fix subagent with the failing logs. Never skip, never give up. After two failures it spawns a re-planning subagent to try a different approach.
Merge, comment progress on the parent issue, move to the next.

The state lives in GitHub, in the run-start commit and the per-merge progress comments, not in the context window, so it survives compaction and can recover by reading the issue. It never merges red CI, never pushes to main directly, and closes the parent epic with an architecture pass, a review pass, and a structural cleanup once the last issue lands.
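The control flow is simple enough to sketch. This is a simplified illustration under stated assumptions, not the skill itself: every function name is hypothetical, the subagents and GitHub calls are replaced by injected callables, and “after two failures” is read here as “every second red CI run triggers a re-plan.”

```python
def ship_epic(issues, build, ci_is_green, fix, replan, merge, progress_log):
    """Work through child issues in order, merging only on green CI.

    progress_log stands in for the progress comments on the parent issue:
    a resumed run skips whatever is already recorded there.
    """
    for issue in issues:
        if issue in progress_log:
            continue  # already merged in an earlier run
        pr = build(issue)
        failures = 0
        while not ci_is_green(pr):
            failures += 1
            # Assumed reading of the rule: every second failure re-plans.
            pr = replan(issue, pr) if failures % 2 == 0 else fix(issue, pr)
        merge(pr)
        progress_log.append(issue)
    return progress_log


# Fakes so the sketch runs on its own: a PR is a dict, and CI for issue-2
# goes green only once the PR has been re-planned.
events = []


def build(issue):
    events.append(f"build {issue}")
    return {"issue": issue, "replanned": False}


def ci_is_green(pr):
    return pr["issue"] != "issue-2" or pr["replanned"]


def fix(issue, pr):
    events.append(f"fix {issue}")
    return pr


def replan(issue, pr):
    events.append(f"replan {issue}")
    return {**pr, "replanned": True}


def merge(pr):
    events.append(f"merge {pr['issue']}")


log = ship_epic(
    ["issue-1", "issue-2", "issue-3"],
    build, ci_is_green, fix, replan, merge,
    progress_log=["issue-1"],  # pretend an earlier run already merged this
)
print(events)
```

Because progress lives in the log rather than in the coordinator’s memory, the resumed run skips issue-1 and picks up at issue-2.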

Here’s why it isn’t quite the contradiction it looks like. I grant autonomy only at the execution layer, over a scope I already locked upstream. The agent never decides what to build or why. I did that, in the brainstorm, the grill, the PRD, the issues. It runs, on its own, a plan a human already supervised into being. Anti-autonomy on the decisions; pro-autonomy on the execution. And notice the shape: it isn’t really a loop, it’s a queue, ordered child issues picked off one at a time, the way a dev team has always worked. Pocock is right that “agentic loops” are mostly away-from-keyboard work wearing a more exciting name.

I changed my mind about /ship-epic while writing this. It’s an experiment, and it’s the thing I’m least sure about in this post. It works. I’m happy with what it produced, and happy I built it. But it’s not something I’d run blindly on every epic, and I can see myself dropping it. Running it taught me the opposite of what I set out to prove: the checkpoint I removed wants to come back. A queue with a human gate, one I push back as trust grows, beats a loop with no gate at all. The real value was never in the unattended run anyway. It was always in everything before it: the brainstorm, the grill, the PRD. The autonomous run is the cheap part dressed up as the impressive part.
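One possible shape for that gate, as a sketch only. The `run_gated_queue` helper and its `gate_every` dial are hypothetical, and this assumes the human reviews what landed since the last gate before the queue is allowed to continue.

```python
def run_gated_queue(issues, ship, approve, gate_every=1):
    """Ship issues in order, pausing for a human after every `gate_every` issues.

    gate_every=1 means a person signs off after each issue. Raising it is the
    "push the gate back as trust grows" dial. The run stops as soon as
    approval is withheld.
    """
    shipped = []
    for index, issue in enumerate(issues, start=1):
        ship(issue)
        shipped.append(issue)
        at_gate = index % gate_every == 0 or index == len(issues)
        if at_gate and not approve(list(shipped)):
            break
    return shipped


reviews = []


def approve(batch):
    """Hypothetical reviewer who halts the run at the second gate."""
    reviews.append(batch)
    return len(batch) < 4


shipped = run_gated_queue(
    ["a", "b", "c", "d", "e"],
    ship=lambda issue: None,  # stand-in for the real build-and-merge step
    approve=approve,
    gate_every=2,
)
print(shipped)
print(reviews)
```

With `gate_every=2` the reviewer sees the work after the second and fourth issues, and the fifth never starts once approval is withheld.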

This reopens the comprehension problem, harder. Now nobody reads the PR before it merges. Only CI gates it, on code I’ll at best skim.

And it runs without a sandbox: an unattended agent with merge rights and full access to my machine, which is exactly the thing you’re supposed to isolate, and I haven’t. I know it’s a hole, and on a bad day a dangerous one. It’s on the list and not done, and I’d rather name it than pretend the net is complete.

I’m not going to wave any of that away. The answer is the same: sort by importance, plus the review at the end as a backstop. Critical code gets read, the loop covers the rest, and the architecture and review passes at the end of the run catch what slipped through. But the real safety net here isn’t the test suite. It’s a person. Which brings me to where this whole thing breaks.

Where it breaks

Two confessions.

The first: for a while I thought I didn’t need an IDE anymore. The agents do the typing, so what’s the editor for? I was wrong. I need the IDE, just not for typing. I need it to control how the project grows and to keep a live map of the structure in my head. The moment I let that map go stale, the architecture on disk drifts away from the architecture in my head, and the whole mind-to-architecture bet stops paying off, because the thing I’m supervising is no longer the thing that exists. The IDE is how I keep the two in sync.

The second is deeper, and it’s the honest answer to “what could go wrong with /ship-epic.” The danger isn’t inside any single loop. Each slice passes CI, passes review, looks fine. The mess builds up in the connections between the slices: the architecture that forms across issues, which no single loop owns. Without senior judgment watching over that, the gaps fill up with mess.

So my real job, the one I can’t hand off, never stops: hunting for ways to improve the codebase and doing them as I find them, setting patterns and guardrails so the next feature has a clear path to follow. The loop ships features. I keep them coherent with each other. That’s the core senior-developer work, and it’s what someone without the experience can’t supply.

Which is why none of this is a how-to guide. Every skill here can be copied. You could take my /goal, my /ship-epic, my hooks tomorrow. What you can’t copy is the judgment about when to trust them: which code is critical enough to read, how deep to grill, when a feature is too messy to let the loop run, when the mess between slices is about to pile up. The prompts aren’t the value. The experience to read the situation is.

The three days my edge disappeared

I want to end on the thing that scares me a little, because a diary that only flatters its author isn’t worth keeping.

Everything I just called impossible to hand off — the senior judgment, the keeping-the-loops-coherent work, the reason this can’t be a how-to guide — I watched a model do for me. For about three days.

In early June, Anthropic released Fable 5 and Mythos 5, two of its most powerful models. I had Fable 5 for a few days before it disappeared. And in those few days, it held the architecture together across autonomous slices, the thing today’s models drop on the floor. The mess between the loops, the part I just told you was my job alone: it was filling that in. Oh, this can’t last, I remember thinking. It didn’t.

On June 12, Anthropic received a US-government export-control order on national-security grounds and, to comply, shut both models down. The shutdown was first framed as a restriction on foreign-national access, but in practice it applied to everyone. About three days live, then gone. (Anthropic’s statement.)

And the detail I can’t stop thinking about. The capability they flagged, the thing serious enough to pull a model used by millions, was, in Anthropic’s own words:

That’s my workflow. That’s /ship-epic. The model got good enough at the exact thing this whole post is about that the government took it away.

So I’m not going to tell you experience is a permanent edge. It matters today, with these models. I watched a frontier model take a chunk of it away for three days, and then watched the evidence get deleted. That’s why this is a dated entry and not a manifesto. The only honest claim is about right now.

The one question this series will track

If I’m going to keep a diary, it needs one question I answer every time, so future entries can separate “the model got better” from “I changed how I work.”

Each entry breaks that question into three parts:

What did the agent take over since last time, and was that because the model got good enough, or because I changed my own process and trust?
Where is the human still essential: the decisions and reads I refuse to hand off?
Where did it break: what burned me this time?

This year’s baseline: the agent took over TDD building and most line-by-line reading. I experimented with handing it full-epic execution on its own, and I’m already pulling that checkpoint back, toward a queue I gate by hand and loosen as trust grows. I’m still essential on product and scope, architecture, the grilling answers, the critical-code review, and the call on whether to grant autonomy at all. And it burned me by letting the codebase grow out from under my mental map until I picked the IDE back up.

Ask me again in six months. The honest bet is that the line moves, and that writing down where it sits today is the only way I’ll know whether it was the model that changed, or me.
