Building Autonomous Development Skills for Claude Code
We've been running Claude Code as our primary development tool for a few months now. Early on we noticed a pattern: for any non-trivial feature, we'd spend more time re-explaining project conventions to the agent than actually building the thing. Agents would guess at interfaces, forget to register services, and produce code that compiled but was wired up wrong.
So we built two skills that changed the game: /plan-work (produces a structured plan) and /develop (takes a spec and autonomously produces a ready-to-push branch). Here's how we built them, where we had to get creative, and how you can build your own.
What These Skills Actually Do
/plan-work takes a feature request, bug report, or improvement idea and produces a plan. Not a vague "here's roughly what we should do" plan — a contract. Every file to create or modify, every test to write, every wiring point to connect, and which agents handle each phase. It stops at the plan. No code.
/develop takes a spec (or even a rough idea) and chains the whole thing together: setup, planning, implementation, validation, audit, quality review, push. No manual checkpoints. It escalates only when it's genuinely stuck.
The relationship is simple: /develop calls /plan-work as its Phase 1. Standalone /plan-work is for when you want to think before you build.
Why Plans Matter More Than You'd Think
Without a plan, agents guess. Their guesses are inconsistent with each other. A backend agent invents createUser(name, email). A frontend agent calls createUser(userData). Both report DONE. Nothing compiles.
The plan eliminates this by making every interface explicit before any code is written. It's a contract between the orchestrator and the agents that execute the work.
The Basics
Skill File Structure
Skills live in .claude/skills/ as SKILL.md files with YAML frontmatter:
.claude/skills/
├── plan-work/
│ ├── SKILL.md
│ └── references/
│ ├── templates/
│ │ ├── feature.md
│ │ ├── bug-fix.md
│ │ └── improvement.md
│ └── wiring-checklist.md
└── develop/
    └── SKILL.md

Agents (the specialists that do the heavy lifting) live in .claude/agents/ as markdown files.
Frontmatter
Every skill starts with YAML that tells Claude Code when to trigger:
---
name: plan-work
description: "Plan any work — features, bug fixes, improvements.
Trigger for 'plan a feature', 'I need to build', 'fix this bug'.
Do NOT trigger when the user says 'implement plan' or 'start building'
— those mean execute, not plan."
---

The "Do NOT trigger" part is critical. Without explicit non-triggers, scope bleeds between skills and you get plans when you wanted code.
Building /plan-work
Build the planner first. The orchestrator depends on it, and you need the planner to be solid before layering automation on top.
Step 1: The Wiring Checklist
This is the single most valuable artifact we created. It's a table of every registration and configuration point in your codebase that a new feature might need:
# Wiring Checklist
| # | Wiring Point | File | When Needed |
|---|---|---|---|
| W1 | DI module created | src/modules/Feature.ts | Any new services |
| W2 | DI module registered | src/app.ts | Always when W1 |
| W3 | Route registered | src/routes/index.ts | Any new endpoint |
| W4 | Migration file | db/migrations/ | Any schema change |
| W5 | Test factory updated | test/factories/ | Any new entity |

Why does this matter so much? The most common failure mode in autonomous development is missing wiring. The agent creates a beautiful service, then forgets to register it. The wiring checklist catches this mechanically — no LLM judgment required.
For each applicable W#, the plan validation step just runs:
grep -nF 'W3' plan.md   # Does the plan mention route registration?

Missing in the plan means missing in the implementation. Every time.
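The per-W# grep scales naturally to a loop. A minimal Python sketch, assuming the set of applicable wiring IDs was decided earlier in planning (the function name and example plan text are ours):

```python
import re

def missing_wiring(plan_text: str, applicable: list[str]) -> list[str]:
    """Return the checklist IDs the plan never mentions."""
    return [w for w in applicable if not re.search(rf"\b{w}\b", plan_text)]

# Illustration: a plan that registers a route (W3) but forgot the migration (W4).
plan = "- W3: register route in src/routes/index.ts"
print(missing_wiring(plan, ["W3", "W4"]))  # any non-empty result is a plan failure
```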
Step 2: Plan Templates
Create one template per work type. A feature template needs these sections at minimum:
- Summary — what, who benefits, which layers
- Requirements Inventory — every discrete requirement in a table
- Behavioral Rules — non-obvious business logic agents must get right
- Reference Feature — closest existing feature as a pattern
- Data Model — tables, columns, FKs, migration file (if DB in scope)
- Tests FIRST — test cases before implementation, per layer. Every row names the scenario.
- Implementation — ordered file list that makes the tests pass
- File Manifest — every file to create or modify. If it's not listed, no agent touches it.
- Wiring Checklist — every applicable W# from your checklist
- Execution Strategy — how work parallelizes, which agent handles each phase
- Verification — test commands per layer
Bug fix and improvement templates are simpler variants of the same idea.
Step 3: The Anti-Patterns List
This one cost us a few iterations to get right. Plans need to be explicit enough that an agent reading a single row in isolation can act on it without guessing. These patterns are plan failures:
- "TBD", "TODO", "implement later"
- "add appropriate error handling"
- "write tests for the above" (name the actual test scenarios)
- "similar to X" without specifying what's the same and what differs
- Prose where a signature belongs — "the service should accept an ID and return the user" is not a spec. Write async findById(id: UserId): Promise<User | null> instead.
The rule of thumb: if an agent reading only this row had to make a judgment call, the row is incomplete.
Step 4: Plan Validation
Before the plan is presented, validate it. Three categories of checks:
MECHANICAL (binary, no interpretation):
# Any hit = plan failure, not a warning
grep -nEi '\b(TBD|TODO|FIXME)\b|similar to|wire it up|add appropriate|write tests for the above' plan.md

STRUCTURAL (grep extracts candidates, LLM interprets):
# Extract all function declarations, verify references match
grep -nE 'function [A-Za-z]+\(' plan.md
# Check each applicable wiring point appears
grep -nF 'W1' plan.md

JUDGMENT (pure LLM reasoning):
- Tests appear before implementation in every section?
- No proposed code duplicates existing helpers?
- Every file manifest row traces to a requirement?
The mechanical checks are the most important. A placeholder grep that finds nothing is a pass. A grep that finds "TBD" is an automatic failure. No interpretation, no rubber-stamping.
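If you prefer to run the mechanical tier from a script rather than raw grep, it can be a single function. A sketch in Python, with the banned-phrase list taken from the grep above:

```python
import re

# Banned placeholder phrases: any match is a hard failure, not a warning.
BANNED = re.compile(
    r"\b(TBD|TODO|FIXME)\b|similar to|wire it up|add appropriate"
    r"|write tests for the above",
    re.IGNORECASE,
)

def mechanical_check(plan_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every banned-phrase hit in the plan."""
    return [
        (n, line)
        for n, line in enumerate(plan_text.splitlines(), start=1)
        if BANNED.search(line)
    ]
```

An empty result is a pass; anything else fails the plan outright, no interpretation involved.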
Building /develop
The Core Insight: The Document Is the State
Every autonomous orchestration system needs to answer: where does state live? In the conversation context? In variables? In the agent's "memory"?
We put it on disk. A single markdown file — build/develop/{feature-name}.md — holds everything: the spec, the requirements, the plan, the execution log, the validation history. All agents read from it, write to it, and commit it.
This gives us two things for free:
- Mid-flight resume — if the conversation dies, re-invoke /develop and it picks up from where the document says it stopped
- Thin orchestrator — the orchestrator holds only the roadmap (~30 lines of Execution Strategy) and phase summaries (1-2 lines each). Everything else is on disk.
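Mid-flight resume then reduces to reading the Status line back off disk. A sketch, assuming the `## Status:` heading format of the living document (the function name is ours):

```python
import re

def current_phase(doc_text: str) -> str:
    """Read the phase back out of the living document's Status line."""
    m = re.search(r"^## Status:\s*(.+)$", doc_text, re.MULTILINE)
    # No Status line yet means no document yet: start from setup.
    return m.group(1).strip() if m else "Phase 0"

doc = "# Search Filters\n## Status: Executing\n## Spec\n..."
print(current_phase(doc))
```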
The Living Document
# {Feature Name}
## Status: {Phase 0 | Planning | Executing | Validating (iter N) | Complete}
## Spec
{Original or generated spec text}
## Requirements
| # | Requirement | Layer(s) | Status |
## Plan
{Full plan output}
## Execution Strategy
{Extracted from plan — the orchestrator's roadmap}
## Execution Log
| Phase | Agent | Status | Summary | Commit | Handoff Notes |
## Validation History
### Iteration {N}
- Severity histogram: high={n}, medium={n}, low={n}
- Progress vs iter {N-1}: PROGRESS | NOT_PROGRESS

Phase Structure
Phase 0: Setup — parse input, create branch, create doc, extract requirements
Phase 1: Plan — dispatch agent to run /plan-work, store Execution Strategy
Phase 2: Execute — run each phase from the strategy, verify each agent's output
Phase 3: Validate — loop: pre-filter → validator → convergence check → fix → repeat
Phase 4: Quality — domain-specific code reviewers
Phase 5: Push — commit, rebase, push, PR

The Agent Briefing Pattern
This is where we had to get most creative. How you brief agents determines whether the whole thing works or produces garbage.
What Every Agent Gets
- Path to the living document
- Which section to read for input
- Which section to update with output
- Instructions to commit after making changes
- Prior Phase Handoff Notes — what previous agents said downstream needs to know
The ASSUMPTIONS Line
Every agent must begin its reply with:
ASSUMPTIONS: <one-line restatement of what it thinks it was asked to do>
STATUS: DONE | DONE_WITH_CONCERNS | BLOCKED | NEEDS_CONTEXT | PARTIAL

The ASSUMPTIONS line is the single highest-leverage signal. If the agent restates the task wrong, you catch it before reading the diff. Reject mismatched assumptions immediately and re-dispatch.
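Checking the header can be mechanical before any human (or orchestrator LLM) reads further. A Python sketch of the parse-and-reject step, with names of our own choosing:

```python
import re

VALID_STATUS = {"DONE", "DONE_WITH_CONCERNS", "BLOCKED", "NEEDS_CONTEXT", "PARTIAL"}

def parse_header(reply: str):
    """Pull (assumptions, status) off the first two lines of an agent reply."""
    m = re.match(r"ASSUMPTIONS:\s*(.+)\s*\nSTATUS:\s*(\w+)", reply)
    if not m or m.group(2) not in VALID_STATUS:
        return None  # malformed header: reject the reply outright
    return m.group(1).strip(), m.group(2)

reply = "ASSUMPTIONS: add pagination to GET /users\nSTATUS: DONE\n...diff..."
print(parse_header(reply))
```

Whether the assumptions *match the task* is still a judgment call, but a missing or malformed header never gets that far.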
Input Narrowing for Verifiers
This is the pattern that took us longest to discover, and it's the one that matters most.
When you dispatch a verifier to check a generator's output, the verifier's brief must NOT include the generator's context — not the behavioral rules the writer expanded, not the codebase references it used, not the chain-of-reasoning it built up.
The verifier gets exactly:
- The output files being evaluated
- The requirements table
- The rejection-criteria format
Why? If the verifier sees the writer's rationalizations, it inherits them and rubber-stamps. The verifier needs to re-derive correctness from the artifact alone. Shared context is how generation and verification collapse into the same skill.
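One way to make the narrowing structural rather than a matter of discipline is to build the verifier's brief from a function that simply has no parameter for the writer's context. A sketch (the function and prompt wording are ours, not a fixed format):

```python
def verifier_brief(output_files: list[str], requirements_table: str) -> str:
    """Build a verifier prompt that deliberately omits the generator's context."""
    # Deliberately absent: behavioral rules, codebase references, and any
    # reasoning the writer produced. The verifier re-derives correctness
    # from the artifact alone.
    return (
        "Evaluate ONLY the files below against the requirements table.\n"
        f"Files: {', '.join(output_files)}\n"
        f"Requirements:\n{requirements_table}\n"
        "Report each finding as (severity, file:line, rule)."
    )
```

If someone later wants to leak writer context into the brief, they have to change the signature, which makes the leak visible in review.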
Model-Tier Separation
Related: when dispatching verifiers, use a different model tier than the generator. Writer on Opus? Verifier on Haiku or Sonnet. Writer on Sonnet? Verifier on Opus.
Different model checkpoints have different blind spots. It's a partial substitute for true task separation, and it's the only lever you have when both roles are LLMs.
Verify, Don't Trust
Agents are systematically optimistic. They report DONE even when files are stubs, methods don't compile, or wiring is missing. This isn't malice — it's the failure mode of LLMs working with isolated context.
After every phase, run four checks:
- Existence — git diff --stat shows the planned files actually changed
- Compile — smallest build task for the changed module
- Stub scan — grep for TODO, FIXME, NotImplementedError in production files
- Wiring spot-check — grep registration files for new bindings
Cost: ~30 seconds per phase. Value: catches failures before they compound into Phase 3 validation.
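The four checks can be expressed as one pure function, with the caller feeding in the raw evidence (`git diff --stat` output, the build result, file contents). A sketch; the parameter names are ours:

```python
import re

def phase_checks(planned, diff_stat, build_ok, prod_files, registrations, bindings):
    """'Verify, don't trust' as a pure function over gathered evidence."""
    failures = []
    # 1. Existence: every planned file must appear in the diff stat.
    failures += [f"not changed: {f}" for f in planned if f not in diff_stat]
    # 2. Compile: smallest build task for the changed module.
    if not build_ok:
        failures.append("build failed")
    # 3. Stub scan: placeholders left in production files.
    stub = re.compile(r"\bTODO\b|\bFIXME\b|NotImplementedError")
    failures += [f"stub in {p}" for p, text in prod_files.items() if stub.search(text)]
    # 4. Wiring spot-check: new bindings must show up in registration files.
    failures += [f"not wired: {b}" for b in bindings if b not in registrations]
    return failures
```

Any non-empty result means the phase's DONE claim is rejected before the next phase starts.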
Mechanical Pre-Filters
This is where we got the biggest bang for our buck. Before dispatching any LLM verifier, run project-specific greps on the changed files:
# Adapt these to YOUR stack's common mistakes:
grep -nE 'console\.log' src/ # debug logging
grep -nE '\bany\b' src/**/*.ts # TypeScript 'any'
grep -nE 'innerHTML|dangerouslySetInnerHTML' src/ # XSS vectors
grep -nE 'SELECT \*' src/ # unbounded queries
grep -rnE '\bTODO\b|\bFIXME\b' src/              # stubs

If any hit, dispatch a fix-agent without invoking the verifier. The pre-filter catches the class of failures where writer and verifier share the same blind spot — because grep doesn't rationalize.
Every time the system produces a bad result, ask: could a grep have caught this? If yes, add it to the pre-filters. Our list has grown steadily over time and it's one of the most effective parts of the system.
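Kept as data rather than a shell script, the growing filter list stays easy to extend. A Python sketch of the same pre-filter pass (patterns copied from the greps above; the dispatch decision is left to the caller):

```python
import re

# Project-specific pre-filters: (label, pattern). The list only grows;
# every failure a grep could have caught becomes a new entry.
PRE_FILTERS = [
    ("debug logging", re.compile(r"console\.log")),
    ("TypeScript any", re.compile(r"\bany\b")),
    ("XSS vector", re.compile(r"innerHTML|dangerouslySetInnerHTML")),
    ("unbounded query", re.compile(r"SELECT \*")),
    ("stub", re.compile(r"\bTODO\b|\bFIXME\b")),
]

def pre_filter(changed_files: dict[str, str]) -> list[str]:
    """Run every filter over the changed files' contents; return labelled hits."""
    return [
        f"{label}: {path}"
        for path, text in changed_files.items()
        for label, pattern in PRE_FILTERS
        if pattern.search(text)
    ]
```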
Convergence Gates
Without convergence tracking, validation loops can burn tokens forever. Each fix can create a new problem, so iteration 3 ends up with the same number of findings as iteration 1, just different ones.
We track this with fingerprinting:
fingerprint_N = sorted set of (severity, file:line, rule) tuples
PROGRESS means:
high_count strictly decreases, OR
(high_count unchanged AND medium_count strictly decreases)
If iteration >= 2 AND NOT PROGRESS → escalate to user

This catches the failure mode where fixes produce symptom-variant findings that look new but aren't actually reducing the problem. The gate forces escalation to a human who can break the cycle.
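The PROGRESS rule is small enough to implement directly. A Python sketch over the fingerprint tuples defined above:

```python
Finding = tuple[str, str, str]  # (severity, "file:line", rule)

def count(findings: set, severity: str) -> int:
    """Count findings at one severity level."""
    return sum(1 for s, _, _ in findings if s == severity)

def progressed(prev: set, curr: set) -> bool:
    """Highs strictly down, or highs flat and mediums strictly down."""
    if count(curr, "high") < count(prev, "high"):
        return True
    return (count(curr, "high") == count(prev, "high")
            and count(curr, "medium") < count(prev, "medium"))
```

Comparing the fingerprint sets themselves (not just the counts) is what exposes symptom-variant findings: same histogram, disjoint tuples.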
The BLOCKED Decision Tree
When an agent reports STATUS: BLOCKED, never silently retry. Diagnose first:
- Missing context? → Provide it, re-dispatch same agent
- Reasoning failure? → Re-dispatch with a more capable model
- Reactive loop? (fix A breaks B, fix B breaks A) → Stop, escalate. The agent isn't stuck on capability — it's stuck on a feedback shape it can't escape.
- Scope problem? → Split the phase into smaller slices
- Plan/spec error? → Stop, escalate. The plan is wrong, not the agent.
- Unknown? → Stop, escalate
Same inputs = same outputs. Never re-dispatch without changing something.
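The decision tree is deliberately boring, which means it can be code. A sketch where the diagnosis labels and action names are ours:

```python
from enum import Enum, auto

class Action(Enum):
    REDISPATCH_WITH_CONTEXT = auto()
    REDISPATCH_STRONGER_MODEL = auto()
    SPLIT_PHASE = auto()
    ESCALATE = auto()

def diagnose_blocked(cause: str) -> Action:
    """Map a diagnosed BLOCKED cause to the next move; unknowns escalate."""
    return {
        "missing_context": Action.REDISPATCH_WITH_CONTEXT,
        "reasoning_failure": Action.REDISPATCH_STRONGER_MODEL,
        "scope_too_big": Action.SPLIT_PHASE,
        # Reactive loops and plan/spec errors are human problems:
        "reactive_loop": Action.ESCALATE,
        "plan_error": Action.ESCALATE,
    }.get(cause, Action.ESCALATE)
```

Note that every re-dispatch path changes something (context, model, or scope), which enforces the "never re-dispatch with identical inputs" rule by construction.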
Handoff Notes: The Cross-Phase Channel
This solved a problem that was subtle but kept biting us. Phase 2a renames a type. Phase 2c doesn't know about the rename and creates a compile error.
The fix: a Handoff Notes column in the Execution Log. Each agent writes anything downstream phases need to know:
- Unexpected file coupling it worked around
- Naming decisions downstream must follow
- Assumptions that need later validation
- Files touched outside the plan
Each new agent reads all prior Handoff Notes before starting. Information travels through the document, not through the orchestrator's summary — because summaries lose critical details.
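Reading the notes back out is a small parse over the Execution Log table. A sketch, assuming the `## Execution Log` heading and the table layout shown earlier (the function name is ours):

```python
def handoff_notes(doc_text: str) -> list[str]:
    """Collect the Handoff Notes column (last cell) of the Execution Log table."""
    notes, in_log = [], False
    for line in doc_text.splitlines():
        if line.startswith("## "):
            in_log = line.strip() == "## Execution Log"
        elif in_log and line.startswith("|") and "---" not in line:
            cells = [c.strip() for c in line.strip("|").split("|")]
            if cells[0] != "Phase" and cells[-1]:  # skip header row, empty notes
                notes.append(cells[-1])
    return notes
```

The returned list is pasted verbatim into each new agent's brief, so nothing passes through an orchestrator summary on the way.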
Adapting to Your Stack
The patterns are stack-agnostic but the wiring checklist is stack-specific. Here's what it looks like for a few common setups:
Node.js / TypeScript
| W1 | Prisma model defined | prisma/schema.prisma | New entity |
| W2 | Migration generated | prisma migrate dev | After schema change |
| W3 | Route registered | src/routes/index.ts | New endpoint |
| W4 | Middleware applied | src/middleware/ | Auth/validation |
| W5 | Type exported | src/types/index.ts | New shared type |
| W6 | Page route added | src/app/router.tsx | New page |

Python / Django
| W1 | Model created | app/models.py | New entity |
| W2 | Migration generated | python manage.py makemigrations | After model change |
| W3 | Admin registered | app/admin.py | New model |
| W4 | URL pattern added | app/urls.py | New view |
| W5 | App in INSTALLED_APPS | settings.py | New app |

The specific wiring points differ but the pattern is identical: enumerate them all, check them mechanically, fail if any are missing.
Getting Started
Don't try to build both skills at once. Here's the order that worked for us:
Week 1: Start with the wiring checklist and a basic /plan-work. Create your wiring checklist (seriously, this is the most valuable thing). Write a feature plan template. Build a minimal /plan-work skill. Test it on a real feature and note what the plan gets wrong.
Week 2: Add verification. Add the placeholder scan. Add wiring checklist validation. Add your first project-specific pre-filter greps. Test on a bug fix and an improvement.
Week 3: Build the orchestrator. Write /develop with Phases 0-2. Add the "Verify, Don't Trust" checks. Test end-to-end on a small feature. Note which agent failures you hit.
Week 4: Add the validation loop. Add Phase 3 with convergence gates. Add model-tier separation. Add the BLOCKED decision tree. Add mid-flight resume.
Ongoing: tune from failures. Every bad result is a signal. Could a grep have caught it? Add a pre-filter. Was the plan vague? Add to anti-patterns. Was wiring missing? Add to the checklist. Did the verifier rubber-stamp? Check input narrowing.
The skills are never "done" — they evolve with your understanding of what goes wrong.
Summary
- Build the planner first. A solid plan prevents most agent coordination failures.
- The wiring checklist is your most valuable artifact. It prevents the most common failure mode mechanically.
- The document is the state. Not the conversation, not the orchestrator's memory — the file on disk.
- Verify, don't trust. Every agent claim gets checked before acceptance.
- Mechanical pre-filters before LLM verification. Grep catches what LLMs rationalize away.
- Input narrowing for verifiers. If the verifier sees the writer's reasoning, it rubber-stamps.
- Convergence gates prevent infinite loops. If findings aren't decreasing, escalate.
- Never re-dispatch with identical inputs. Diagnose first.
Hope it helps.