Or... A journey from naive enthusiasm to systematic enforcement in AI-assisted development
The Starting Point: Unbounded Optimism
I started working with Roo Code in February 2025 after encountering Reuven Cohen showing amazing automation with multiple customized modes. This was, of course, after trying GitHub Copilot and its typeahead-style support. I immediately recognized the potential and started developing my own modes, knowing that I couldn't just let a coder, architect, and debugger loose on projects. After 30 years in IT, I recognize the value of guardrails and gates for producing quality with velocity. I struggled at first and never really got an effective flow that could compete with what I saw Reuven turning out. Then I got a job again, got deep into making things happen, and my pace slowed.
Then Claude Code was born... I saw Reuven speaking about saving thousands of dollars monthly thanks to the (at the time) unlimited plan. I saw him and his group, along with surely many other innovators, burn so many tokens that Anthropic had to pull back. But I still didn't have time to try it out.
Finally... I had a need again and some spare time. So I started working with Claude Code in September 2025 with the kind of optimism that comes from not knowing what you're getting into: only one agent to talk to, but plenty of stories from LinkedIn and lots of expectations. Surely Anthropic had learned many of the same lessons Reuven and his peers were teaching and making freely available. The models had evolved rapidly. Everyone was talking about how awesome it was. It should be a solid evolution beyond the Roo Code experience, right?
The goal seemed straightforward: build an AI assistant that could process documents locally, search them semantically, and provide intelligent answers. Privacy-first, no cloud dependencies, pure local execution on Apple Silicon.
Claude was amazing. We built out 238 Python files across 10 production modules. We wrote 773 passing tests with comprehensive coverage. We created a complete FastAPI backend with a React frontend. In 40 days, we went from an empty repository to a fully functional AI assistant with a web UI, particularly gaining speed once trust and excitement led me to invest beyond the $20 monthly plan into the $200 plan so I didn't have to wait as long between steps.
The code worked. The tests passed. The architecture was modular and elegant. Success, right?
Not quite.
First Wake-Up Call: The Post-Mortem
When I sat down to analyze what we’d actually built, the problems became visible. Not code quality problems - Claude writes pretty solid code. The problems were in the process.
Discovery #1: Reviews Were Happening Backward
Just like in my Roo Code experiments, I’d implemented “agent-based governance” - having AI review the work to ensure quality. Assessor Agent checked strategic alignment. QA Agent verified technical quality. Tattle-Tale critiqued both reviews. It felt thorough.
But then I noticed timestamps in the review files:
"PROCESS DEVIATION NOTED: This QA assessment was conducted AFTER task completion instead of during development as required by CLAUDE.md governance workflow."
We were reviewing after implementing. Like a restaurant health inspector showing up after the meal is served. The reviews caught issues, but we’d already built the wrong thing.
Expected impact: 40% reduction in rework if reviews happened before coding.
Discovery #2: Scope Creep Was Massive
The data standardization task was supposed to be “basic data structures and serialization.” What we delivered was a complete data model ecosystem with 11 models. That’s 300%+ scope expansion.
The integration health module was supposed to track mock debt. We ended up building a comprehensive health dashboard with multiple metrics, historical trending, and automated migration planning.
When AI says “while I’m here, let me also add…” it sounds helpful. But those helpful additions compound. A 2-file task becomes a 7-file refactor. A 3-hour feature becomes a full-day implementation.
Expected impact: 60-90 minutes saved per task by catching overruns at 50% instead of 100%.
Discovery #3: Context Was Drowning in Noise
Starting a new session required reading 485 lines across multiple session files. It took 10+ minutes just to understand where we left off. Quick-resume didn’t exist yet. Historical context had no structure.
Finding information meant grepping through text files for 30-60 seconds, getting 200 noisy matches, and manually filtering for relevance.
Expected impact: 80% faster session startup (2 minutes vs 10 minutes).
Discovery #4: The Token Burn Was Unsustainable
This was the most shocking discovery. I did a detailed analysis of token usage in the deployed workflow:
- Agent reviews: 91KB per work unit (22,000 tokens)
- Work unit with embedded reviews: 9.8KB (2,500 tokens)
- Reading reviews in session: ~5,000 tokens
- Re-review on minor revision: ~15,000 tokens
Total: ~44,500 tokens per 2-3 hour work unit.
Compared to the old workflow with no formal reviews, V2 was burning 9x more tokens. At scale, this would cost roughly $624 annually.
The reviews themselves were enormous. One assessor review was 1,319 lines. It quoted extensively from the work unit, reproduced code snippets, provided multi-paragraph analyses, and included evidence sections. The QA agent wrote 425-line reports for a simple work unit.
Root cause: The agent templates said “assess thoroughly” so agents interpreted that as “write everything you can think of.”
The First Major Revision: V2.0 Workflow
With these lessons in hand, I completely redesigned the workflow. The core insight: voluntary discipline doesn’t work with AI. You need external enforcement.
Change #1: Git Hooks Enforce Reviews
The most critical change was simple: make the pre-commit hook block commits unless reviews exist.
# Can't forget this because git blocks you
vision_review=$(find .claude/agent-reviews/ -name "vision-*.md" -newer .git/refs/heads/main)
scope_review=$(find .claude/agent-reviews/ -name "scope-*.md" -newer .git/refs/heads/main)
# ... checks for all required reviews

if [ -z "$vision_review" ]; then
    echo "❌ Missing Vision Agent review"
    exit 1
fi
Before this, I would sometimes skip reviews. “It’s just a small change.” “I know what I’m doing.” “Reviews take too long.” Every time, I regretted it later.
After this, skipping reviews became impossible. The commit simply blocks. You can’t forget what’s externally enforced.
Result: 60% compliance → 100% compliance.
Change #2: Work Unit Discipline with Scope Tracking
Every task became a “work unit” with explicit boundaries:
## Objective
Add email validation to user registration endpoint with tests
## Success Criteria
- [ ] Email regex validation (RFC 5322 compliant)
- [ ] Returns 400 with clear error message for invalid emails
- [ ] Unit tests cover valid/invalid/edge cases
## Expected Scope
- Files: [api/routes/users.py, tests/test_users.py]
- File count: 2
## Validation Command
pytest tests/test_users.py::TestEmailValidation -v
The expected file count became the key metric. The pre-commit hook compares actual files changed to expected:
⚠️ SCOPE ALERT: 67% file count variance
Expected files: 3
Actual files: 5
Options:
1. Unstage extra files (split into new work unit)
2. Update expected scope (justify expansion)
3. Proceed anyway (document why)
You must consciously decide. You can’t accidentally expand scope.
Result: Scope creep becomes visible in real-time, not discovered weeks later.
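To make the mechanics concrete, here is a minimal sketch of the kind of comparison the hook performs. The script itself, the work unit path, and the "File count:" parsing are assumptions based on the template above, not the exact production hook.

#!/usr/bin/env python3
"""Sketch of a scope-variance check a pre-commit hook could call.

Assumes a work unit file with a 'File count: N' line as in the template
above; the path and parsing are illustrative, not the author's exact code.
"""
import re
import subprocess
import sys

WORK_UNIT = ".claude/current_work_unit.md"  # assumed location

def expected_file_count(path: str = WORK_UNIT) -> int:
    # Pull the declared file count out of the work unit spec
    text = open(path, encoding="utf-8").read()
    match = re.search(r"File count:\s*(\d+)", text)
    return int(match.group(1)) if match else 0

def staged_file_count() -> int:
    # Count files currently staged for commit
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return len([line for line in out.stdout.splitlines() if line.strip()])

def main() -> int:
    expected = expected_file_count()
    actual = staged_file_count()
    if expected == 0:
        return 0  # nothing declared; skip the check
    variance = round((actual - expected) * 100 / expected)
    if variance > 50:
        print(f"⚠️ SCOPE ALERT: {variance}% file count variance")
        print(f"Expected files: {expected}\nActual files: {actual}")
        return 1  # the hook treats non-zero as "stop and decide"
    return 0

if __name__ == "__main__":
    sys.exit(main())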
Change #3: Quick-Resume for Fast Session Startup
Instead of reading hundreds of lines of session logs, quick-resume provides a one-page summary:
## Current Work Unit
Objective: Add POST /api/auth/login endpoint
Expected scope: 3 files (route, schema, test)
## Agent Review Status
- Vision: APPROVED
- Scope: APPROVED (with minor scope reduction suggestion)
- Design: APPROVED
## Last Completed
[Unit Complete] Add user model with password hashing (2025-10-08)
## Next Actions
1. Implement login endpoint per work unit spec
2. Stay within 3-file boundary
3. Run validation command before committing
The git hook regenerates this on every commit. Session startup dropped from 10 minutes to 2-3 minutes.
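For illustration, here is a sketch of how a hook could assemble that one-page summary. The source files and section extraction are assumptions based on the artifacts named in this post, not the production script.

#!/usr/bin/env python3
"""Illustrative sketch: regenerate quick-resume.md after each commit.
Paths and source files are assumptions drawn from this post."""
from datetime import date
from pathlib import Path

SESSIONS = Path(".claude/sessions")

def first_lines(path: Path, n: int = 10) -> str:
    # Grab the top of a file as a compact summary (or a placeholder)
    if not path.exists():
        return "(none)"
    return "\n".join(path.read_text(encoding="utf-8").splitlines()[:n])

def main() -> None:
    work_unit = first_lines(Path(".claude/current_work_unit.md"))
    reviews = first_lines(Path(".claude/agent-reviews/summary.md"))  # hypothetical summary file
    resume = (
        f"# Quick Resume ({date.today()})\n\n"
        f"## Current Work Unit\n{work_unit}\n\n"
        f"## Agent Review Status\n{reviews}\n\n"
        "## Next Actions\n"
        "1. Implement per work unit spec\n"
        "2. Stay within the declared file boundary\n"
        "3. Run the validation command before committing\n"
    )
    SESSIONS.mkdir(parents=True, exist_ok=True)
    (SESSIONS / "quick-resume.md").write_text(resume, encoding="utf-8")

if __name__ == "__main__":
    main()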
Change #4: SQLite History for Fast Context Retrieval
Instead of grepping through text files:
# Find work units related to authentication (1-2 seconds)
sqlite3 .claude/memory.db "SELECT * FROM v_work_unit_summary
WHERE objective LIKE '%auth%'
ORDER BY created_at DESC"
# Check agent compliance rate across all work
sqlite3 .claude/memory.db "SELECT * FROM v_agent_compliance"
# Analyze scope creep patterns
sqlite3 .claude/memory.db "SELECT * FROM v_scope_creep
WHERE alert_level = 'red'"
Result: 30-60 seconds grep → 1-2 seconds SQL query with better results.
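The post doesn't show the schema behind memory.db, so here is a minimal sketch of tables and views that would support the queries above. Only the view names come from the queries; the table and column names are assumptions.

#!/usr/bin/env python3
"""Minimal sketch of a memory.db schema supporting the queries above.
Table and column names are assumptions; view names come from this post."""
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS work_units (
    id INTEGER PRIMARY KEY,
    objective TEXT NOT NULL,
    expected_files INTEGER,
    actual_files INTEGER,
    created_at TEXT DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS agent_reviews (
    id INTEGER PRIMARY KEY,
    work_unit_id INTEGER REFERENCES work_units(id),
    agent TEXT NOT NULL,          -- e.g. 'vision', 'scope', 'simplicity'
    verdict TEXT NOT NULL,        -- APPROVE / REVISE / REJECT
    reviewed_before_impl INTEGER  -- 1 if the review preceded implementation
);

CREATE VIEW IF NOT EXISTS v_work_unit_summary AS
SELECT id, objective, expected_files, actual_files, created_at
FROM work_units;

CREATE VIEW IF NOT EXISTS v_agent_compliance AS
SELECT agent,
       AVG(reviewed_before_impl) * 100.0 AS pct_reviewed_before_impl
FROM agent_reviews GROUP BY agent;

CREATE VIEW IF NOT EXISTS v_scope_creep AS
SELECT id, objective,
       (actual_files - expected_files) * 100.0 / expected_files AS variance_pct,
       CASE WHEN actual_files > expected_files * 1.5 THEN 'red'
            WHEN actual_files > expected_files       THEN 'yellow'
            ELSE 'green' END AS alert_level
FROM work_units WHERE expected_files > 0;
"""

with sqlite3.connect(".claude/memory.db") as conn:
    conn.executescript(DDL)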
Second Wake-Up Call: Token Economics
The V2.0 workflow solved process problems but created a new one: cost. Those massive agent review files were burning tokens at an unsustainable rate.
I did another analysis and found the issues:
- Agent verbosity was unbounded - “Assess thoroughly” meant 1,300 lines instead of 50
- Reviews duplicated into work units - Embedding summaries bloated artifacts
- No review lifecycle - Old reviews accumulated forever
- Re-reviews were expensive - Minor scope adjustments triggered full re-analysis
The solution came in phases:
Phase 1: Constrain Agent Length
Updated every agent template with hard limits:
## Output Format Requirements
**CRITICAL: Keep response under 50 lines total**
Provide:
- **Alignment**: ALIGNED / MISALIGNED (2-3 sentences max)
- **Scope**: APPROPRIATE / TOO LARGE / TOO SMALL (2-3 sentences max)
- **Risk Level**: LOW / MEDIUM / HIGH (3-5 bullet points max)
- **Recommendation**: APPROVE / REVISE / REJECT (1 sentence)
DO NOT:
- Quote extensively from work unit (reference line numbers instead)
- Reproduce code snippets (describe issues instead)
- Provide multi-paragraph analysis (use bullets)
Phase 2: Seven Specialized Agents
Instead of three generalist agents (Assessor, QA, Tattle-Tale), I created seven specialists:
- Vision Alignment - Right problem?
- Scope Control - Right size?
- Design Effectiveness - Right patterns?
- Code Simplicity - Simplest approach?
- Testing Strategy - Adequate tests?
- Validation - Testable criteria?
- Tattle-Tale - Critique all six reviews
Each agent gets 50 lines to focus on ONE concern. The Tattle-Tale gets 80 lines to review all six.
Math: 6 specialists × 50 lines + 1 critic × 80 lines = 380 lines total
Comparison: 380 lines vs 500-900 lines (58% reduction)
But here’s what I didn’t expect: Specialization improved coverage.
- Design effectiveness: 20% → 80-85% (+300%)
- Code simplicity: 10% → 80-85% (+700%)
- Vision/scope maintained: 60-70% → 80-85%
When an agent only has to think about ONE thing for 50 lines, it thinks deeper about that thing than when it tries to cover everything in 300 lines.
Result: Better reviews, lower cost.
Phase 3: Review Lifecycle Management
Added automatic archival:
# After work unit completion, move reviews older than a week into a dated archive
archive_dir=".claude/agent-reviews/completed/$(date +%Y-%m)"
mkdir -p "$archive_dir"
find .claude/agent-reviews -maxdepth 1 -name "*.md" -mtime +7 \
    -exec mv {} "$archive_dir/" \;
Result: 80% reduction in loaded context per session.
Third Wake-Up Call: Reviews Weren’t Changing Behavior
This was the most humbling realization. I’d built this elaborate review system. Git hooks enforced the reviews. Seven specialized agents provided deep analysis. Reviews were happening before implementation.
But then I actually read what the agents were finding:
Vision Agent: "CRITICAL - This work unit introduces tight coupling between the embedder and search modules that will require significant refactoring later."
Scope Agent: "P0 ISSUE - Expected 3 files but success criteria imply changes to 5+ files. Scope boundaries unclear."
Simplicity Agent: "CRITICAL - Using random embeddings for testing will not catch real integration failures. This is a testing anti-pattern."
These weren’t minor suggestions. These were critical issues flagged before implementation. Issues that would have saved hours of debugging if addressed.
And what did I do? Committed the work unit as-is and implemented anyway.
The reviews were documentation, not enforcement. Like warning signs on a road - visible but not binding. I’d read them, think “yeah, I should probably address that,” and then… forget about it the moment I started coding.
The problem: Reviews were advisory, not actionable.
The Final Evolution: V2.1 Behavioral Improvements
This led to the final major change: making reviews actually change behavior. Not through voluntary discipline (which clearly doesn’t work), but through external enforcement at two critical checkpoints.
Checkpoint 1: Structured Response to P0 Findings
After the seven-agent review sequence, a new script runs:
python3 .claude/scripts/generate_response_template.py
This parses all seven reviews for P0/CRITICAL findings and generates response.md:
# Response to P0 Findings
## P0 #1: Scope Control Agent
**Finding**: Expected 3 files but success criteria imply 5+ files
→ ACTION: [What you did to fix it]
→ JUSTIFY: [Why proceeding anyway]
## P0 #2: Simplicity Agent
**Finding**: Random embeddings won't catch real integration failures
→ ACTION: [What you did to fix it]
→ JUSTIFY: [Why proceeding anyway]
The pre-commit hook then blocks the commit unless:
- response.md exists
- Every P0 has either ACTION or JUSTIFY filled in (not placeholders)
- You’ve consciously addressed every critical issue
You can still proceed with the issues, but you must consciously decide and document why.
Result: P0 findings no longer get ignored by accident.
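Here is a sketch of what such a parser might look like. The directory layout, the P0/CRITICAL line format, and the response.md output path are assumptions based on the review excerpts quoted above, not the actual generate_response_template.py.

#!/usr/bin/env python3
"""Sketch of a P0-finding parser that emits response.md.
Paths and finding format are assumptions based on this post."""
import re
from pathlib import Path

REVIEW_DIR = Path(".claude/agent-reviews")
P0_PATTERN = re.compile(r"(P0|CRITICAL)[^\n]*", re.IGNORECASE)

def collect_findings() -> list[tuple[str, str]]:
    # Scan every review file for lines flagged P0 or CRITICAL
    findings = []
    for review in sorted(REVIEW_DIR.glob("*.md")):
        for match in P0_PATTERN.finditer(review.read_text(encoding="utf-8")):
            findings.append((review.stem, match.group(0).strip()))
    return findings

def main() -> None:
    lines = ["# Response to P0 Findings", ""]
    for i, (agent, finding) in enumerate(collect_findings(), start=1):
        lines += [
            f"## P0 #{i}: {agent}",
            f"**Finding**: {finding}",
            "→ ACTION: [What you did to fix it]",
            "→ JUSTIFY: [Why proceeding anyway]",
            "",
        ]
    # Assumed output location; the hook checks this file before allowing a commit
    Path(".claude/response.md").write_text("\n".join(lines), encoding="utf-8")

if __name__ == "__main__":
    main()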
Checkpoint 2: Implementation Compliance Gate
After implementation, before the commit completes, three heuristics run:
Heuristic 1: Workaround Detection
# Check for TODO, FIXME, HACK, WORKAROUND in staged code
if git diff --cached | grep -E "(TODO|FIXME|HACK|WORKAROUND)[: ]" > /dev/null; then
    echo "⚠️ Workaround indicators detected"
    run_compliance_check=true
fi
Heuristic 2: Scope Variance
# Check if file count exceeded expected by >50%
expected_files=3   # in the real hook: parsed from the work unit's "File count:" line
actual_files=5     # in the real hook: $(git diff --cached --name-only | wc -l)
variance=$(( (actual_files - expected_files) * 100 / expected_files ))
if [ $variance -gt 50 ]; then
    run_compliance_check=true
fi
Heuristic 3: Missing Tests
# Check if code changed but no test changes
code_changes=$(git diff --cached --name-only src/ | wc -l)
test_changes=$(git diff --cached --name-only tests/ | wc -l)
if [ $code_changes -gt 0 ] && [ $test_changes -eq 0 ]; then
    run_compliance_check=true
fi
If any heuristic flags, the Implementation Compliance Agent runs:
# Analyzes via Claude API:
# - Are workarounds justified or technical debt?
# - Is scope expansion necessary or creep?
# - Are tests missing or adequate for changes?
#
# Returns: COMPLIANT / CONCERNS / NON-COMPLIANT
The analysis is saved to an audit trail. If NON-COMPLIANT, you’re prompted:
⚠️ Implementation flagged as NON-COMPLIANT
Issues:
- TODO comments indicate incomplete implementation
- Scope expanded 67% without justification
- No test coverage for new validation logic
Proceed anyway? [y/N]
Again, you can proceed. But you must consciously decide.
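For illustration, here is a rough sketch of how such a compliance check could call the Claude API from the hook. The model id, prompt wording, and verdict handling are placeholders; the author's actual agent script is not shown in this post.

#!/usr/bin/env python3
"""Rough sketch of an implementation-compliance check via the Claude API.
Model id, prompt, and verdict parsing are placeholders, not the real script."""
import subprocess
from anthropic import Anthropic  # pip install anthropic; ANTHROPIC_API_KEY in env

MODEL = "claude-sonnet-4-20250514"  # placeholder model id

def staged_diff() -> str:
    # The staged changes are what the agent evaluates
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout

def main() -> None:
    prompt = (
        "Review this staged diff against the current work unit. "
        "Are workarounds justified or technical debt? Is scope expansion "
        "necessary or creep? Are tests adequate for the changes? "
        "Answer with one of COMPLIANT, CONCERNS, NON-COMPLIANT, then a short rationale.\n\n"
        + staged_diff()
    )
    client = Anthropic()
    response = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text
    print(verdict)  # the hook would append this to the audit trail and prompt the user

if __name__ == "__main__":
    main()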
Token Economics:
- 90% of commits: 0 tokens (heuristics pass)
- 10% of commits: ~2,000 tokens (agent runs)
- Average: ~200 tokens per commit
That’s 4% overhead per work unit for early detection of problems that would take hours to debug later.
The Instruction Refresh Pattern: Accepting Claude’s Limitations
There was one more problem that took me too long to recognize: Claude forgets.
Not in a “context window” sense - the instructions were technically in context. But in practice, after a long conversation, Claude would:
- Skip work units and start implementing directly
- Forget to check expected file counts
- Not ask for reviews before starting
- Ignore established patterns
I kept updating CLAUDE.md with “NEVER skip reviews” and “ALWAYS create work units first” and increasingly emphatic language. It didn’t help.
The breakthrough was accepting reality: voluntary memory doesn’t work for AI any more than voluntary discipline works for humans.
The solution uses git commits as natural checkpoints:
After every commit, the pre-commit hook generates a fresh reminder file (.claude/sessions/commit-reminder.md) and displays:
✅ Pre-commit checks complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📋 NEXT: Prompt Claude to refresh instructions
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
After this commit completes, ask Claude:
"Refresh workflow instructions"
I prompt “Refresh workflow instructions” and Claude immediately reads:
- commit-reminder.md (fresh workflow summary)
- current_work_unit.md (current boundaries)
- quick-resume.md (session context)
Claude re-orients. The process continues.
This works because it’s honest about limitations and uses external checkpoints rather than relying on sustained attention.
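For completeness, a small sketch of the reminder step itself, under the assumption that it is just a file write plus a console banner; only the commit-reminder.md path comes from this post, the contents are illustrative.

#!/usr/bin/env python3
"""Illustrative sketch of the reminder step a pre-commit hook could run.
Only the commit-reminder.md path comes from this post."""
from pathlib import Path

REMINDER = Path(".claude/sessions/commit-reminder.md")

def main() -> None:
    REMINDER.parent.mkdir(parents=True, exist_ok=True)
    REMINDER.write_text(
        "# Workflow Refresh\n\n"
        "- Work in bounded work units; check the expected file count\n"
        "- Run the seven-agent review sequence before implementing\n"
        "- Address every P0 finding in response.md before committing\n",
        encoding="utf-8",
    )
    print("✅ Pre-commit checks complete")
    print("📋 NEXT: Prompt Claude to 'Refresh workflow instructions'")

if __name__ == "__main__":
    main()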
Lessons Learned: What Actually Works
After 40+ days on the AI Assistant project and several iterations of workflow improvement, here’s what I’ve learned about AI-assisted development:
1. AI Amplifies Your Process (Good or Bad)
If your process is “start coding and figure it out as you go,” AI will help you code yourself into corners faster than you could alone.
If your process is “define boundaries, validate approach, implement within constraints,” AI will help you build high-quality software faster than traditional development.
AI doesn’t add discipline. It requires discipline to be effective.
2. External Enforcement > Voluntary Discipline
I tried voluntary discipline first:
- “Remember to create work units” (forgot 40% of the time)
- “Keep work units small” (scope creep in 50%+ of cases)
- “Get reviews before implementation” (reviews happened after in 60% of cases)
External enforcement via git hooks:
- Can’t commit without work unit
- Can’t commit without reviews
- Can’t ignore P0 findings
- Can’t miss scope variance
Result: 100% compliance on critical workflows.
3. Specialization > Generalization for AI Reviews
Three generalist agents trying to cover everything:
- Design effectiveness: 20% coverage
- Code simplicity: 10% coverage
- 500-900 lines per work unit
Seven specialist agents each focused on one concern:
- Design effectiveness: 80-85% coverage
- Code simplicity: 80-85% coverage
- 380 lines per work unit (58% reduction)
Counterintuitive but validated by data: More agents with narrower focus = better analysis at lower cost.
4. Checkpoints > Continuous Monitoring
I tried having Claude “stay aware” of boundaries throughout implementation. This doesn’t work. Attention drifts during coding.
What works: Checkpoints where you must stop and evaluate:
- Before starting: Review sequence (10-15 min)
- After work unit: P0 response requirement (commit blocks)
- After implementation: Compliance heuristics (commit checks)
These forced pauses prevent “flow state” from becoming “scope creep state.”
5. Make the Invisible Visible
Scope creep is invisible until you measure it:
- Expected: 3 files
- Actual: 5 files
- Variance: 67% expansion
Token burn is invisible until you analyze it:
- Per work unit: 44,500 tokens
- Per week: ~$13.35
- Per year: ~$624
Once visible, problems become addressable. Before that, you’re optimizing blind.
6. Accept AI’s Limitations
Claude, Codex, and Roo Code all forget instructions during long conversations. This is a constraint, not a failure.
Trying to fix this with more emphatic instructions doesn’t work. Building around it with external checkpoints and refresh patterns does work.
The instruction refresh pattern after every commit takes 30 seconds and maintains alignment throughout the session.
7. Speed Isn’t the Goal, Sustainable Pace Is
AI can write code very fast. This is dangerous.
Without guardrails, fast code becomes:
- 8-hour coding marathons
- Scope creep every session
- Technical debt accumulation
- Eventual burnout and massive cleanup
With guardrails:
- 1-4 hours per work unit (natural stopping points)
- Sustainable velocity over weeks
- Technical debt caught early
- Code you can maintain
Current State: V2.1 Behavioral in Production
As of October 2025, the V2.1 workflow with behavioral improvements is deployed and working. The metrics tell the story:
Problems Solved
| Problem | Before | After |
| --- | --- | --- |
| Agent reviews forgotten | 60% compliance | 100% (git enforced) |
| Scope creep invisible | 50%+ expansion | Real-time alerts |
| Session startup slow | 10 minutes | 2-3 minutes |
| Context retrieval slow | 30-60 sec grep | 1-2 sec SQL |
| P0 findings ignored | Often | Impossible (commit blocks) |
| Implementation debt | Invisible | Detected by heuristics |
| Token burn | 44,500/unit | ~14,600/unit (67% reduction) |
What This Enables
Quality: 100% review compliance with systematic quality gates at multiple checkpoints
Velocity: Fast session startup, fast context retrieval, early issue detection preventing hours of rework
Sustainability: Small work units (1-5 files), clear boundaries, sustainable pace (1-2 units/day), queryable history
Cost: 67% reduction in token usage while improving review depth and coverage
The Philosophical Shift
When I started this journey, I thought AI-assisted development was about leveraging AI capability.
I now understand it’s about constraining AI behavior.
AI wants to help. This sounds good until you realize “help” means:
- “While I’m here, let me also add…”
- “This would be better if we also handled…”
- “I notice we could improve…”
Unconstrained helpfulness becomes scope creep. Well-intentioned additions become technical debt. Fast coding becomes unsustainable pace.
The guardrails I built aren’t about limiting AI. They’re about channeling AI capability into sustainable, high-quality development.
For Other Non-Developers Who Want to Build: What to Take From This
You don’t need to adopt my entire workflow. But if you’re doing AI-assisted development, consider:
1. Define Boundaries Explicitly
Not “improve the user system” but:
Objective: Add email validation to registration endpoint
Files: [api/users.py, tests/test_users.py]
Count: 2
Duration: 2-3 hours
Small, bounded, measurable.
2. Enforce Critical Rules Externally
Don’t rely on AI (or yourself) to remember. Make the tools enforce what matters:
- Git hooks that block bad commits
- Scripts that validate before proceeding
- Automated checks at key decision points
If it's important, make it impossible to forget. And most importantly, ask the AI to think like the relevant role when defining those assessment prompts. None of us are gurus at everything, but asking questions to learn what's missing works, and asking "stupid" questions helps you learn. Once you learn that you "know that you do not know," ask the AI to build it. Then have the Tattle-Tale review the work to steer the AI back toward balanced results instead of accepting its emphatic, excited, least-cost response (least cost being Anthropic's or OpenAI's concern, not yours).
3. Build in Checkpoints
Don’t try to “stay aware” throughout long coding sessions. Build forced pauses:
- Before starting (validate approach)
- During work (scope variance alerts)
- Before committing (compliance checks)
These breaks prevent flow from becoming drift.
4. Make Costs Visible
Track token usage. Measure scope variance. Log time spent. Analyze patterns.
You can’t optimize what you don’t measure.
5. Accept AI’s Limitations
Claude forgets instructions. ChatGPT expands scope. GitHub Copilot suggests patterns inconsistent with your architecture.
This isn’t failure - it’s the nature of the tools. Build your process to work with these limitations, not fight them.
Final Thoughts
The goal isn’t for everyone to use my exact workflow. The goal is to share what I learned about the distance between “AI can write code” and “AI can help me build sustainable software.”
Building software with AI assistance is incredibly powerful. Claude helped me create a 238-file, production-ready AI assistant in 40 days. That’s genuinely impressive.
But it didn’t happen through unbounded AI capability. It happened through systematic constraints:
- Work units keep tasks small
- Git hooks enforce process
- Specialized agents review deeply
- Scope tracking prevents expansion
- Checkpoints maintain alignment
- External enforcement prevents forgetting
The irony is that the most powerful use of AI requires the most rigorous guardrails.
AI isn’t a miracle worker, but it can amplify your efforts when guided thoughtfully. The question isn’t whether to use AI - it’s how to channel its capability into sustainable development.
The workflow I built is my answer to that question. It’s not perfect, but it’s honest about AI’s limitations and designed to work with them rather than against them.
Maybe some of these lessons will help you build your own answer.
Steve Genders writes about risk management, software engineering, and the intersection of human judgment and AI assistance at riskjuggler.info. The complete workflow infrastructure described in this post is available at github.com/YourRepo/improvingclaudeworkflow-v2.