aka The Embarrassing Git Diff
Six months ago, I exported the chat history from my first attempt at the standard AI chat app project, built with Roo Code, to see what I'd actually produced. The merged transcript was 15,000 lines long. The working code? About 2,000 lines. The ratio bothered me.
But what really made me wince was reading my own prompts.
"Got an error. Please investigate and repair."
"Please get re-oriented."
"How do I get Python to be my default interpreter for VS Code on MacOS?"
I wasn't engineering with AI. I was having a conversation with a very patient tutor who happened to write code. And while I got a working prototype, the process took 40 days of back-and-forth. I'd built something, but I hadn't learned how to build with AI efficiently.
Fast forward to last week. I built a production-ready web chat application—React frontend, Express backend, LM Studio integration, 177 tests, E2E coverage, complete documentation—in 2.55 hours of actual agent work time. Not 40 days. Not 40 hours. 2.55 hours.
What changed wasn't the AI. What changed was me.
The Tale of Two Projects: Same Goal, Different Worlds
Project A: The 40-Day Prototype (March 2025)
What I asked for: A basic AI chat agent. Everyone's first AI Engineering project.
My prompts looked like this:
- "Update the design document for the OpenAI chat app to enable the user to be presented with the OpenAI model options..."
- "Enhance the setup environment script to handle errors and exceptions."
- "I believe I have my python3 setup now. Please verify the environment script uses the python3 command."
- "Got an error. Please investigate and repair. I also just installed git."
What actually happened:
- Constant context loss ("Please get re-oriented. I want to return to working on the diagram building agent...")
- Iterative debugging cycles (error → investigation → repair → new error)
- Tool configuration questions ("How do I get Python to be my default interpreter?")
- Design document reviews to check if code matched requirements
Result: Working prototype, 773 tests passing, but 44,500 tokens per work session just to remind the AI what we were building.
Time cost: 40 days of conversations. Unknown actual implementation hours (buried in chat history).
My role: Project manager, debugger, environment troubleshooter, requirements clarifier, documentation writer, integration tester, and occasional coder.
Project B: The 2.5-Hour Production App (2025)
What I asked for: A localhost web chat application with React, Express, and my llm_caller_cli wired in to leverage its LM Studio integration option.
My prompts looked like this:
- Initial: "Analyze this codebase and create a CLAUDE.md file for the web chat application project. Set up the V2.6 workflow."
- Strategic: "Where are the other tasks you can run in parallel?"
- Quality control: "Have the agents review the plans"
- Correction: "Did you subtract out the time you were waiting on me?"
What actually happened:
- AI agent created 25 work units, each with 7-agent reviews (Vision, Scope, Design, Simplicity, Testing, Validation, Tattle-Tale)
- Define-and-deploy agents autonomously executed: definition → plan review → implementation → output review → commit
- Parallel execution: an orchestrator workflow delivered 10 work units simultaneously in Sprint 2 (2 independent tracks)
- Post-commit automation: Test results, status updates, documentation all auto-generated
Result: Production-ready application with zero P0 blockers after remediation, 177 tests passing, E2E coverage, deployment documentation.
Time cost: 2.55 hours of agent work across 25 work units. Total elapsed: 22.24 hours (including sleep and my review time).
My role: Vision setter (5 minutes), strategic director (5 minutes), quality validator (5 minutes).
Total hands-on time: ~15 minutes.
The Five Levels of AI Engineering (I Went from 1 to 4)
Looking back, I can now see distinct maturity levels in how people work with AI:
Level 1: The Tutorial Student
Prompt pattern: "How do I...?"
AI role: Teacher
Result: You learn, but slowly
Example: "How do I get Python to be my default interpreter for VS Code on MacOS?"
Level 2: The Interactive Debugger
Prompt pattern: "Got an error. Please investigate."
AI role: Debugging partner
Result: Code works eventually
Example: "Please investigate and repair. I also just installed git."
I was here in Project A.
Level 3: The Requirements Manager
Prompt pattern: "Implement X with Y constraints."
AI role: Developer following your spec
Result: Features get built
Example: "Update the design document for the openAI chat app to enable the user to be presented with the OpenAI model options..."
I was still mostly here in Project A.
Level 4: The System Architect
Prompt pattern: "Here's the vision. Execute with these quality gates."
AI role: Autonomous delivery team
Result: Systems get built
Example: "Set up the V2.6 workflow with sprint orchestration and define-and-deploy agents."
I reached here in Project B.
Level 5: The Meta-Engineer
Prompt pattern: "Optimize the workflow itself."
AI role: Self-improving system
Result: Processes get better
Example: I'm not here yet, but the path is clear.
What Actually Changed: The Six Realizations
1. Workflows Matter More Than Prompts
Old approach: Better prompts = better results
New approach: Better process = better results
The V2.6 workflow I used in Project B isn't magic. It's just:
- Structured work units (small, clear deliverables that don't expect too much of Claude)
- Quality gates (7 AI agents review plans before implementation and code after)
- Automated tracking (git hooks update status after every commit)
- Parallel execution (independent work units run simultaneously)
But that structure eliminated 93% of my context-loading overhead. Instead of re-explaining the project every session (44,500 tokens), I read an 80-line status.json file (4,000 tokens).
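To make that concrete, here is a minimal sketch of the kind of compact status file a session could load instead of a full transcript. The field names are hypothetical, not the actual V2.6 schema:

```typescript
// Hypothetical shape of a compact status.json; field names are illustrative,
// not the actual V2.6 schema.
interface ProjectStatus {
  project: string;
  sprint: number;
  workUnits: {
    id: string;                               // e.g. "WU-014"
    title: string;
    state: "pending" | "in_progress" | "done" | "blocked";
    estimateMinutes: number;
    actualMinutes?: number;
    openIssues: { p0: number; p1: number; p2: number };
  }[];
  lastCommit: string;                         // short SHA written by a post-commit hook
  updatedAt: string;                          // ISO timestamp
}

// A new session reads this instead of replaying the whole chat history.
import { readFileSync } from "node:fs";

const status: ProjectStatus = JSON.parse(readFileSync("status.json", "utf8"));
const remaining = status.workUnits.filter(wu => wu.state !== "done");
console.log(`${remaining.length} work units remaining in sprint ${status.sprint}`);
```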
Token economics matter. When you're paying per-token, efficiency isn't just speed—it's cost.
2. AI Autonomy Requires Human Structure
Paradox: The more structure you provide upfront, the more autonomous AI can be.
In Project A, I gave vague direction and provided constant correction. The AI needed me for every decision because I hadn't defined the boundaries.
In Project B, I spent 5 minutes defining vision (localhost MVP, React+Express, LM Studio, no auth/database). The AI then executed 25 work units with 96% autonomy.
User inputs in Project A: Hundreds of prompts over 40 days
User inputs in Project B: 15 strategic decisions over 22 hours
The difference? I learned to set constraints, not just goals.
3. Quality Gates Are Cheaper Than Rework
Project A pattern: Build → discover issues → fix → discover new issues → fix again
Project B pattern: Review plan → fix issues → build once → verify → ship
In Project B, every work unit went through 7-agent reviews before implementation and after build:
- Vision Agent: "Does this solve the right problem?"
- Scope Agent: "Is this too big?"
- Design Agent: "Is this architecturally sound?"
- Simplicity Agent: "Is this the simplest approach?"
- Testing Agent: "How will we verify this?"
- Validation Agent: "What are the success criteria?"
- Tattle-Tale Agent: "Are the other agents being honest?"
This felt like overhead at first. But it caught 4 P0 blockers (including a command injection vulnerability) before they made it to production.
Cost of prevention: 7 agent reviews per work unit (15 seconds each)
Cost of fixing in production: the cost to discover the issue, the exposure while it went unnoticed, the discussions across InfoSec, Product Management, the business, and support, then replanning the change, developing it, moving it through change management, testing in staging, possibly repeating UAT, and finally deploying.
Quality gates aren't overhead. They're insurance.
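For illustration, here is a minimal sketch of how such a gate could be wired up before implementation starts. The agent names come from the list above; the review function and finding shape are hypothetical, not the actual V2.6 implementation:

```typescript
// Hypothetical plan-review gate: all reviewers run, and any open P0 finding
// blocks implementation.
type Severity = "P0" | "P1" | "P2";

interface Finding {
  agent: string;
  severity: Severity;
  note: string;
}

// Stand-in for asking one review agent to critique a work-unit plan.
type ReviewAgent = (plan: string) => Promise<Finding[]>;

const reviewers: Record<string, ReviewAgent> = {
  Vision: async () => [],      // "Does this solve the right problem?"
  Scope: async () => [],       // "Is this too big?"
  Design: async () => [],      // "Is this architecturally sound?"
  Simplicity: async () => [],  // "Is this the simplest approach?"
  Testing: async () => [],     // "How will we verify this?"
  Validation: async () => [],  // "What are the success criteria?"
  TattleTale: async () => [],  // "Are the other agents being honest?"
};

async function gatePlan(plan: string): Promise<boolean> {
  const findings = (
    await Promise.all(Object.values(reviewers).map(review => review(plan)))
  ).flat();

  const blockers = findings.filter(f => f.severity === "P0");
  blockers.forEach(f => console.error(`[${f.agent}] P0: ${f.note}`));

  // Implementation only proceeds when no P0 findings remain open.
  return blockers.length === 0;
}
```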
4. Metrics Change Behavior
Project A: No visibility into cost, time, or efficiency
Project B: Every work unit tracked (time estimate, actual time, P0/P1/P2 issues)
When I could see that Sprint 2 was running sequentially when it could run in parallel, I asked: "Where are the other tasks you can run in parallel?"
That one prompt saved 1.5 hours (40% time reduction).
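The mechanic behind that saving is simple, as this sketch shows (the `runWorkUnit` helper and the work-unit IDs are hypothetical): units with no shared files or dependencies run at the same time, and only the integrating unit waits.

```typescript
// Hypothetical dispatcher: independent work units run concurrently,
// dependent ones wait for their inputs.
async function runWorkUnit(id: string): Promise<void> {
  console.log(`executing ${id}`);
  // ...an agent would do the actual implementation here...
}

async function runSprint(): Promise<void> {
  // Track A and Track B touch different files and have no dependency
  // on each other, so they can execute simultaneously.
  await Promise.all([runWorkUnit("WU-trackA-1"), runWorkUnit("WU-trackB-1")]);

  // The integration unit depends on both tracks, so it runs last.
  await runWorkUnit("WU-integration");
}

runSprint();
```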
When the final report claimed we were 37% over budget, I asked: "Did you subtract out the time you were waiting on me?"
That correction revealed we were actually 86% under budget.
You can't optimize what you don't measure. And in Project B, everything was measured.
5. Trust Is Earned Through Verification
Bad trust: "The AI said it's done, so it's done."
Good trust: "The AI says it's done. Let me verify with evidence."
In Project B, I trusted the define-and-deploy agents to implement work units autonomously. But I also requested a post-hoc review after the release.
That review discovered 4 P0 blockers:
- P0-1: Command injection vulnerability (used `exec` instead of `spawn`; see the sketch after this list)
- P0-2: Browser-blocking `window.confirm()` in production code
- P0-3: False "Zero Bugs" claim in documentation
- P0-4: False "Production Ready" claim in release notes
The AI had completed the work, but hadn't caught these issues because I hadn't added Security and UX review agents to the workflow.
Lesson: Trust the process, but verify the output. Autonomous doesn't mean unsupervised.
6. The Best Code Is the Code You Don't Write
Project A word count: 15,000 lines of chat, 2,000 lines of code
Project B breakdown:
Code written:
- Production code: 1,294 lines (525 server + 769 client)
- Test code: 4,277 lines (2,009 server + 2,268 client; 177 tests, 100% E2E coverage)
- Configuration: 190 lines (tsconfig, package.json, vite, tailwind)
- **Total code: 5,761 lines**
Documentation written:
- Project docs: 1,982 lines (README, deployment checklist, known limitations, release notes)
- Analysis reports: 18,150 lines (audits, comparisons, security findings, quality metrics)
- Work units: 1,032 lines (25 work unit plans)
- Agent reviews: 1,028 lines (7-agent quality gates × 25 units)
- Workflow docs: 202 lines (status tracking, auto-updated after every commit; session resumes)
- **Total documentation: 22,394 lines**, a complete audit trail and quality evidence (README, API docs, deployment guides, agent reviews)
Code-to-documentation ratio: 1:3.9 (unusually high documentation—most projects are 10:1 code-to-docs)
In Project A, I spent 40 days generating a massive context log.
In Project B, I spent 15 minutes providing strategic direction.
I didn't write more or better prompts. We (Claude and I) built a system that didn't need them.
The Numbers: Why This Matters
Let me show you the efficiency delta in cold, hard metrics:
Time Investment
| Metric | Project A (2025) | Project B (2025) | Improvement |
|---|---|---|---|
| Elapsed calendar time | 40 days | 22.24 hours | 43x faster |
| Active implementation time | Unknown (buried in logs) | 2.55 hours | Measurable vs unmeasurable |
| User active input time | Hours per day | 15 minutes total | ~100x less |
| Context loading per session | 44,500 tokens | 4,000 tokens | 93% reduction |
Cost Economics (at $150/hr developer rate)
| Metric | Project A | Project B | Savings |
|---|---|---|---|
| Estimated implementation cost | $7,200 (48 hrs × $150) | $383 (2.55 hrs × $150) | $6,817 saved |
| Actual delivery time | 40 days | <1 day | 40x faster |
| User oversight cost | Hours × $150/hr | 15 min × $150/hr = $37.50 | Negligible |
Quality Metrics
| Metric | Project A | Project B | Change |
|---|---|---|---|
| Tests | 773 passing | 177 passing | Fewer tests, better coverage |
| Test types | Unit only | Unit + Integration + E2E | Comprehensive |
| P0 blockers at release | Unknown | 0 (after remediation) | Verified safe |
| Documentation completeness | Partial | Complete (22,394 lines) | Production-ready |
| Known vulnerabilities | Unknown | 0 (caught by security review) | Verified secure |
Autonomy Metrics (Project B only)
| Metric | Value |
|---|---|
| Agent autonomy | 96% |
| User strategic input | 4% |
| User leverage ratio | 10:1 (10 hours agent work per 1 hour user time) |
| Work units executed in parallel | 10 (Sprint 2) |
| Time saved by parallelization | 1.5 hours (40% reduction) |
The Uncomfortable Truth: I Was the Bottleneck
Here's what I didn't want to admit after Project A:
The AI wasn't the problem. I was.
- I asked vague questions → got vague answers
- I didn't define boundaries → got scope creep
- I didn't structure the work → got tangled dependencies
- I didn't automate tracking → lost visibility
- I debugged interactively → wasted time on iterations
The AI did exactly what I asked. The problem was I didn't know how to ask effectively.
Project B forced me to confront this because the V2.6 workflow requires precision (see the sketch after this list):
- Work units must be small (1-5 files, single clear objective)
- Acceptance criteria must be testable
- Dependencies must be explicit
- Quality gates must be defined upfront
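Here is a sketch of what that precision can look like in practice. The field names are mine, not necessarily the V2.6 schema, but the constraints mirror the list above:

```typescript
// Hypothetical work-unit definition that encodes the constraints above.
interface WorkUnit {
  id: string;                    // e.g. "WU-007"
  objective: string;             // one sentence, single clear outcome
  files: string[];               // must stay small: 1-5 files
  dependsOn: string[];           // explicit dependencies on other work units
  acceptanceCriteria: string[];  // each must be verifiable by a test or check
  qualityGates: string[];        // reviews required before and after implementation
}

const example: WorkUnit = {
  id: "WU-007",
  objective: "Add a /health endpoint to the Express server",
  files: ["server/src/routes/health.ts", "server/test/health.test.ts"],
  dependsOn: ["WU-003"],
  acceptanceCriteria: ["GET /health returns 200 with { status: 'ok' }"],
  qualityGates: ["Scope", "Design", "Testing"],
};

// A tiny check the workflow could run before accepting the unit.
if (example.files.length > 5 || example.acceptanceCriteria.length === 0) {
  throw new Error(`${example.id} is too big or has no testable criteria`);
}
```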
I couldn't be vague anymore. The system demanded clarity.
And that clarity—more than any prompt engineering trick—is what delivered a production app in 2.55 hours.
What This Means for You (Three Takeaways)
1. Stop Optimizing Prompts, Start Optimizing Process
The difference between 40 days and 2.5 hours wasn't better prompts. It was:
- Structured work units = work broken into chunks small enough for the AI to handle reliably
- Automated quality gates that challenge the AI's plans to reduce hallucinations and overshooting
- Parallel execution for optimal velocity
- Verification workflows to confirm a change didn't break existing functionality, and to log issues when it did
Action: Before writing your next prompt, ask: "What structure would make this prompt unnecessary?"
2. Your Time Should Be Strategic, Not Tactical
In Project B, my 15 minutes of input delivered 2.55 hours of agent work (10:1 leverage).
My inputs were:
- Vision definition (5 min)
- Quality gate triggers (5 min)
- Accuracy corrections (5 min)
I didn't write code. I didn't debug. I didn't configure environments. Like an effective manager working with a team of human experts, I set direction, enforced standards, and built quality checks so that anyone who needed support got it quickly.
Action: If you're spending more than 20% of your time on tactical AI interactions, you need better automation.
3. Quality Gates Are Your Competitive Advantage
The 4 P0 blockers discovered in Project B (command injection, blocking UI, false documentation) would have cost far more to remediate in production.
The 7-agent reviews that caught them? Cost: ~15 seconds per work unit.
Return on investment: ~1000x.
Action: Add review agents to your workflow. Start with Security, UX, Scope, and Testing. Consider Elegance, Code Quality, and a plain Critic. Let them challenge the AI's draft plans before implementation and the freshly added code after it, because any missed issue or rabbit trail compounds, just like in human workflows.
The Path Forward: What I'm Building Next
I'm definitely not done learning. Project B showed me what's possible, but also where I'm still falling short:
Current limitations:
1. I still need to manually trigger some post-hoc reviews (should be automatic but Claude still forgets)
2. Security and UX agents aren't in the default workflow (must be)
3. Parallel execution requires manual identification (could be automatic)
4. Quality gate enforcement is reactive (should be proactive blocking)
Next evolution:
- MEMORY!!! Claude keeps repeating errors. Let's add some related experience to the context
- Orchestrator and Sprint Planner
The goal is not to replace human judgment. It's to reserve human judgment for the 4% of decisions that actually matter.
The Question That Changed Everything
Six months ago, after finishing Project A, I asked myself:
"Why is this so complicated? Am I micro-managing when I don't need to? How can we do this better?"
I thought the answer was better prompts. Faster typing. More ChatGPT Plus or Claude credits.
The real answer was embarrassingly simple:
Stop having conversations. Start building systems.
Systems that:
- Define clear boundaries (work units, not projects)
- Enforce quality gates (reviews before implementation)
- Measure everything (time, cost, P0 count)
- Automate repetition (git hooks, status updates; see the sketch after this list)
- Parallelize when possible (independent work = simultaneous execution)
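As an example of the automation point, a `.git/hooks/post-commit` hook could run a small script like this one (hypothetical, assuming the status.json sketched earlier) so tracking never depends on anyone remembering to update it:

```typescript
// post-commit-status.ts: hypothetical script run by a .git/hooks/post-commit hook.
// It stamps status.json with the latest commit so the next session resumes from
// recorded facts, not memory.
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

const sha = execFileSync("git", ["rev-parse", "--short", "HEAD"]).toString().trim();

const status = JSON.parse(readFileSync("status.json", "utf8"));
status.lastCommit = sha;
status.updatedAt = new Date().toISOString();

writeFileSync("status.json", JSON.stringify(status, null, 2));
console.log(`status.json updated for commit ${sha}`);
```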
Project A was a 40-day conversation with an AI.
Project B was a 2.5-hour system execution.
The AI didn't get smarter. I got more systematic.
And if you're reading this thinking, "But I don't have access to the V2.6 workflow," you're missing the point.
The workflow isn't the innovation. The discipline is.
You can apply these principles with any AI tool (and with the humans trusting you to lead):
1. Define work in small, testable units
2. Review complex plans before implementing
3. Simplify status tracking
4. Measure time and cost so everyone can see where to improve
5. Run parallel work whenever possible to optimize velocity
You don't need my exact setup. You just need the mindset shift.