Nov 5, 2025

From "Got an error. Please investigate" to Building a Production App in 2.5 Hours: My AI Engineering Evolution

aka The Embarrassing Git Diff

Six months ago, I exported the chat history from my first attempt at the standard starter project, an AI chat app built with Roo Code, to see what I'd actually built. The merged transcript was 15,000 lines long. The working code? About 2,000 lines. The ratio bothered me.

But what really made me wince was reading my own prompts.

"Got an error. Please investigate and repair."

"Please get re-oriented."

"How do I get Python to be my default interpreter for VS Code on MacOS?"

I wasn't engineering with AI. I was having a conversation with a very patient tutor who happened to write code. And while I got a working prototype, the process took 40 days of back-and-forth. I'd built something, but I hadn't learned how to build with AI efficiently.

Fast forward to last week. I built a production-ready web chat application—React frontend, Express backend, LM Studio integration, 177 tests, E2E coverage, complete documentation—in 2.55 hours of actual agent work time. Not 40 days. Not 40 hours. 2.55 hours.

What changed wasn't the AI. What changed was me.


The Tale of Two Projects: Same Goal, Different Worlds

Project A: The 40-Day Prototype (March 2025)

What I asked for: A basic AI chat agent. Everyone's first AI Engineering project.

My prompts looked like this:

  • "Update the design document for the OpenAI chat app to enable the user to be presented with the OpenAI model options..."
  • "Enhance the setup environment script to handle errors and exceptions."
  • "I believe I have my python3 setup now. Please verify the environment script uses the python3 command."
  • "Got an error. Please investigate and repair. I also just installed git."

What actually happened:

  • Constant context loss ("Please get re-oriented. I want to return to working on the diagram building agent...")
  • Iterative debugging cycles (error → investigation → repair → new error)
  • Tool configuration questions ("How do I get Python to be my default interpreter?")
  • Design document reviews to check if code matched requirements

Result: Working prototype, 773 tests passing, but 44,500 tokens per work session just to remind the AI what we were building.

Time cost: 40 days of conversations. Unknown actual implementation hours (buried in chat history).

My role: Project manager, debugger, environment troubleshooter, requirements clarifier, documentation writer, integration tester, and occasional coder.

Project B: The 2.5-Hour Production App (2025)

What I asked for: A localhost web chat application with React, Express, and integration with my llm_caller_cli to leverage its LM Studio integration option.

My prompts looked like this:

  • Initial: "Analyze this codebase and create a CLAUDE.md file for the web chat application project. Set up the V2.6 workflow."
  • Strategic: "Where are the other tasks you can run in parallel?"
  • Quality control: "Have the agents review the plans"
  • Correction: "Did you subtract out the time you were waiting on me?"

What actually happened:

  • The AI agent created 25 work units, each with 7-agent reviews (Vision, Scope, Design, Simplicity, Testing, Validation, Tattle-Tale)
  • Define-and-deploy agents autonomously executed: definition → plan review → implementation → output review → commit
  • Parallel execution: an orchestrator workflow delivered 10 work units simultaneously in Sprint 2 (2 independent tracks)
  • Post-commit automation: Test results, status updates, documentation all auto-generated

Result: Production-ready application with zero P0 blockers after remediation, 177 tests passing, E2E coverage, deployment documentation.

Time cost: 2.55 hours of agent work across 25 work units. Total elapsed: 22.24 hours (including sleep and my review time).

My role: Vision setter (5 minutes), strategic director (5 minutes), quality validator (5 minutes). 

Total hands-on time: ~15 minutes.


The Five Levels of AI Engineering (I Went from 1 to 4)

Looking back, I can now see distinct maturity levels in how people work with AI:

Level 1: The Tutorial Student

Prompt pattern: "How do I...?"

AI role: Teacher

Result: You learn, but slowly

Example: "How do I get Python to be my default interpreter for VS Code on MacOS?"

Level 2: The Interactive Debugger

Prompt pattern: "Got an error. Please investigate."

AI role: Debugging partner

Result: Code works eventually

Example: "Please investigate and repair. I also just installed git."

I was here in Project A.

Level 3: The Requirements Manager

Prompt pattern: "Implement X with Y constraints."

AI role: Developer following your spec

Result: Features get built

Example: "Update the design document for the openAI chat app to enable the user to be presented with the OpenAI model options..."

I was still mostly here in Project A.

Level 4: The System Architect

Prompt pattern: "Here's the vision. Execute with these quality gates."

AI role: Autonomous delivery team

Result: Systems get built

Example: "Set up the V2.6 workflow with sprint orchestration and define-and-deploy agents."

I reached here in Project B.

Level 5: The Meta-Engineer

Prompt pattern: "Optimize the workflow itself."

AI role: Self-improving system

Result: Processes get better

Example: I'm not here yet, but the path is clear.


What Actually Changed: The Six Realizations

1. Workflows Matter More Than Prompts

Old approach: Better prompts = better results

New approach: Better process = better results

The V2.6 workflow I used in Project B isn't magic. It's just:

  • Structured work units (small, clear deliverables that don't expect too much of Claude)
  • Quality gates (7 AI agents review plans before implementation and code after)
  • Automated tracking (git hooks update status after every commit)
  • Parallel execution (independent work units run simultaneously)

But that structure eliminated 93% of my context-loading overhead. Instead of re-explaining the project at the start of every session (44,500 tokens), the agent reads an 80-line status.json file (about 4,000 tokens).
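To make that concrete, here is a minimal sketch of what such a compact status file could look like, written as a typed object. The field names and shape are my own illustration, not the actual V2.6 schema; the point is that a few dozen lines can tell the agent where the project stands.

```typescript
// Hypothetical shape of a compact status file. Field names are illustrative,
// not the real V2.6 schema.
interface ProjectStatus {
  project: string;
  currentSprint: number;
  workUnits: {
    id: string;                       // e.g. "WU-014"
    title: string;
    status: "pending" | "in_progress" | "done";
    blockers: string[];               // open P0/P1 issues, if any
  }[];
  lastCommit: string;                 // short SHA written by the post-commit hook
  nextActions: string[];              // what the agent should pick up next session
}

// An 80-line file like this replaces re-explaining the whole project:
const status: ProjectStatus = {
  project: "web-chat-app",
  currentSprint: 2,
  workUnits: [
    { id: "WU-014", title: "Express /chat endpoint", status: "done", blockers: [] },
    { id: "WU-015", title: "React message list", status: "in_progress", blockers: [] },
  ],
  lastCommit: "a1b2c3d",
  nextActions: ["Finish WU-015", "Start WU-016 (LM Studio client)"],
};
```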

Token economics matter. When you're paying per-token, efficiency isn't just speed—it's cost.

2. AI Autonomy Requires Human Structure

Paradox: The more structure you provide upfront, the more autonomous AI can be.

In Project A, I gave vague direction and provided constant correction. The AI needed me for every decision because I hadn't defined the boundaries.

In Project B, I spent 5 minutes defining vision (localhost MVP, React+Express, LM Studio, no auth/database). The AI then executed 25 work units with 96% autonomy.

User inputs in Project A: Hundreds of prompts over 40 days

User inputs in Project B: 15 strategic decisions over 22 hours

The difference? I learned to set constraints, not just goals.

3. Quality Gates Are Cheaper Than Rework

Project A pattern: Build → discover issues → fix → discover new issues → fix again

Project B pattern: Review plan → fix issues → build once → verify → ship

In Project B, every work unit went through 7-agent reviews before implementation and after build:

  • Vision Agent: "Does this solve the right problem?"
  • Scope Agent: "Is this too big?"
  • Design Agent: "Is this architecturally sound?"
  • Simplicity Agent: "Is this the simplest approach?"
  • Testing Agent: "How will we verify this?"
  • Validation Agent: "What are the success criteria?"
  • Tattle-Tale Agent: "Are the other agents being honest?"

This felt like overhead at first. But it caught 4 P0 blockers (including a command injection vulnerability) before they made it to production.

Cost of prevention: 7 agent reviews per work unit (15 seconds each)

Cost of fixing in production: the cost to discover the issue, the exposure while it went unknown, the discussions across InfoSec, Product Management, the business, and support, then replanning the change, developing it, moving it through change management, retesting in staging, possibly repeating UAT, and finally deploying.

Quality gates aren't overhead. They're insurance.
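If you want to picture how such a gate could be wired up, here is a minimal sketch. The seven agent names come from the list above; the `reviewPlan` helper, the `Finding` shape, and the P0-blocking rule are illustrative assumptions of mine, not the actual V2.6 implementation.

```typescript
// Minimal sketch of a pre-implementation quality gate. The agent names come
// from the workflow described above; everything else is a stand-in.
type Severity = "P0" | "P1" | "P2";

interface Finding {
  agent: string;
  severity: Severity;
  note: string;
}

// Stand-in for a call to an LLM with an agent-specific review checklist.
async function reviewPlan(agent: string, plan: string): Promise<Finding[]> {
  // ...send the plan to the model with this agent's checklist, parse findings...
  return [];
}

const REVIEW_AGENTS = [
  "Vision", "Scope", "Design", "Simplicity", "Testing", "Validation", "Tattle-Tale",
];

// Run all reviewers, then block implementation if any P0 finding exists.
async function qualityGate(plan: string): Promise<{ pass: boolean; findings: Finding[] }> {
  const results = await Promise.all(REVIEW_AGENTS.map((agent) => reviewPlan(agent, plan)));
  const findings = results.flat();
  const pass = !findings.some((f) => f.severity === "P0");
  return { pass, findings };
}
```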

4. Metrics Change Behavior

Project A: No visibility into cost, time, or efficiency

Project B: Every work unit tracked (time estimate, actual time, P0/P1/P2 issues)

When I could see that Sprint 2 was running sequentially even though its work units were independent, I asked: "Where are the other tasks you can run in parallel?"

That one prompt saved 1.5 hours (40% time reduction).
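The underlying idea is simple enough to sketch: if each work unit declares what it depends on, everything whose dependencies are already satisfied can run at the same time. The `WorkUnit` shape and `runWorkUnit` stub below are hypothetical, not the orchestrator's real interface.

```typescript
// Illustrative sketch: dispatch every work unit whose dependencies are met,
// instead of running strictly one after another.
interface WorkUnit {
  id: string;
  dependsOn: string[];   // ids of work units that must finish first
}

// Stand-in for handing one work unit to a define-and-deploy agent.
async function runWorkUnit(unit: WorkUnit): Promise<void> {
  // ...define, review, implement, review, commit...
}

async function runSprint(units: WorkUnit[]): Promise<void> {
  const done = new Set<string>();
  let remaining = [...units];

  while (remaining.length > 0) {
    // Everything whose dependencies are already done can run in parallel.
    const ready = remaining.filter((u) => u.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Circular or missing dependency");

    await Promise.all(ready.map(runWorkUnit));
    ready.forEach((u) => done.add(u.id));
    remaining = remaining.filter((u) => !done.has(u.id));
  }
}
```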

When the final report claimed we were 37% over budget, I asked: "Did you subtract out the time you were waiting on me?"

That correction revealed we were actually 86% under budget.

You can't optimize what you don't measure. And in Project B, everything was measured.

5. Trust Is Earned Through Verification

Bad trust: "The AI said it's done, so it's done."

Good trust: "The AI says it's done. Let me verify with evidence."

In Project B, I trusted the define-and-deploy agents to implement work units autonomously. But I also requested a post-hoc review after the release.

That review discovered 4 P0 blockers:

  • P0-1: Command injection vulnerability (used `exec` instead of `spawn`; see the sketch below)
  • P0-2: Browser-blocking `window.confirm()` in production code
  • P0-3: False "Zero Bugs" claim in documentation
  • P0-4: False "Production Ready" claim in release notes
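P0-1 is the most instructive of the four, so here is a hedged sketch of that class of bug in Node.js. The project's actual code isn't shown in this post, and the `--prompt` flag on llm_caller_cli is my own assumption; the general pattern, string interpolation into a shell via `exec` versus argument passing via `spawn`, is the point.

```typescript
import { exec, spawn } from "node:child_process";

// Vulnerable pattern: user input is interpolated into a shell string, so a
// prompt like `"; rm -rf ~` becomes a second command. (The --prompt flag is
// a hypothetical stand-in for the real CLI interface.)
function callCliUnsafe(prompt: string) {
  exec(`llm_caller_cli --prompt "${prompt}"`, (err, stdout) => {
    if (!err) console.log(stdout);
  });
}

// Safer pattern: spawn passes arguments directly to the process with no shell
// in between, so the prompt is only data, never code.
function callCliSafe(prompt: string) {
  const child = spawn("llm_caller_cli", ["--prompt", prompt]);
  child.stdout.on("data", (chunk) => process.stdout.write(chunk));
  child.stderr.on("data", (chunk) => process.stderr.write(chunk));
}
```

With `spawn`, a crafted chat message can't smuggle in a second shell command, which is exactly what the security review flagged.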

The AI had completed the work, but hadn't caught these issues because I hadn't added Security and UX review agents to the workflow.

Lesson: Trust the process, but verify the output. Autonomous doesn't mean unsupervised.

6. The Best Code Is the Code You Don't Write

Project A line count: 15,000 lines of chat, 2,000 lines of code

Project B breakdown:

Code written:

  • Production code: 1,294 lines (525 server + 769 client)
  • Test code: 4,277 lines (2,009 server + 2,268 client; 177 tests, 100% E2E coverage)
  • Configuration: 190 lines (tsconfig, package.json, vite, tailwind)
  • **Total code: 5,761 lines**

Documentation written:

  • Project docs: 1,982 lines (README, API docs, deployment checklist, known limitations, release notes)
  • Analysis reports: 18,150 lines (audits, comparisons, security findings, quality metrics)
  • Work units: 1,032 lines (25 work unit plans)
  • Agent reviews: 1,028 lines (7-agent quality gates × 25 units)
  • Workflow docs: 202 lines (status tracking, auto-updated after every commit, plus session resumes)
  • **Total documentation: 22,394 lines, a complete audit trail and quality evidence**

Code-to-documentation ratio: 1:3.9 (unusually high documentation—most projects are 10:1 code-to-docs)

In Project A, I spent 40 days generating a massive context log.

In Project B, I spent 15 minutes providing strategic direction.

I didn't write more or better prompts. We (Claude and I) built a system that didn't need them.


The Numbers: Why This Matters

Let me show you the efficiency delta in cold, hard metrics:

Time Investment

| Metric | Project A (2025) | Project B (2025) | Improvement |
| --- | --- | --- | --- |
| Elapsed calendar time | 40 days | 22.24 hours | 43x faster |
| Active implementation time | Unknown (buried in logs) | 2.55 hours | Measurable vs unmeasurable |
| User active input time | Hours per day | 15 minutes total | ~100x less |
| Context loading per session | 44,500 tokens | 4,000 tokens | 93% reduction |


Cost Economics (at $150/hr developer rate)

| Metric | Project A | Project B | Savings |
| --- | --- | --- | --- |
| Estimated implementation cost | $7,200 (48 hrs × $150) | $383 (2.55 hrs × $150) | $6,817 saved |
| Actual delivery time | 40 days | <1 day | 40x faster |
| User oversight cost | Hours × $150/hr | 15 min × $150/hr = $37.50 | Negligible |


Quality Metrics

| Metric | Project A | Project B | Change |
| --- | --- | --- | --- |
| Tests | 773 passing | 177 passing | Fewer tests, better coverage |
| Test types | Unit only | Unit + Integration + E2E | Comprehensive |
| P0 blockers at release | Unknown | 0 (after remediation) | Verified safe |
| Documentation completeness | Partial | Complete (22,394 lines) | Production-ready |
| Known vulnerabilities | Unknown | 0 (caught by security review) | Verified secure |


Autonomy Metrics (Project B only)

| Metric | Value |
| --- | --- |
| Agent autonomy | 96% |
| User strategic input | 4% |
| User leverage ratio | 10:1 (10 hours agent work per 1 hour user time) |
| Work units executed in parallel | 10 (Sprint 2) |
| Time saved by parallelization | 1.5 hours (40% reduction) |


The Uncomfortable Truth: I Was the Bottleneck

Here's what I didn't want to admit after Project A:

The AI wasn't the problem. I was.

  • I asked vague questions → got vague answers
  • I didn't define boundaries → got scope creep
  • I didn't structure the work → got tangled dependencies
  • I didn't automate tracking → lost visibility
  • I debugged interactively → wasted time on iterations

The AI did exactly what I asked. The problem was I didn't know how to ask effectively.

Project B forced me to confront this because the V2.6 workflow requires precision:

  • Work units must be small (1-5 files, single clear objective)
  • Acceptance criteria must be testable
  • Dependencies must be explicit
  • Quality gates must be defined upfront

I couldn't be vague anymore. The system demanded clarity.
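One way to enforce that precision is to make each work unit a small, typed artifact instead of a paragraph of intent. The shape below is my own illustration, not the V2.6 format, but every field forces a decision that would otherwise be left for the AI to guess.

```typescript
// Illustrative work-unit shape, not the actual V2.6 format.
interface WorkUnitSpec {
  id: string;                      // "WU-007"
  objective: string;               // one sentence, one outcome
  files: string[];                 // the 1-5 files this unit may touch
  acceptanceCriteria: string[];    // each one must be testable
  dependsOn: string[];             // explicit, never implied
  qualityGates: string[];          // reviews that must pass before implementation
}

// A hypothetical example for the chat app described in this post:
const example: WorkUnitSpec = {
  id: "WU-007",
  objective: "Add a POST /api/chat endpoint that forwards a message to LM Studio",
  files: ["server/src/routes/chat.ts", "server/src/lmstudio/client.ts"],
  acceptanceCriteria: [
    "Returns 200 with the model reply for a valid message",
    "Returns 400 for an empty message body",
  ],
  dependsOn: ["WU-005"],
  qualityGates: ["Scope", "Design", "Testing", "Security"],
};
```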

And that clarity—more than any prompt engineering trick—is what delivered a production app in 2.55 hours.


What This Means for You (Three Takeaways)

1. Stop Optimizing Prompts, Start Optimizing Process

The difference between 40 days and 2.5 hours wasn't better prompts. It was:

  • Structured work units: chunks of work small enough for the AI to handle reliably
  • Automated quality gates that challenge each plan to reduce hallucinations and overshooting
  • Parallel execution for optimal velocity
  • Verification workflows to confirm a change didn't break existing functionality, and to log issues when it did

Action: Before writing your next prompt, ask: "What structure would make this prompt unnecessary?"


2. Your Time Should Be Strategic, Not Tactical

In Project B, my 15 minutes of input delivered 2.55 hours of agent work (10:1 leverage).

My inputs were:

  • Vision definition (5 min)
  • Quality gate triggers (5 min)
  • Accuracy corrections (5 min)

I didn't write code. I didn't debug. I didn't configure environments. Like an effective manager working with a team of human experts, I set direction, enforced standards, and built quality checks so that anyone who needed support got it quickly.

Action: If you're spending more than 20% of your time on tactical AI interactions, you need better automation.


3. Quality Gates Are Your Competitive Advantage

The 4 P0 blockers discovered in Project B (command injection, blocking UI, false documentation) would have cost far more to remediate in production.

The 7-agent reviews that caught them? Cost: ~15 seconds per work unit.

Return on investment: ~1000x.

Action: Add review agents to your workflow. Start with Security, UX, Scope, and Testing; consider Elegance, Code Quality, and a plain Critic. Let them challenge the AI's draft plans before implementation, and the code just added, because any missed issue or rabbit trail compounds, just like in human workflows.


The Path Forward: What I'm Building Next

I'm definitely not done learning. Project B showed me what's possible, but also where I'm still falling short:


Current limitations:

1. I still need to manually trigger some post-hoc reviews (they should be automatic, but Claude still forgets)

2. Security and UX agents aren't in the default workflow (they must be)

3. Parallel execution requires manual identification (it could be automatic)

4. Quality gate enforcement is reactive (it should proactively block)


Next evolution:

  • MEMORY!!! Claude keeps repeating errors. Let's add some related experience to the context.
  • Orchestrator and Sprint Planner

The goal is not to replace human judgment. It's to reserve human judgment for the 4% of decisions that actually matter.


The Question That Changed Everything

Six months ago, after finishing Project A, I asked myself:

"Why is this so complicated?  Am I micro-managing when I don't need to?  How can we do this better?"

I thought the answer was better prompts. Faster typing. More ChatGPT Plus or Claude credits.

The real answer was embarrassingly simple:

Stop having conversations. Start building systems.

Systems that:

  • Define clear boundaries (work units, not projects)
  • Enforce quality gates (reviews before implementation)
  • Measure everything (time, cost, P0 count)
  • Automate repetition (git hooks, status updates)
  • Parallelize when possible (independent work = simultaneous execution)

Project A was a 40-day conversation with an AI.

Project B was a 2.5-hour system execution.

The AI didn't get smarter. I got more systematic.

And if you're reading this thinking, "But I don't have access to the V2.6 workflow," you're missing the point.

The workflow isn't the innovation. The discipline is.

You can apply these principles with any AI tool (and with the humans who trust you to lead):

1. Define work in small, testable units

2. Review complex plans before implementing

3. Simplify status tracking

4. Measure time and cost to help everyone see when to improve

5. Run parallel work whenever possible to optimize velocity


You don't need my exact setup. You just need the mindset shift.

