Dec 5, 2025

The $0.42 Question: When Your AI Needs Guardrails (And When It Doesn't)

 aka Two AIs Walk Into a Chat App... One Brings a Framework, One Brings Confidence

December 4, 2025

 

Why I Ran This Experiment

I've spent months building a workflow system to keep AI agents from shipping sloppy code. Reviews, memory systems, work unit tracking—the whole nine yards. I believed (maybe too strongly) that guardrails prevent problems.

 

Then I decided to test that belief.


I gave Claude the same task twice. Same goal: build a web chat app that talks to LM Studio. Same human (me). Same basic capability.

The first time, I let Claude Opus 4.5 work alone. No workflow. No reviews. Just "build me a chat app" and get out of the way.

The second time, I used the V2.9 workflow—agent reviews, memory queries, parallel work units, the works.

What I got were two working applications built in nearly identical time. The workflow cost $0.42 more. Was it worth it?

That's the question this post explores.

  

The Surface Numbers

| Metric | Claude Alone | V2.9 Workflow |
|---|---|---|
| Time to working app | 18 min | 19 min |
| Estimated token cost | ~$0.83 | ~$1.25 |
| Cost difference | baseline | +$0.42 (50% more) |
| Files created | 4 | 6+ |
| Lines of code* | ~380 | ~360 |
| Bugs found | 3 | 3 |

*Line counts from project assessments at time of delivery. Files may have changed slightly during final edits.

 

At first glance, Claude alone looks like the clear winner. Nearly the same time, 50% cheaper, fewer files to manage. If I stopped here, I'd conclude that the workflow is expensive overhead.


I didn't stop here.

 

What The Bugs Revealed

Both approaches found exactly 3 bugs. But the bugs were different species.

 

Claude Alone's Bugs

1. **Wrong model name** — Hardcoded `local-model` instead of the actual LM Studio model name. Time to find: 3 min. Time to fix: 1 min.

2. **Wrong response parsing** — Expected one JSON format but `llm-call` returns a different structure. In other words: the AI assumed one data format but got another. Time to find: 2 min. Time to fix: 2 min.

3. **Stale server process** — Old instance still running on port 8080 with the buggy code. Time to find: 5 min. Time to fix: 1 min.

 

Investigation-to-fix ratio: 2.5:1 — meaning bugs took 2.5x longer to find than to fix. Ten minutes finding bugs. Four minutes fixing them. All discovered after Claude declared itself "done."
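
To make bug #2 concrete: a few lines of defensive parsing would have tolerated both shapes. This is a minimal sketch, assuming the flat `status`/`response` format Claude expected and the OpenAI-style `choices` array that `llm-call` actually returns; the function and field names are mine, not the project's.

```python
import json

def extract_reply(raw: str) -> str:
    """Pull the assistant text out of an llm-call JSON response.

    Handles two shapes seen in this experiment (field names assumed):
      1. {"status": "ok", "response": "..."}             <- what Claude expected
      2. {"choices": [{"message": {"content": "..."}}]}  <- OpenAI-style output
    """
    data = json.loads(raw)

    # OpenAI-style: choices[0].message.content
    choices = data.get("choices")
    if choices:
        return choices[0].get("message", {}).get("content", "")

    # Flat style: top-level "response" field
    if "response" in data:
        return data["response"]

    raise ValueError(f"Unrecognized response shape: {list(data.keys())}")
```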

 

V2.9 Workflow's Bugs

1. **Integration gap** — The FastAPI endpoint and LLM service were built as independent components. Neither knew about the other. Discovered during user acceptance testing (UAT).

2. **Response parsing** — Similar to Claude alone, caught during integration testing.

3. **Port conflict** — Port 8000 already in use. Caught immediately at startup.

 

Total remediation time: 4 min

 

The difference isn't bug count. It's when and how they manifested.

  

The Investigation Paradox

Claude alone spent 10 minutes in debugging purgatory after claiming victory. The workflow spent that time in reviews before implementation.

 

Think of it like two chefs making the same dish. One tastes as they go, catching a too-salty broth before it ruins dinner. The other cooks fast, then spends 10 minutes frantically adjusting the seasoning at the end. Same total time. Same final quality. But one chef was stressed at the end, and the other controlled the process throughout.

 

The uncomfortable truth: the workflow didn't prevent bugs—it transformed them. Claude alone's bugs were careless errors (wrong names, wrong formats). V2.9's bugs were design gaps (components that work alone but don't fit together). Both require human intervention, but design gaps appear earlier in the process.

 

You're not eliminating problems. You're choosing which problems you want.

 

 

The Architecture That Emerged vs. The Architecture That Was Designed

Here's where the real difference hides.

 

Claude Alone Built This:

├── index.html
├── style.css
├── app.js
└── server.py       ← Python's SimpleHTTPRequestHandler

 

Four files. One server. No dependencies. You could deploy this to a VM with just Python installed and it would work. No build step. No types. You could understand the entire codebase in 5 minutes.
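
I haven't published server.py, so here's only a hedged sketch of what a one-file server in this style looks like: `SimpleHTTPRequestHandler` for the static files plus a tiny POST handler that shells out to the CLI. The endpoint path, the JSON request shape, and the `llm-call` invocation are assumptions for illustration, not the actual code.

```python
#!/usr/bin/env python3
"""Minimal sketch of a one-file chat server in the "Claude alone" style.

Illustrative only: the /api/chat path, the JSON request shape, and how
llm-call accepts input are assumptions, not the actual server.py.
"""
import json
import subprocess
from http.server import HTTPServer, SimpleHTTPRequestHandler

MODEL = "deepseek-coder-v2-lite-instruct"  # bug #1 was a wrong hardcoded name here

class ChatHandler(SimpleHTTPRequestHandler):
    # GET requests fall through to SimpleHTTPRequestHandler and serve the static files.
    def do_POST(self):
        if self.path != "/api/chat":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")

        # Shell out to the llm-call CLI, passing a JSON request on stdin (assumed shape).
        request = {"model": MODEL,
                   "messages": [{"role": "user", "content": body.get("message", "")}]}
        result = subprocess.run(["llm-call"], input=json.dumps(request),
                                capture_output=True, text=True)

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(result.stdout.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), ChatHandler).serve_forever()
```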

 

V2.9 Workflow Built This:

├── main.py                        ← FastAPI with validation
├── src/
│   └── services/
│       └── llm_service.py         ← Service abstraction
├── frontend/
│   ├── src/
│   │   ├── App.tsx                ← TypeScript React
│   │   └── App.css
│   └── vite.config.ts
└── requirements.txt

 

More files. More structure. More dependencies. But each piece has one job, and swapping pieces doesn't require understanding the whole system. The service layer pattern, type signatures, and separation of concerns make the intended structure visible without extensive comments.
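
To show what that separation buys, here's a simplified sketch of the endpoint-plus-service shape. Names like `ChatRequest` and the exact `call_llm` signature are my assumptions based on the file layout above, not the generated code.

```python
# main.py -- HTTP layer only: validation and routing, no LLM details
from fastapi import FastAPI
from pydantic import BaseModel

from src.services.llm_service import call_llm  # the import the sprint initially forgot

app = FastAPI()

class ChatRequest(BaseModel):
    question: str

class ChatResponse(BaseModel):
    answer: str

@app.post("/api/chat", response_model=ChatResponse)
def chat(req: ChatRequest) -> ChatResponse:
    # The endpoint delegates to the service; swapping the LLM backend never touches this file.
    return ChatResponse(answer=call_llm(req.question))


# src/services/llm_service.py -- the only place that knows about the llm-call CLI
import json
import subprocess

def call_llm(question: str) -> str:
    """Send one question to llm-call and return the assistant's reply text."""
    request = {"messages": [{"role": "user", "content": question}]}  # assumed shape
    result = subprocess.run(["llm-call"], input=json.dumps(request),
                            capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    return data["choices"][0]["message"]["content"]  # OpenAI-style response
```

The seam between those two files is exactly what the parallel sprint later forgot to wire, which is the integration bug discussed below.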

 

Same feature. Similar line count. Completely different architecture.

  

The Question Nobody Wants To Answer

This is the tough question every CIO and COO fights to answer in pursuit of optimal technology evolution velocity: which one is better?

 

Here's the uncomfortable answer: It depends on whether this code will outlive this conversation.

 

If This Is Throwaway Code

Claude alone wins. Lower cost, same time, adequate quality. The simple server approach is perfectly fine for demos, exploration, and learning. To be honest about attribution—Claude built exactly what I asked for in exactly the right way for a prototype.

 

If This Code Needs To Grow

V2.9 wins. To add features to Claude alone's solution:

  • Add login? Rewrite the server.
  • Add message history? Refactor the JavaScript.
  • Add another developer? Good luck explaining the implicit architecture.

To add features to V2.9's solution:

  • Add login? Add authentication middleware (about 20 lines; see the sketch after this list).
  • Add message history? New data model, new service method.
  • Add another developer? The architecture documents itself through structure.
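
For a sense of scale on that "about 20 lines" claim, here's a hedged sketch of bolting auth onto the FastAPI version as a dependency. The bearer-token check is a placeholder assumption, not the project's code.

```python
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def require_token(authorization: str = Header(default="")) -> None:
    """Toy bearer-token check; a real app would verify a JWT or a session."""
    if authorization != "Bearer letmein":  # placeholder secret, assumption
        raise HTTPException(status_code=401, detail="Unauthorized")

@app.post("/api/chat", dependencies=[Depends(require_token)])
def chat(payload: dict) -> dict:
    # The existing handler body stays unchanged; only the decorator gains a dependency.
    return {"answer": "..."}
```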

The $0.42 premium bought a production-ready foundation. Whether that's worth it depends on your definition of "done." To put it in CFO terms: a ~50% cost increase sounds significant if you treat tokens as coder hours. But architects, BAs, QA, and change-management resources rarely cost the same as pure coders, so the comparison falls apart once you weigh human labor against AI labor. Measured against total cost of ownership, the 50% increase is a trivial added impact compared to the alternative.

 

Where Did The Extra Tokens Go?

This is the number that stopped me cold.

 

V2.9 consumed roughly 9x more tokens than Claude alone for similar code output.

 

Same delivery time  (18-19 min). Similar lines of code (360-380). Where did all those tokens go?

 

The answer: reviews and planning.

 

The workflow generated 7 vision elements, 11 work units, and 9 builder agent reviews. The agents weren't writing more code—they were thinking about the code. Validating approaches. Catching edge cases. Checking for input validation.

You're not paying for execution speed. You're paying for quality assurance upfront. To put this in perspective: that $0.42 is the cost of structured review. Whether that's expensive depends on whether you're comparing it to prototype budget or production-incident budget.

  

The Integration Bug That Shouldn't Happen (But Did)

Here's the lesson that genuinely surprised me.

 

V2.9's biggest bug was an integration gap. The sprint runner parallelized work units for speed—9 work units executed in 12.5 minutes of wall-clock time, achieving about 3.2x effective parallelism. But parallel work creates parallel ignorance. One work unit built an endpoint. Another built a service. Neither knew the other existed.

 

The irony: the feature I built for speed (parallel execution) created a new class of bugs (integration failures).


Root cause from the project report:

The sprint runner created independent work units that each worked in isolation but were not integrated. The FastAPI endpoint and LLM service were built as separate components with placeholder code, expecting future integration that wasn't explicitly assigned to a work unit.

 

The fix is obvious in hindsight: the Planner should generate explicit integration work units. But I didn't think of that until the bug bit me.
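
A planner-side guard for this can be almost mechanical. Below is a sketch with a hypothetical `WorkUnit` shape and layer names: if the plan spans multiple layers and contains no wiring task, append one.

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    id: str
    title: str
    layer: str  # e.g. "api", "service", "frontend", or "integration"

def ensure_integration_unit(plan: list[WorkUnit]) -> list[WorkUnit]:
    """If parallel units span multiple layers but nothing wires them, add a wiring WU."""
    component_layers = {wu.layer for wu in plan if wu.layer != "integration"}
    has_wiring = any(wu.layer == "integration" for wu in plan)
    if len(component_layers) > 1 and not has_wiring:
        plan.append(WorkUnit(
            id=f"WU-{len(plan) + 1:03d}",
            title="Wire components together and verify the end-to-end flow",
            layer="integration",
        ))
    return plan
```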

This is the pattern: reviews prevent careless errors, but architectural choices create new risks. The workflow didn't fail. My workflow design had a gap.

 

The Decision Framework

After running both approaches, here's the decision framework I'd use—with full knowledge that context matters more than rules:

 

| If Your Code... | Use |
|---|---|
| Will be deleted tomorrow | Claude alone |
| Is for learning or exploration | Claude alone |
| Needs to demo in 20 minutes | Claude alone |
| Will be maintained next month | V2.9 workflow |
| Will have multiple developers | V2.9 workflow |
| Will add features over time | V2.9 workflow |
| Goes to production | V2.9 workflow |

The inflection point is code lifespan. Short-lived code should be cheap. Long-lived code should be structured.

  

The Uncomfortable Truth

The workflow didn't eliminate judgment. It changed what you judge.

 

Without workflow: "Is this code good enough?"

With workflow: "Is this integration complete?"

 

Neither approach removes the human decision. They just move it.

 

Claude alone requires you to catch implementation errors in testing. The workflow requires you to catch integration gaps in planning. Both require you to actually verify that the thing works before declaring victory.

 

The $0.42 question isn't really about cost. It's about where you want to spend your coder's attention—debugging after the fact, or designing upfront.

 

What I Actually Learned

1. **Time was nearly identical.** The workflow's overhead was offset by parallelization. Neither approach is faster for small projects.

2. **Cost scales with thoroughness, not output.** 9x more tokens for similar code means reviews are expensive. Whether that expense is waste or investment depends on code lifespan.

3. **Bug count was identical.** Process doesn't eliminate bugs. It transforms implementation errors into integration gaps. Pick your preferred failure mode.

4. **Architecture was not identical.** This is the real difference. Same line count, completely different maintainability trajectory.

5. **The integration gap was my fault.** The workflow exposed a design flaw in my Planner: it doesn't generate integration tasks. That's on me, not the workflow.

 

The Meta-Lesson

I built a workflow to prevent AI from cutting corners. Then I ran an experiment to see if the corners needed cutting.

 

The answer is nuanced: corners should be cut for throwaway code, and shouldn't be cut for production code. The workflow isn't overhead when you need the structure. It's overhead when you don't.

 

The workflow can't answer the question that must come before it: "What am I actually building—and how long will it need to live?"

 

That's still my job. Turns out, that's the most important one.  

 

Appendix A: Fixing V2.9 Did Not Require a Workflow Upgrade

 

How did I fix the integration failure mode, you ask? Just as I would with a human project team: I added a new anti-pattern to the global pattern memory (introduced in v2.8) so the vision and planner agents can recall it when they work.


Here's what it looked like when I asked v2.9 to store the lesson learned.


Write(/Users/user/.claude/patterns/PARALLEL-INTEGRATION-GAP-001.yaml)
    Wrote 77 lines to /Users/user/.claude/patterns/PARALLEL-INTEGRATION-GAP-001.yaml

pattern_id: PARALLEL-INTEGRATION-GAP-001
title: "Anti-Pattern: Parallel Work Units Without Integration Tasks"
version: '1.0'
type: anti-pattern
category: process
domain:
  - universal
  - workflow-automation
confidence: HIGH
priority: P1
description: |
  Parallel execution of independent work units creates components that work
  in isolation but aren't wired together. Each unit builds to its own spec,
  using placeholder code or mocks, expecting "future integration" that never
  gets explicitly assigned. Result: working parts that don't connect.

detection_signals:
  - "Multiple work units touching different layers (API, service, frontend)"
  - "Work units with placeholder/mock implementations"
  - "No explicit integration or wiring task in the plan"
  - "Components pass unit tests but fail integration tests"
  - "UAT reveals 'X doesn't call Y' bugs"

ai_susceptibility: |
  AI planners optimize for parallelization to reduce wall-clock time. When
  breaking work into independent units, they assume integration is implicit
  or will be handled "naturally." The AI doesn't recognize that parallel
  ignorance compounds - each unit knows its own contract but not how others
  implement theirs.

recommended_action:
  - "When planning parallel work units that must connect, add explicit integration task"
  - "Integration task should: wire components, verify data flows end-to-end"
  - "Integration task depends on all component tasks completing first"
  - "For N components that connect, need at least 1 integration task"
  - "Consider: API + Service + Frontend = 3 components = need integration WU"

example:
  bad: |
    WU-001: Create FastAPI endpoint (returns placeholder)
    WU-002: Create LLM service (callable function)
    WU-003: Create React frontend (calls /api/chat)
    # Result: Frontend calls API, but API doesn't call service

  good: |
    WU-001: Create FastAPI endpoint (returns placeholder)
    WU-002: Create LLM service (callable function)
    WU-003: Create React frontend (calls /api/chat)
    WU-004: Wire endpoint to service, verify end-to-end flow
    # Result: Integration task catches the gap before UAT

keywords:
  - parallel-execution
  - integration
  - work-units
  - planning
  - components
  - wiring
  - end-to-end

evidence:
  projects:
    - name: v2.9-test2
      details: "Sprint runner parallelized 9 WUs. FastAPI endpoint and LLM service built independently. Neither called the other. Discovered in UAT."
      confidence_contribution: HIGH

usage:
  suggested: 0
  applied: 0
  created: '2025-12-04'
  updated: '2025-12-04'

maintenance:
  status: ACTIVE
  next_review: '2026-03-04'


 

Appendix B: V2.9 Workflow Features — Deterministic vs. Non-Deterministic Decision Making

The V2.9 workflow isn't just "more process." It's a deliberate separation of concerns: deterministic guardrails that enforce quality gates, and non-deterministic intelligence that makes judgment calls. This mirrors how traditional software teams operate—junior developers follow checklists while senior architects make design decisions. Incidentally, this is why CIOs have spent years adopting CI/CD pipelines with security and code-quality checks built into the pipeline. Now that LLM-based agents can understand and even write quality SDLC documentation to build agreement among the business, the many other stakeholders, and the developers, we have an opportunity to rapidly build and deploy high-quality code. We just cannot leave it all up to an LLM trained on average coders.

 

The Feature Set

| Feature | Type | Purpose | Traditional SDLC Equivalent |
|---|---|---|---|
| Three-Tier Hierarchy | Deterministic | Route work to appropriate review depth | Project manager triaging tickets |
| Pre-commit Hooks | Deterministic | Block commits without required reviews | CI/CD gates |
| Work Unit Tracking | Deterministic | Enforce atomic, archivable chunks | Sprint backlog items |
| Graph Memory | Non-deterministic | Query past decisions before new ones | Senior dev's institutional knowledge |
| Pattern Library | Non-deterministic | Cross-project wisdom retrieval | Company coding standards |
| Agent Reviews | Non-deterministic | AI judgment on code quality | Code review by peers |
| Greenfield Detection | Deterministic | Skip local memory when irrelevant | "This is a new project" decision |
| Familiarity Scoring | Non-deterministic | Decide research depth per element | "Have we done this before?" |


Deterministic Components (The Guardrails)

These features enforce process without judgment. They're cheap to run—just config parsing and file checks.

 

Three-Tier Hierarchy: Epic → Story → Task. The tier determines review requirements:

  • Epic: Planner + Tattle-Tale reviews (architectural scrutiny)
  • Story: Sprint + Tattle-Tale reviews (implementation planning)
  • Task: Builder review only (quick validation)

 

A typo fix (Task) doesn't need architectural review. A new authentication system (Epic) does. The workflow enforces this automatically.
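
Because the routing is deterministic, it reduces to a lookup with no LLM call. A sketch follows; the review names come from the tier list above, but the function itself is hypothetical.

```python
REQUIRED_REVIEWS = {
    "epic":  ["planner", "tattle-tale"],  # architectural scrutiny
    "story": ["sprint", "tattle-tale"],   # implementation planning
    "task":  ["builder"],                 # quick validation
}

def reviews_for(tier: str) -> list[str]:
    """Deterministic routing: the tier alone decides which reviews must exist."""
    try:
        return REQUIRED_REVIEWS[tier.lower()]
    except KeyError:
        raise ValueError(f"Unknown tier: {tier!r}") from None
```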

 

Pre-commit Hooks: Commits are blocked unless:

  • Required reviews exist for the tier
  • Review artifacts pass frontmatter validation
  • No secrets detected in staged files

 

This is the "cannot ship without signoff" gate that traditional teams enforce via PR approvals.

 

Work Unit Tracking: Every change is wrapped in a work unit with:

  • Unique ID (WU-001, WU-002-01, etc.)
  • Defined scope (files affected)
  • Required reviews (based on tier)
  • Archive on completion

 

This creates an audit trail and forces atomic, describable changes.
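
As a data structure, a work unit is not much more than this. The field names approximate the list above; this is a sketch, not the workflow's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TrackedWorkUnit:
    id: str                      # e.g. "WU-002-01"
    tier: str                    # "epic" | "story" | "task"
    scope: list[str]             # files this unit is allowed to touch
    required_reviews: list[str]  # derived deterministically from the tier
    archived: bool = False       # flipped on completion, then committed

    def complete(self) -> None:
        self.archived = True
```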

 

Non-Deterministic Components (The Intelligence)

These features require AI judgment. They're more expensive but provide the "thinking" that catches design gaps.

 

Graph Memory: A local SQLite database tracking:

  • Architecture nodes (modules, classes, functions)
  • Workflow nodes (work units, sessions, reviews)
  • Decision nodes (ADRs, patterns, constraints)
  • Relationship edges (IMPORTS, DEPENDS_ON, MODIFIED)

 

Before modifying code, the AI queries: "What depends on this?" Before creating a work unit: "Have we solved this before?"
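
Because the graph lives in SQLite, "what depends on this?" is just a query. The sketch below uses assumed table and column names, not the workflow's actual schema.

```python
import sqlite3

def dependents_of(db_path: str, node_name: str) -> list[str]:
    """Names of nodes with an IMPORTS or DEPENDS_ON edge pointing at node_name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT src.name
            FROM edges e
            JOIN nodes src ON src.id = e.source_id
            JOIN nodes dst ON dst.id = e.target_id
            WHERE dst.name = ? AND e.kind IN ('IMPORTS', 'DEPENDS_ON')
            """,
            (node_name,),
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```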

 

Pattern Library: Cross-project wisdom stored in ~/.claude/patterns/. Each pattern documents:

  • Detection signals (how to recognize the situation)
  • Recommended actions (what to do)
  • Evidence (which projects taught us this)

 

The new PARALLEL-INTEGRATION-GAP-001 pattern came from this exact experiment.
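
Retrieval can be as simple as keyword overlap against the YAML front matter. A sketch, assuming PyYAML and the `keywords` field shown in Appendix A:

```python
from pathlib import Path
import yaml  # PyYAML

PATTERN_DIR = Path.home() / ".claude" / "patterns"

def matching_patterns(task_keywords: set[str]) -> list[dict]:
    """Return stored patterns whose keywords overlap with the task at hand."""
    hits = []
    for path in PATTERN_DIR.glob("*.yaml"):
        pattern = yaml.safe_load(path.read_text())
        if task_keywords & set(pattern.get("keywords", [])):
            hits.append(pattern)
    return sorted(hits, key=lambda p: p.get("priority", "P2"))  # P0 first

# Planning parallel work units would now surface PARALLEL-INTEGRATION-GAP-001:
# matching_patterns({"parallel-execution", "planning"})
```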

 

Agent Reviews: Specialized AI agents with focused prompts:

  • Builder: "Is this task atomic? Are there edge cases?"
  • Sprint: "Are these tasks correctly sequenced?"
  • Planner: "Does this epic break into independent stories?"
  • Tattle-Tale: "Do the reviews agree? What's the priority?"

 

Each agent returns P0/P1/P2 findings. P0 blocks progress.

 

Familiarity Scoring: For each element in a vision document:

1. Query graph for existing components

2. Query embeddings for similar past work

3. Query patterns for relevant wisdom

4. Score: High familiarity = light touch, Low = deep research

 

This prevents over-researching familiar territory and under-researching novel problems.
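
Numerically, the score can be a weighted blend of those three signals. A sketch with made-up weights and thresholds:

```python
def familiarity_score(graph_hits: int, embedding_similarity: float, pattern_hits: int) -> float:
    """Blend the three memory signals into a 0..1 familiarity score.

    Weights and caps are illustrative, not the workflow's actual values.
    """
    graph_part = min(graph_hits, 5) / 5        # existing components found in the graph
    pattern_part = min(pattern_hits, 3) / 3    # relevant stored wisdom
    return 0.4 * graph_part + 0.4 * embedding_similarity + 0.2 * pattern_part

def research_depth(score: float) -> str:
    if score >= 0.7:
        return "light touch"        # high familiarity
    if score >= 0.3:
        return "targeted research"
    return "deep research"          # low familiarity / novel problem
```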

 

The Cost Tradeoff

| Component Type | Token Cost | Failure Mode | Recovery Cost |
|---|---|---|---|
| Deterministic | Near zero | Blocked commit | Fix and retry |
| Non-deterministic | ~80% of workflow tokens | Missed design gap | Debug in production |


The V2.9 workflow spends tokens on non-deterministic judgment precisely because debugging production issues costs more than upfront review. The $0.42 premium is the cost of having AI do what a senior developer would do—think before coding.

 

Traditional SDLC Comparison

In a human team, you don't pay senior engineers to run linters. You pay them to:

  • Review architecture decisions
  • Catch integration gaps
  • Remember past mistakes
  • Ask "have we considered...?"

 

The V2.9 workflow separates these the same way:

  • **Deterministic features**: Linting, gating, tracking (cheap, automated)
  • **Non-deterministic features**: Review, memory, judgment (expensive, intelligent) 

The key insight: you can't automate judgment, but you can automate the decision to invoke judgment. Tier selection is deterministic. What the reviewer thinks is not.

  

Appendix C: The Story From Git History, aka Observability and Evaluation Across Experiments Is Key

Here's what actually happened in each approach, reconstructed from git logs and project artifacts.

 

Claude Alone (opusalone)

No git commits. The entire project was built in a single session without version control. The only artifact is the final code and the PROJECT_ASSESSMENT.md written afterward.

 

The Full Transcript (reconstructed from chat history):

 

| Time | Claude's Action | What Actually Happened |
|---|---|---|
| 0:00 | User: "Create a basic web AI chat app" | Claude started planning |
| 0:00 | User interrupted: "use my tool in ../tool/llm_caller_cli" | Redirected to use existing CLI |
| 0:02 | Read llm_caller_cli README (318 lines) | Understood the tool's JSON request format |
| 0:03 | Wrote index.html (37 lines) | Chat interface markup |
| 0:04 | Wrote style.css (201 lines) | Dark theme with animations |
| 0:05 | Wrote app.js (116 lines) | Frontend chat logic |
| 0:06 | Wrote server.py (106 lines) | Python backend calling llm-call |
| 0:07 | chmod +x server.py | Made executable |
| 0:07 | Declared "done" | Listed the files, gave run instructions |


Then the user asked: "Didn't work. Did you test it?"

 

| Time | Claude's Action | What Actually Happened |
|---|---|---|
| 0:08 | Started server in background | `python server.py &` |
| 0:08 | Tested homepage | `curl http://localhost:8080/` — worked |
| 0:09 | Tested chat API | Got error: "Invalid model identifier llama-2-13b-chat" |
| 0:09 | Diagnosed: wrong model name | Server used `DEFAULT_MODEL = "local-model"` |
| 0:10 | Tested llm-call directly | With correct model name — worked! |
| 0:11 | Updated server.py | Changed to deepseek-coder-v2-lite-instruct |
| 0:11 | Also fixed response parsing | llm-call returns OpenAI-style choices array |
| 0:12 | Restarted server, tested again | Same error: llama-2-13b-chat |
| 0:13 | Confused: "The file looks correct" | Checked config files, searched for hardcoded values |
| 0:14 | Ran subprocess directly in Python | Worked perfectly |
| 0:15 | User hint: "are you calling it a CLI like the README says?" | Red herring — it was |
| 0:16 | Verified llm-call is installed | `which llm-call` — yes |
| 0:17 | Restarted server again | Same error |
| 0:18 | Finally checked: `ps aux \| grep server` | Found PID 11459 from 7:59PM — old instance! |
| 0:19 | Killed old process, freed port 8080 | `kill 11459` |
| 0:20 | Started fresh server | Works! |
| 0:21 | Verified with curl | Success response |


The Three Bugs:

 

1. **Wrong model name** (`local-model` → `deepseek-coder-v2-lite-instruct`)

2. **Wrong response parsing** (expected `status/response` but got `choices` array)

3. **Stale server process** (old instance on port 8080)

 

The Investigation-to-Fix Ratio: 2.5:1

 

Claude spent 10 minutes finding bugs and 4 minutes fixing them. The fixes were trivial once identified. The hard part was understanding why the code wasn't working when "the file looks correct."

 

What Didn't Exist:

  • No git commits
  • No planning document
  • No review artifacts
  • No work unit tracking
  • No memory of the session after it ended

 

The "Planning" was Claude's internal reasoning, visible only in the chat transcript. Six months from now, there's no artifact explaining why deepseek-coder-v2-lite-instruct was chosen or why the response parsing looks the way it does.

 

V2.9 Workflow (v2.9-test2)

14 git commits. Each work unit created a commit trail.

 

The Full Transcript (reconstructed from chat history):

 

Phase 1: Vision Creation (~5 min)

 

| Time | What Happened |
|---|---|
| 0:00 | User: `/vision` command |
| 0:00 | Claude: "What is the source? A) New feature B) Existing analysis C) Bug fixes D) Refactoring" |
| 0:01 | User: "A" |
| 0:01 | Claude: "Please tell me: 1. High-level goal? 2. Key components? 3. Files affected? 4. Constraints?" |
| 0:02 | User provided all 4 answers in one message |
| 0:02 | Claude: "Quick clarifications: Framework? CLI usage? Chat style? Project name?" |
| 0:03 | User: "React, CLI app, single Q&A, v2.9-test2-chat-PoC" |
| 0:03 | Claude: Read llm_caller_cli README (318 lines) |
| 0:04 | Claude: Drafted vision with 7 elements |
| 0:05 | User: "save" |
| 0:05 | Vision saved to .claude/vision.md |

Phase 2: Planning (~3 min)

 

| Time | What Happened |
|---|---|
| 0:06 | User: `/planner .claude/vision.md` |
| 0:06 | Error: No venv, no pyyaml installed |
| 0:07 | Claude: Created venv, installed pyyaml |
| 0:07 | Planner ran memory queries for each element |
| 0:08 | Result: 7 elements, all LOW familiarity (greenfield), 21 context items from patterns |
| 0:08 | User: "proceed" |
| 0:09 | Planner generated 11 work units across 1 sprint |

The Work Unit Breakdown:

WU-001    Create FastAPI backend with /api/chat    (Task)
WU-002    Implement subprocess call to llm-call    (Story - parent)
WU-002-01   Plan: subprocess call                  (Task)
WU-002-02   Implement: subprocess call             (Task)
WU-003    Parse JSON response from CLI             (Task)
WU-004    Create React frontend with TypeScript    (Task)
WU-005    Build simple chat UI component           (Task)
WU-006    Add loading state                        (Task)
WU-007    Configure CORS and localhost-only        (Story - parent)
WU-007-01   Plan: CORS config                      (Task)
WU-007-02   Implement: CORS config                 (Task)

 

Phase 3: Sprint Execution (~12.5 min wall-clock, ~40 min agent-time)

 

| Time | What Happened |
|---|---|
| 0:10 | User: `/sprint` |
| 0:10 | Sprint runner launched 9 work units in parallel |
| 0:10 | Claude: Monitoring progress (7/9, 8/9, 9/9...) |
| 0:22 | All 9 work units complete |


Sprint Results:

| Work Unit | Agent Time | Description |
|---|---|---|
| WU-003 | 64.6s | Parse JSON response |
| WU-007-02 | 83.0s | Implement CORS |
| WU-001 | 228.0s | FastAPI backend |
| WU-002-02 | 379.6s | Subprocess call |
| WU-004 | 322.3s | React frontend |
| WU-005 | 277.3s | Chat UI |
| WU-002-01 | 452.1s | Plan subprocess |
| WU-006 | 269.7s | Loading state |
| WU-007-01 | 353.8s | Plan CORS |


Effective parallelism: 3.2x (40 min agent-time in 12.5 min wall-clock)

 

Phase 4: UAT and Integration Bug Discovery (~4 min)

 

| Time | What Happened |
|---|---|
| 0:23 | User: "Start it up so I can UAT" |
| 0:23 | Claude: Started backend (port 8000 in use, switched to 8001) |
| 0:24 | Claude: Updated Vite proxy, started frontend |
| 0:24 | User: "Isn't working...test it" |
| 0:25 | Claude: Tested API; it returned `[Placeholder] Received question:` |
| 0:25 | User: "Wire it in...why didn't you?" |
| 0:26 | Claude: "You're right - the sprint created the pieces but didn't wire them together" |
| 0:27 | Fixed: Added import, wired call_llm() into endpoint |
| 0:27 | Fixed: Response parsing (choices[0].message.content) |
| 0:28 | Restarted backend; working! |

The Integration Gap Explained:

 

The sprint runner created work units for each vision element:

  • WU-001 created `main.py` with a **placeholder** response
  • WU-002-02 created `src/services/llm_service.py` with `call_llm()` function
  • WU-005 created `App.tsx` calling `/api/chat`

 

Each work unit passed its builder review. Each component worked in isolation. But:

  • `main.py` never imported `llm_service`
  • The endpoint returned placeholder text, not LLM responses
  • The frontend called the API, which returned fake data

 

Why didn't the workflow catch this?

 

The planner created independent work units. WU-001 was scoped to "Create endpoint that accepts questions" — and it did. WU-002-02 was scoped to "Implement subprocess call" — and it did. Neither was scoped to "Wire endpoint to service."

 

The sprint runner optimized for parallelism. All 9 units ran simultaneously. Each unit completed its own objective. None knew what the others were building.

 

The fix required human discovery: "Wire it in...why didn't you?"

 

The Git Trail:

 

| Commit | Work Unit | Description |
|---|---|---|
| fa61aa9 | WU-001 | Create FastAPI backend with /api/chat POST endpoint |
| d9468f6 | WU-001 | Archive work unit |
| acd5cf9 | WU-004 | Create React frontend with TypeScript using Vite |
| c3dab2d | WU-004 | Archive work unit |
| fdb2933 | WU-002-02 | Implement subprocess call to llm-call CLI |
| 0843d71 | WU-002-02 | Archive work unit |
| 23c0338 | WU-002-01 | Plan: Implement subprocess call with builder review |
| 1583231 | WU-002-01 | Archive work unit |
| 3b3f6e7 | WU-005 | Build simple chat UI component |
| c7672ba | WU-005 | Archive work unit |
| e114710 | WU-006 | Add loading state while waiting |
| cd14949 | WU-006 | Archive work unit |
| bcfbef2 | WU-007-01 | Configure CORS and ensure localhost-only |
| db6ed75 | WU-007-01 | Archive work unit |

Note: The integration fix (wiring llm_service into main.py) happened interactively after the sprint, not as a tracked work unit. This is the gap the PARALLEL-INTEGRATION-GAP-001 pattern now addresses.

 

The Telling Difference

Claude Alone: No breadcrumbs. If someone asked "why is it built this way?" the answer is "because Claude built it in one shot." No decision trail, no review artifacts, no archived work units.

 

V2.9 Workflow: Full breadcrumbs. Each work unit has:

  • A commit with descriptive message
  • An archive in `.claude/work-units/`
  • Builder review artifacts in `.claude/agent-reviews/`
  • Links to modified files in the graph (via MODIFIED edges)

 

Six months from now, someone could trace WU-005 back to the vision element "Build simple chat UI component" and understand why it exists.

 

Bug Discovery Timeline Comparison

| Claude Alone | V2.9 Workflow |
|---|---|
| Bug 1 found at t+11min (after "done") | Bug 1 (integration gap) found during UAT |
| Bug 2 found at t+14min | Bug 2 (response parsing) found during integration |
| Bug 3 found at t+21min | Bug 3 (port conflict) found immediately at startup |
| All bugs: implementation errors | All bugs: integration/config issues |


The V2.9 bugs weren't "mistakes"—they were gaps in the plan. The workflow surfaced these earlier (during testing phases) rather than later (after declaring victory).

 

The Meta-Lesson

Claude alone produces code. V2.9 produces code and a decision trail.

 

For throwaway code, the decision trail is waste. For production code, the decision trail is the documentation that explains why the code exists and how it evolved.

 

The $0.42 / 50% upcharge in non-deterministic effort bought that trail.

 
