aka Two AIs Walk Into a Chat App... One Brings a Framework, One Brings Confidence
December 4, 2025
Why I Ran This Experiment
I've spent months building a workflow system to keep AI agents from shipping sloppy code. Reviews, memory systems, work unit tracking—the whole nine yards. I believed (maybe too strongly) that guardrails prevent problems.
Then I decided to test that belief.
I gave Claude the same task twice. Same goal: build a web chat app that talks to LM Studio. Same human (me). Same basic capability.
The first time, I let Claude Opus 4.5 work alone. No workflow. No reviews. Just "build me a chat app" and get out of the way.
The second time, I used the V2.9 workflow—agent reviews, memory queries, parallel work units, the works.
What I got were two working applications built in nearly identical time. The workflow cost $0.42 more. Was it worth it?
That's the question this post explores.
The Surface Numbers
| Metric | Claude Alone | V2.9 Workflow |
| --- | --- | --- |
| Time to working app | 18 min | 19 min |
| Estimated token cost | ~$0.83 | ~$1.25 |
| Cost difference | — | +$0.42 (50% more) |
| Files created | 4 | 6+ |
| Lines of code* | ~380 | ~360 |
| Bugs found | 3 | 3 |
*Line counts from project assessments at time of delivery. Files may have changed slightly during final edits.
At first glance, Claude alone looks like the clear winner. Nearly the same time, 50% cheaper, fewer files to manage. If I stopped here, I'd conclude that the workflow is expensive overhead.
I didn't stop here.
What The Bugs Revealed
Both approaches found exactly 3 bugs. But the bugs were different species.
Claude Alone's Bugs
1. **Wrong model name** — Hardcoded `local-model` instead of the actual LM Studio model name. Time to find: 3 min. Time to fix: 1 min.
2. **Wrong response parsing** — Expected one JSON format but `llm-call` returns a different structure. In other words: the AI assumed one data format but got another (see the sketch below). Time to find: 2 min. Time to fix: 2 min.
3. **Stale server process** — Old instance still running on port 8080 with the buggy code. Time to find: 5 min. Time to fix: 1 min.
Investigation-to-fix ratio: 2.5:1 — meaning bugs took 2.5x longer to find than to fix. Ten minutes finding bugs. Four minutes fixing them. All discovered after Claude declared itself "done."
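To make the parsing bug concrete, here is roughly what the mismatch looked like. The field names come from the transcript; the payload text is made up for illustration:

```python
import json

# Example payload in the shape llm-call actually emits (OpenAI-style); the content is invented.
raw_output = '{"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}'
data = json.loads(raw_output)

# What the first version of server.py assumed it would get:
#     {"status": "ok", "response": "..."}
# reply = data["response"]          # KeyError against the real output

# What actually works against llm-call's output:
reply = data["choices"][0]["message"]["content"]
print(reply)  # -> "Hello!"
```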
V2.9 Workflow's Bugs
1. **Integration gap** — The FastAPI endpoint and LLM service were built as independent components. Neither knew about the other. Discovered during user acceptance testing (UAT).
2. **Response parsing** — Similar to Claude alone, caught during integration testing.
3. **Port conflict** — Port 8000 already in use. Caught immediately at startup.
Total remediation time: 4 min
The difference isn't bug count. It's when and how they manifested.
The Investigation Paradox
Claude alone spent 10 minutes in debugging purgatory after claiming victory. The workflow spent that time in reviews before implementation.
Think of it like two chefs making the same dish. One tastes as they go, catching a too-salty broth before it ruins dinner. The other cooks fast, then spends 10 minutes frantically adjusting the seasoning at the end. Same total time. Same final quality. But one chef was stressed at the end, and the other controlled the process throughout.
The uncomfortable truth: the workflow didn't prevent bugs—it transformed them. Claude alone's bugs were careless errors (wrong names, wrong formats). V2.9's bugs were design gaps (components that work alone but don't fit together). Both require human intervention, but design gaps appear earlier in the process.
You're not eliminating problems. You're choosing which problems you want.
The Architecture That Emerged vs. The Architecture That Was Designed
Here's where the real difference hides.
Claude Alone Built This:
├── index.html
├── style.css
├── app.js
└── server.py ← Python's SimpleHTTPRequestHandler
Four files. One server. No dependencies. You could deploy this to a VM with just Python installed and it would work. No build step. No types. You could understand the entire codebase in 5 minutes.
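For flavor, here is a minimal sketch of that kind of server. It is not the actual server.py; the route, request shape, and llm-call invocation are assumptions based on the transcript:

```python
#!/usr/bin/env python3
"""Single-file chat server in the spirit of Claude alone's build (illustrative sketch)."""
import json
import subprocess
from http.server import HTTPServer, SimpleHTTPRequestHandler

MODEL = "deepseek-coder-v2-lite-instruct"  # the corrected model name from the transcript


class ChatHandler(SimpleHTTPRequestHandler):
    # GET requests fall through to SimpleHTTPRequestHandler, which serves
    # index.html, style.css, and app.js straight from the working directory.

    def do_POST(self):
        if self.path != "/api/chat":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        question = json.loads(self.rfile.read(length))["message"]

        # Shell out to the llm-call CLI; the exact request shape is an assumption
        # (the real tool documents its JSON request format in its README).
        request = json.dumps({"model": MODEL, "messages": [{"role": "user", "content": question}]})
        result = subprocess.run(["llm-call"], input=request, capture_output=True, text=True)
        reply = json.loads(result.stdout)["choices"][0]["message"]["content"]

        body = json.dumps({"reply": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ChatHandler).serve_forever()
```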
V2.9 Workflow Built This:
├── main.py                  ← FastAPI with validation
├── src/
│   └── services/
│       └── llm_service.py   ← Service abstraction
├── frontend/
│   ├── src/
│   │   ├── App.tsx          ← TypeScript React
│   │   └── App.css
│   └── vite.config.ts
└── requirements.txt
More files. More structure. More dependencies. But each piece has one job, and swapping pieces doesn't require understanding the whole system. The service layer pattern, type signatures, and separation of concerns make the intended structure visible without extensive comments.
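Here is a condensed sketch of that separation, shown as one combined listing for brevity. It mirrors the real layout (main.py plus src/services/llm_service.py) but trims validation and error handling, and the llm-call invocation is assumed:

```python
# src/services/llm_service.py: the service abstraction (sketch)
import json
import subprocess


def call_llm(question: str) -> str:
    """Send the question to the llm-call CLI and return the model's answer."""
    request = json.dumps({"prompt": question})  # assumed request shape
    result = subprocess.run(["llm-call"], input=request,
                            capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    return data["choices"][0]["message"]["content"]  # OpenAI-style response


# main.py: the FastAPI endpoint (sketch)
# In the real layout this would be `from src.services.llm_service import call_llm`.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    question: str


@app.post("/api/chat")
def chat(req: ChatRequest) -> dict:
    # The endpoint's only job is HTTP and validation; the LLM call lives in the service.
    return {"answer": call_llm(req.question)}
```

Note that this is the wired-up version. As Appendix C shows, the sprint initially shipped the endpoint returning a placeholder string, with no reference to the service at all.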
Same feature. Similar line count. Completely different architecture.
The Question Nobody Wants To Answer
This is the tough question every CIO and COO wrestles with when chasing optimal technology evolution velocity: which one is better?
Here's the uncomfortable answer: It depends on whether this code will outlive this conversation.
If This Is Throwaway Code
Claude alone wins. Lower cost, same time, adequate quality. The simple server approach is perfectly fine for demos, exploration, and learning. To be honest about attribution—Claude built exactly what I asked for in exactly the right way for a prototype.
If This Code Needs To Grow
V2.9 wins. To add features to Claude alone's solution:
- Add login? Rewrite the server.
- Add message history? Refactor the JavaScript.
- Add another developer? Good luck explaining the implicit architecture.
To add features to V2.9's solution:
- Add login? Add authentication middleware (about 20 lines; see the sketch after this list).
- Add message history? New data model, new service method.
- Add another developer? The architecture documents itself through structure.
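To back up the "about 20 lines" claim for the login case, here is a hypothetical API-key middleware for FastAPI. The header name and environment variable are invented for illustration:

```python
import os

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

API_KEY = os.environ.get("CHAT_API_KEY", "")  # hypothetical configuration


@app.middleware("http")
async def require_api_key(request: Request, call_next):
    # Reject API calls that don't present the expected key; let everything else through.
    if request.url.path.startswith("/api/") and request.headers.get("X-API-Key") != API_KEY:
        return JSONResponse(status_code=401, content={"detail": "Unauthorized"})
    return await call_next(request)
```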
The $0.42 premium bought a production-ready foundation. Whether that's worth it depends on your definition of "done." In CFO terms, this is a ~50% cost increase if we treat tokens as pure coder hours. But architects, BAs, QA, and change-management resources rarely cost the same as pure coders, so the comparison falls apart once you weigh human labor against AI labor: measured against total cost of ownership, the 50% token premium is a trivial added cost compared to the alternative.
Where Did The Extra Tokens Go?
This is the number that stopped me cold.
V2.9 consumed roughly 9x more tokens than Claude alone for similar code output.
Same delivery time (18-19 min). Similar lines of code (360-380). Where did all those tokens go?
The answer: reviews and planning.
The workflow generated 7 vision elements, 11 work units, and 9 builder agent reviews. The agents weren't writing more code—they were thinking about the code. Validating approaches. Catching edge cases. Checking for input validation.
You're not paying for execution speed. You're paying for quality assurance upfront. To put this in perspective: that $0.42 is the cost of structured review. Whether that's expensive depends on whether you're comparing it to prototype budget or production-incident budget.
The Integration Bug That Shouldn't Happen (But Did)
Here's the lesson that genuinely surprised me.
V2.9's biggest bug was an integration gap. The sprint runner parallelized work units for speed—9 work units executed in 12.5 minutes of wall-clock time, achieving about 3.2x effective parallelism. But parallel work creates parallel ignorance. One work unit built an endpoint. Another built a service. Neither knew the other existed.
The irony: the feature I built for speed (parallel execution) created a new class of bugs (integration failures).
Root cause from the project report:
The sprint runner created independent work units that each worked in isolation but were not integrated. The FastAPI endpoint and LLM service were built as separate components with placeholder code, expecting future integration that wasn't explicitly assigned to a work unit.
The fix is obvious in hindsight: the Planner should generate explicit integration work units. But I didn't think of that until the bug bit me.
This is the pattern: reviews prevent careless errors, but architectural choices create new risks. The workflow didn't fail. My workflow design had a gap.
The Decision Framework
After running both approaches, here's the decision framework I'd use—with full knowledge that context matters more than rules:
| If Your Code... | Use |
| --- | --- |
| Will be deleted tomorrow | Claude alone |
| Is for learning or exploration | Claude alone |
| Needs to demo in 20 minutes | Claude alone |
| Will be maintained next month | V2.9 workflow |
| Will have multiple developers | V2.9 workflow |
| Will add features over time | V2.9 workflow |
| Goes to production | V2.9 workflow |
The inflection point is code lifespan. Short-lived code should be cheap. Long-lived code should be structured.
The Uncomfortable Truth
The workflow didn't eliminate judgment. It changed what you judge.
Without workflow: "Is this code good enough?"
With workflow: "Is this integration complete?"
Neither approach removes the human decision. They just move it.
Claude alone requires you to catch implementation errors in testing. The workflow requires you to catch integration gaps in planning. Both require you to actually verify that the thing works before declaring victory.
The $0.42 question isn't really about cost. It's about where you want to spend your coder's attention—debugging after the fact, or designing upfront.
What I Actually Learned
1. **Time was nearly identical.** The workflow's overhead was offset by parallelization. Neither approach is faster for small projects.
2. **Cost scales with thoroughness, not output.** 9x more tokens for similar code means reviews are expensive. Whether that expense is waste or investment depends on code lifespan.
3. **Bug count was identical.** Process doesn't eliminate bugs. It transforms implementation errors into integration gaps. Pick your preferred failure mode.
4. **Architecture was not identical.** This is the real difference. Same line count, completely different maintainability trajectory.
5. **The integration gap was my fault.** The workflow exposed a design flaw in my Planner: it doesn't generate integration tasks. That's on me, not the workflow.
The Meta-Lesson
I built a workflow to prevent AI from cutting corners. Then I ran an experiment to see if the corners needed cutting.
The answer is nuanced: corners should be cut for throwaway code, and shouldn't be cut for production code. The workflow isn't overhead when you need the structure. It's overhead when you don't.
The workflow can't answer the question that must come before it: "What am I actually building—and how long will it need to live?"
That's still my job. Turns out, that's the most important one.
Appendix A: Fixing V2.9 Did Not Require a Workflow Upgrade
How did I fix the integration failure mode, you ask? Just as I would with a human project team, I added a new anti-pattern to the global pattern memory (a capability introduced in V2.8) so the vision and planner agents can recall it when they work.
Here's what it looked like when I asked v2.9 to store the lesson learned.
Write(/Users/user/.claude/patterns/PARALLEL-INTEGRATION-GAP-001.yaml)
⎿ Wrote 77 lines to /Users/user/.claude/patterns/PARALLEL-INTEGRATION-GAP-001.yaml

pattern_id: PARALLEL-INTEGRATION-GAP-001
title: "Anti-Pattern: Parallel Work Units Without Integration Tasks"
version: '1.0'
type: anti-pattern
category: process
domain:
  - universal
  - workflow-automation
confidence: HIGH
priority: P1
description: |
  Parallel execution of independent work units creates components that work
  in isolation but aren't wired together. Each unit builds to its own spec,
  using placeholder code or mocks, expecting "future integration" that never
  gets explicitly assigned. Result: working parts that don't connect.
detection_signals:
  - "Multiple work units touching different layers (API, service, frontend)"
  - "Work units with placeholder/mock implementations"
  - "No explicit integration or wiring task in the plan"
  - "Components pass unit tests but fail integration tests"
  - "UAT reveals 'X doesn't call Y' bugs"
ai_susceptibility: |
  AI planners optimize for parallelization to reduce wall-clock time. When
  breaking work into independent units, they assume integration is implicit
  or will be handled "naturally." The AI doesn't recognize that parallel
  ignorance compounds - each unit knows its own contract but not how others
  implement theirs.
recommended_action:
  - "When planning parallel work units that must connect, add explicit integration task"
  - "Integration task should: wire components, verify data flows end-to-end"
  - "Integration task depends on all component tasks completing first"
  - "For N components that connect, need at least 1 integration task"
  - "Consider: API + Service + Frontend = 3 components = need integration WU"
example:
  bad: |
    WU-001: Create FastAPI endpoint (returns placeholder)
    WU-002: Create LLM service (callable function)
    WU-003: Create React frontend (calls /api/chat)
    # Result: Frontend calls API, but API doesn't call service
  good: |
    WU-001: Create FastAPI endpoint (returns placeholder)
    WU-002: Create LLM service (callable function)
    WU-003: Create React frontend (calls /api/chat)
    WU-004: Wire endpoint to service, verify end-to-end flow
    # Result: Integration task catches the gap before UAT
keywords:
  - parallel-execution
  - integration
  - work-units
  - planning
  - components
  - wiring
  - end-to-end
evidence:
  projects:
    - name: v2.9-test2
      details: "Sprint runner parallelized 9 WUs. FastAPI endpoint and LLM service built independently. Neither called the other. Discovered in UAT."
      confidence_contribution: HIGH
usage:
  suggested: 0
  applied: 0
created: '2025-12-04'
updated: '2025-12-04'
maintenance:
  status: ACTIVE
  next_review: '2026-03-04'
Appendix B: V2.9 Workflow Features — Deterministic vs. Non-Deterministic Decision Making
The V2.9 workflow isn't just "more process." It's a deliberate separation of concerns: deterministic guardrails that enforce quality gates, and non-deterministic intelligence that makes judgment calls. This mirrors how traditional software teams operate: junior developers follow checklists while senior architects make design decisions. It's also why CIOs have spent years adopting CI/CD pipelines with security and code-quality checks built into the pipeline. Now that LLM-based agents can understand and even write quality SDLC documentation, and use it to gain agreement between the business, the many other stakeholders, and the developers, we have an opportunity to rapidly build and deploy high-quality code. We just can't leave it all up to an LLM trained on average coders.
The Feature Set
| Feature | Type | Purpose | Traditional SDLC Equivalent |
| --- | --- | --- | --- |
| Three-Tier Hierarchy | Deterministic | Route work to appropriate review depth | Project manager triaging tickets |
| Pre-commit Hooks | Deterministic | Block commits without required reviews | CI/CD gates |
| Work Unit Tracking | Deterministic | Enforce atomic, archivable chunks | Sprint backlog items |
| Graph Memory | Non-deterministic | Query past decisions before new ones | Senior dev's institutional knowledge |
| Pattern Library | Non-deterministic | Cross-project wisdom retrieval | Company coding standards |
| Agent Reviews | Non-deterministic | AI judgment on code quality | Code review by peers |
| Greenfield Detection | Deterministic | Skip local memory when irrelevant | "This is a new project" decision |
| Familiarity Scoring | Non-deterministic | Decide research depth per element | "Have we done this before?" |
Deterministic Components (The Guardrails)
These features enforce process without judgment. They're cheap to run—just config parsing and file checks.
Three-Tier Hierarchy: Epic → Story → Task. The tier determines review requirements:
- Epic: Planner + Tattle-Tale reviews (architectural scrutiny)
- Story: Sprint + Tattle-Tale reviews (implementation planning)
- Task: Builder review only (quick validation)
A typo fix (Task) doesn't need architectural review. A new authentication system (Epic) does. The workflow enforces this automatically.
Pre-commit Hooks: Commits are blocked unless:
- Required reviews exist for the tier
- Review artifacts pass frontmatter validation
- No secrets detected in staged files
This is the "cannot ship without signoff" gate that traditional teams enforce via PR approvals.
Work Unit Tracking: Every change is wrapped in a work unit with:
- Unique ID (WU-001, WU-002-01, etc.)
- Defined scope (files affected)
- Required reviews (based on tier)
- Archive on completion
This creates an audit trail and forces atomic, describable changes.
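As a data structure, a work unit is small. A sketch of the record the list above describes (field names are illustrative):

```python
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    id: str                     # e.g. "WU-002-01"
    description: str
    tier: str                   # "epic", "story", or "task"
    files: list[str] = field(default_factory=list)              # defined scope
    required_reviews: list[str] = field(default_factory=list)   # derived from tier
    archived: bool = False      # flipped on completion, producing the audit trail
```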
Non-Deterministic Components (The Intelligence)
These features require AI judgment. They're more expensive but provide the "thinking" that catches design gaps.
Graph Memory: A local SQLite database tracking:
- Architecture nodes (modules, classes, functions)
- Workflow nodes (work units, sessions, reviews)
- Decision nodes (ADRs, patterns, constraints)
- Relationship edges (IMPORTS, DEPENDS_ON, MODIFIED)
Before modifying code, the AI queries: "What depends on this?" Before creating a work unit: "Have we solved this before?"
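A sketch of the "what depends on this?" query against such a graph. The table and column names are assumptions about the SQLite schema; only the edge types come from the list above:

```python
import sqlite3


def dependents_of(db_path: str, node_name: str) -> list[str]:
    """Return nodes that IMPORT or DEPEND_ON the given node (hypothetical schema)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT src.name
        FROM edges e
        JOIN nodes src ON src.id = e.src_id
        JOIN nodes dst ON dst.id = e.dst_id
        WHERE dst.name = ? AND e.kind IN ('IMPORTS', 'DEPENDS_ON')
        """,
        (node_name,),
    ).fetchall()
    con.close()
    return [name for (name,) in rows]


# Before touching the service, the planner could ask:
# dependents_of(".claude/graph.db", "src/services/llm_service.py")
```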
Pattern Library: Cross-project wisdom stored in ~/.claude/patterns/. Each pattern documents:
- Detection signals (how to recognize the situation)
- Recommended actions (what to do)
- Evidence (which projects taught us this)
The new PARALLEL-INTEGRATION-GAP-001 pattern came from this exact experiment.
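Retrieval over that library can be as simple as keyword overlap. A sketch, where the matching strategy is an assumption and only the file location and keywords field come from the pattern format shown in Appendix A:

```python
import pathlib

import yaml  # pyyaml, the same dependency the planner installs

PATTERN_DIR = pathlib.Path.home() / ".claude" / "patterns"


def relevant_patterns(task_keywords: set[str]) -> list[dict]:
    """Return patterns whose keywords overlap the task's keywords."""
    matches = []
    for path in PATTERN_DIR.glob("*.yaml"):
        pattern = yaml.safe_load(path.read_text())
        if task_keywords & set(pattern.get("keywords", [])):
            matches.append(pattern)
    return matches


# A planner breaking work into parallel units might call:
# relevant_patterns({"parallel-execution", "integration", "work-units"})
# and surface PARALLEL-INTEGRATION-GAP-001.
```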
Agent Reviews: Specialized AI agents with focused prompts:
- Builder: "Is this task atomic? Are there edge cases?"
- Sprint: "Are these tasks correctly sequenced?"
- Planner: "Does this epic break into independent stories?"
- Tattle-Tale: "Do the reviews agree? What's the priority?"
Each agent returns P0/P1/P2 findings. P0 blocks progress.
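The blocking rule itself is mechanical; a small sketch, with the finding structure assumed:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    priority: str   # "P0", "P1", or "P2"
    agent: str      # e.g. "builder", "tattle-tale"
    message: str


def can_proceed(findings: list[Finding]) -> bool:
    # P0 findings block progress; P1/P2 are recorded but don't stop the work unit.
    return not any(f.priority == "P0" for f in findings)
```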
Familiarity Scoring: For each element in a vision document:
1. Query graph for existing components
2. Query embeddings for similar past work
3. Query patterns for relevant wisdom
4. Score: High familiarity = light touch, Low = deep research
This prevents over-researching familiar territory and under-researching novel problems.
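A hedged sketch of how those signals might combine into a score. The weights and thresholds are invented; only the three inputs mirror the steps above:

```python
def familiarity_score(graph_hits: int, embedding_similarity: float, pattern_hits: int) -> float:
    """Blend the three memory signals into a 0-1 familiarity score (illustrative weights)."""
    graph_signal = min(graph_hits / 5, 1.0)      # existing components found in the graph
    pattern_signal = min(pattern_hits / 3, 1.0)  # relevant patterns in the library
    return 0.4 * graph_signal + 0.4 * embedding_similarity + 0.2 * pattern_signal


def research_depth(score: float) -> str:
    # High familiarity gets a light touch; low familiarity gets deep research.
    if score >= 0.7:
        return "light"
    if score >= 0.3:
        return "standard"
    return "deep"
```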
The Cost Tradeoff
| Component Type | Token Cost | Failure Mode | Recovery Cost |
| --- | --- | --- | --- |
| Deterministic | Near zero | Blocked commit | Fix and retry |
| Non-deterministic | ~80% of workflow tokens | Missed design gap | Debug in production |
The V2.9 workflow spends tokens on non-deterministic judgment precisely because debugging production issues costs more than upfront review. The $0.42 premium is the cost of having AI do what a senior developer would do—think before coding.
Traditional SDLC Comparison
In a human team, you don't pay senior engineers to run linters. You pay them to:
- Review architecture decisions
- Catch integration gaps
- Remember past mistakes
- Ask "have we considered...?"
The V2.9 workflow separates these the same way:
- **Deterministic features**: Linting, gating, tracking (cheap, automated)
- **Non-deterministic features**: Review, memory, judgment (expensive, intelligent)
The key insight: you can't automate judgment, but you can automate the decision to invoke judgment. Tier selection is deterministic. What the reviewer thinks is not.
Appendix C: The Story From Git History, aka Observability and Evaluation Across Experiments Is Key
Here's what actually happened in each approach, reconstructed from git logs and project artifacts.
Claude Alone (opusalone)
No git commits. The entire project was built in a single session without version control. The only artifact is the final code and the PROJECT_ASSESSMENT.md written afterward.
The Full Transcript (reconstructed from chat history):
| Time | Claude's Action | What Actually Happened |
| --- | --- | --- |
| 0:00 | User: "Create a basic web AI chat app" | Claude started planning |
| 0:00 | User interrupted: "use my tool in ../tool/llm_caller_cli" | Redirected to use existing CLI |
| 0:02 | Read llm_caller_cli README (318 lines) | Understood the tool's JSON request format |
| 0:03 | Wrote index.html (37 lines) | Chat interface markup |
| 0:04 | Wrote style.css (201 lines) | Dark theme with animations |
| 0:05 | Wrote app.js (116 lines) | Frontend chat logic |
| 0:06 | Wrote server.py (106 lines) | Python backend calling llm-call |
| 0:07 | chmod +x server.py | Made executable |
| 0:07 | Declared "done" | Listed the files, gave run instructions |
Then the user asked: "Didn't work. Did you test it?"
| Time | Claude's Action | What Actually Happened |
| --- | --- | --- |
| 0:08 | Started server in background | python server.py & |
| 0:08 | Tested homepage | curl http://localhost:8080/ — worked |
| 0:09 | Tested chat API | Got error: "Invalid model identifier llama-2-13b-chat" |
| 0:09 | Diagnosed: wrong model name | Server used DEFAULT_MODEL = "local-model" |
| 0:10 | Tested llm-call directly | With correct model name — worked! |
| 0:11 | Updated server.py | Changed to deepseek-coder-v2-lite-instruct |
| 0:11 | Also fixed response parsing | llm-call returns OpenAI-style choices array |
| 0:12 | Restarted server, tested again | Same error: llama-2-13b-chat |
| 0:13 | Confused: "The file looks correct" | Checked config files, searched for hardcoded values |
| 0:14 | Ran subprocess directly in Python | Worked perfectly |
| 0:15 | User hint: "are you calling it a CLI like the README says?" | Red herring — it was |
| 0:16 | Verified llm-call is installed | which llm-call — yes |
| 0:17 | Restarted server again | Same error |
| 0:18 | Finally checked: ps aux \| grep server | Found PID 11459 from 7:59PM — old instance! |
| 0:19 | Killed old process, freed port 8080 | kill 11459 |
| 0:20 | Started fresh server | Works! |
| 0:21 | Verified with curl | Success response |
The Three Bugs:
1. **Wrong model name** (`local-model` → `deepseek-coder-v2-lite-instruct`)
2. **Wrong response parsing** (expected `status/response` but got `choices` array)
3. **Stale server process** (old instance on port 8080)
The Investigation-to-Fix Ratio: 2.5:1
Claude spent 10 minutes finding bugs and 4 minutes fixing them. The fixes were trivial once identified. The hard part was understanding why the code wasn't working when "the file looks correct."
What Didn't Exist:
- No git commits
- No planning document
- No review artifacts
- No work unit tracking
- No memory of the session after it ended
The "Planning" was Claude's internal reasoning, visible only in the chat transcript. Six months from now, there's no artifact explaining why deepseek-coder-v2-lite-instruct was chosen or why the response parsing looks the way it does.
V2.9 Workflow (v2.9-test2)
14 git commits. Each work unit created a commit trail.
The Full Transcript (reconstructed from chat history):
Phase 1: Vision Creation (~5 min)
| Time | What Happened |
| --- | --- |
| 0:00 | User: /vision command |
| 0:00 | Claude: "What is the source? A) New feature B) Existing analysis C) Bug fixes D) Refactoring" |
| 0:01 | User: "A" |
| 0:01 | Claude: "Please tell me: 1. High-level goal? 2. Key components? 3. Files affected? 4. Constraints?" |
| 0:02 | User provided all 4 answers in one message |
| 0:02 | Claude: "Quick clarifications: Framework? CLI usage? Chat style? Project name?" |
| 0:03 | User: "React, CLI app, single Q&A, v2.9-test2-chat-PoC" |
| 0:03 | Claude: Read llm_caller_cli README (318 lines) |
| 0:04 | Claude: Drafted vision with 7 elements |
| 0:05 | User: "save" |
| 0:05 | Vision saved to .claude/vision.md |
Phase 2: Planning (~3 min)
| Time | What Happened |
| --- | --- |
| 0:06 | User: /planner .claude/vision.md |
| 0:06 | Error: No venv, no pyyaml installed |
| 0:07 | Claude: Created venv, installed pyyaml |
| 0:07 | Planner ran memory queries for each element |
| 0:08 | Result: 7 elements, all LOW familiarity (greenfield), 21 context items from patterns |
| 0:08 | User: "proceed" |
| 0:09 | Planner generated 11 work units across 1 sprint |
The Work Unit Breakdown:
WU-001 Create FastAPI backend with /api/chat (Task)
WU-002 Implement subprocess call to llm-call (Story - parent)
WU-002-01 Plan: subprocess call (Task)
WU-002-02 Implement: subprocess call (Task)
WU-003 Parse JSON response from CLI (Task)
WU-004 Create React frontend with TypeScript (Task)
WU-005 Build simple chat UI component (Task)
WU-006 Add loading state (Task)
WU-007 Configure CORS and localhost-only (Story - parent)
WU-007-01 Plan: CORS config (Task)
WU-007-02 Implement: CORS config (Task)
Phase 3: Sprint Execution (~12.5 min wall-clock, ~40 min agent-time)
| Time | What Happened |
| --- | --- |
| 0:10 | User: /sprint |
| 0:10 | Sprint runner launched 9 work units in parallel |
| 0:10 | Claude: Monitoring progress (7/9, 8/9, 9/9...) |
| 0:22 | All 9 work units complete |
Sprint Results:
| Work Unit | Agent Time | Description |
| --- | --- | --- |
| WU-003 | 64.6s | Parse JSON response |
| WU-007-02 | 83.0s | Implement CORS |
| WU-001 | 228.0s | FastAPI backend |
| WU-002-02 | 379.6s | Subprocess call |
| WU-004 | 322.3s | React frontend |
| WU-005 | 277.3s | Chat UI |
| WU-002-01 | 452.1s | Plan subprocess |
| WU-006 | 269.7s | Loading state |
| WU-007-01 | 353.8s | Plan CORS |
Effective parallelism: 3.2x (40 min agent-time in 12.5 min wall-clock)
Phase 4: UAT and Integration Bug Discovery (~4 min)
| Time | What Happened |
| --- | --- |
| 0:23 | User: "Start it up so I can UAT" |
| 0:23 | Claude: Started backend (port 8000 in use, switched to 8001) |
| 0:24 | Claude: Updated Vite proxy, started frontend |
| 0:24 | User: "Isn't working...test it" |
| 0:25 | Claude: Tested API → returns [Placeholder] Received question: |
| 0:25 | User: "Wire it in...why didn't you?" |
| 0:26 | Claude: "You're right - the sprint created the pieces but didn't wire them together" |
| 0:27 | Fixed: Added import, wired call_llm() into endpoint |
| 0:27 | Fixed: Response parsing (choices[0].message.content) |
| 0:28 | Restart backend → Working! |
The Integration Gap Explained:
The sprint runner created work units for each vision element:
- WU-001 created `main.py` with a **placeholder** response
- WU-002-02 created `src/services/llm_service.py` with `call_llm()` function
- WU-005 created `App.tsx` calling `/api/chat`
Each work unit passed its builder review. Each component worked in isolation. But:
- `main.py` never imported `llm_service`
- The endpoint returned placeholder text, not LLM responses
- The frontend called the API, which returned fake data
Why didn't the workflow catch this?
The planner created independent work units. WU-001 was scoped to "Create endpoint that accepts questions" — and it did. WU-002-02 was scoped to "Implement subprocess call" — and it did. Neither was scoped to "Wire endpoint to service."
The sprint runner optimized for parallelism. All 9 units ran simultaneously. Each unit completed its own objective. None knew what the others were building.
The fix required human discovery: "Wire it in...why didn't you?"
The Git Trail:
| Commit | Work Unit | Description |
| --- | --- | --- |
| fa61aa9 | WU-001 | Create FastAPI backend with /api/chat POST endpoint |
| d9468f6 | WU-001 | Archive work unit |
| acd5cf9 | WU-004 | Create React frontend with TypeScript using Vite |
| c3dab2d | WU-004 | Archive work unit |
| fdb2933 | WU-002-02 | Implement subprocess call to llm-call CLI |
| 0843d71 | WU-002-02 | Archive work unit |
| 23c0338 | WU-002-01 | Plan: Implement subprocess call with builder review |
| 1583231 | WU-002-01 | Archive work unit |
| 3b3f6e7 | WU-005 | Build simple chat UI component |
| c7672ba | WU-005 | Archive work unit |
| e114710 | WU-006 | Add loading state while waiting |
| cd14949 | WU-006 | Archive work unit |
| bcfbef2 | WU-007-01 | Configure CORS and ensure localhost-only |
| db6ed75 | WU-007-01 | Archive work unit |
Note: The integration fix (wiring llm_service into main.py) happened interactively after the sprint, not as a tracked work unit. This is the gap the PARALLEL-INTEGRATION-GAP-001 pattern now addresses.
The Telling Difference
Claude Alone: No breadcrumbs. If someone asked "why is it built this way?" the answer is "because Claude built it in one shot." No decision trail, no review artifacts, no archived work units.
V2.9 Workflow: Full breadcrumbs. Each work unit has:
- A commit with descriptive message
- An archive in `.claude/work-units/`
- Builder review artifacts in `.claude/agent-reviews/`
- Links to modified files in the graph (via MODIFIED edges)
Six months from now, someone could trace WU-005 back to the vision element "Build simple chat UI component" and understand why it exists.
Bug Discovery Timeline Comparison
| Claude Alone | V2.9 Workflow |
| --- | --- |
| Bug 1 found at t+11min (after "done") | Bug 1 (integration gap) found during UAT |
| Bug 2 found at t+14min | Bug 2 (response parsing) found during integration |
| Bug 3 found at t+21min | Bug 3 (port conflict) found immediately at startup |
| All bugs: implementation errors | All bugs: integration/config issues |
The V2.9 bugs weren't "mistakes"—they were gaps in the plan. The workflow surfaced them earlier (during testing phases) rather than later (after declaring victory).
The Meta-Lesson
Claude alone produces code. V2.9 produces code and a decision trail.
For throwaway code, the decision trail is waste. For production code, the decision trail is the documentation that explains why the code exists and how it evolved.
The $0.42 / 50% upcharge in non-deterministic effort bought that trail.