May 6, 2026

Three Months to Get Back to 37 Minutes

Alternate Title: What Anthropic's .claude Permissions, Opus 4.7, and a Pentest Taught Me About Scaffolding Claude

Act I: The 37 Minutes

On April 6, 2026 at 12:58:25 in the afternoon, I committed an empty V3.1 workflow scaffold to a fresh project I called v3.1-test5. I typed one sentence into the orchestrator skill: "AI web chat MVP where users can register, log in, ask a question and get an AI answer, with an admin interface to manage users." I pasted in a paragraph of success criteria, and watched.

I watched all 37 minutes and 31 seconds of it.

At 13:35:56 the same afternoon, the [Vision Complete] commit landed. Three work units delivered, parallel worktrees merged, one auto-resolved bug, a working Flask MVP with auth + admin + LLM Q&A, and, because the security-planner skill was already wired in, a Threat Model section in the vision document with a route-authorization matrix the LLM had filled in itself. The chat MVP ran. Logging in worked. Asking the local LLM a question returned an answer. The admin page listed users. The whole thing was secure-enough by inspection.

37 minutes. One sentence in. Working app out. I sat there at my desk and just watched it work.

That was the high-water mark.

Two days later, on April 8, I filed BUG-378: "Worktree storage in .claude/ triggers permission approval pauses; move to prescientflow-artifacts/worktrees/." A week after that, BUG-403: "Worktree subagent edits to project files still trigger permission prompts." Within a month, around fifty bugs in the BUG-340..BUG-440 range existed for one reason: the autonomy that built the chat MVP in 37 minutes was no longer autonomous.

Anthropic had tightened the permissions on .claude/. They were right to do it. It still broke my dream: a mostly autonomous, solid team of agents supporting "vibe coders" who don't have the time or technical know-how to produce detailed specs and fine-grained design requirements, yet still want a build that's closer to production-ready.

Act II: Why PrescientFlow Exists in the First Place

If you're new to this blog, the workflow now has a name: PrescientFlow. It's the subject of most posts in this series since The $0.42 Question in December and I Caught Opus Cutting Corners Again in February. It's a workflow system built on top of Claude Code that takes a vision document and runs it through five-agent reviews, a memory-aware planner, dependency-batched parallel work units, post-sprint QA, and a [Vision Complete] commit at the end. It exists because in early 2025 I realized something simple and inconvenient: a workflow with hooks to validate the effectiveness of Claude's output is essential to actually getting useful code out of it, because an AI trained on average code produces exactly what a solid SDLC is designed to reject: average code. Not nice-to-have. Essential.

I built PrescientFlow because I had a real project to build. Several, actually.

I'd taken on a couple of small projects to help friends with their challenges: internal tools, nothing big but not simple either, the kind of thing where the requirements were clear but the dependencies were tangled (e.g. Archiva, an agentic RAG for knowledge workers processing complex analysis against complex requirements; coming soon). And I'd started building my future business site and service: Riskjuggler.ai. The site is the long-term thing. The friend projects were short-term obligations. I needed both done.

What I learned building the friend projects was that PrescientFlow had to grow up and learn from real complexity before it could touch Riskjuggler.ai. The friend projects had memory needs (What files have we already touched? What bugs surfaced before? What ADRs constrain this?), context-engineering needs (which slice of code does this work unit (WU) need vs. which slice would just burn tokens?), dependency needs (this WU has to land before that one). This forced the workflow to develop a graph memory store, a planner that reads it, a sprint orchestrator that batches by dependency.

The friend projects were the proving ground. By the end of March 2026 they were 80%+ done. PrescientFlow was finally ready. Riskjuggler.ai was next.

That's the project I was elbow-deep in when, in early April, the prompts started coming.

Act III: The Floor Moved (Twice)

The first time, I was working in Riskjuggler.ai. Claude Code started prompting me to approve every git command, every python3 invocation, every mkdir. The settings.json my own deployment script had generated didn't have a permissions block (I'd grown comfortable with --dangerously-skip-permissions because I learned long ago how to ask for work safely). It had skill definitions and hooks, sure, but no allow array. So every command was prompting. Approve. Approve. Approve.

I checked the Anthropic release notes. There it was: tighter scrutiny on .claude/ directory writes regardless of defaultMode, stricter default-deny posture for previously accepted commands. As a security professional, my reaction was great job, smart move: .claude/ contains skill definitions and hooks that shape agent behavior, and treating writes to it as privileged is exactly the right posture. As a workflow builder aiming for a safe and comfortable user experience, watching my orchestrator try to spawn worktrees inside .claude/worktrees/ and prompt me three times per work unit, my reaction was oh, this is bad.

The first floor move was the permissions change. The second was Opus 4.7 itself. It came out in the middle of all of this and it was, in my testing, more inclined to pause and verify my decisions than 4.6 had been. I suspect this is a model-training difference - ” RL'd behavior toward second-guessing - ” and --auto mode (which I'd already added to several skills for orchestrator-driven runs) became a fight against the model's defaults rather than a collaboration with them. It's a clean line I'll keep saying to myself: I'm clearly not building a workflow the same way Anthropic and others are designing for that still assumes the existence of other SMEs in the pipeline and specific tooling (e.g. Github PRs) and possibly the "Ralph Wiggam loop" to bang through possibles (and tokens) till it gets something that works. All that is fine for small objectives fixing parts of a large, existing solution but I wanted to build whole apps from a Natural Language, non-technical prompt if not a well defined Product Manager's spec.

I pushed the friend projects to 100% in two days after getting the orchestrator going manually, dodging prompts. The slowdown on Riskjuggler.ai, the project the workflow was actually built to enable, was the part that hurt.

Act IV: Sand Pebbles, Not a Beach

If you go look at the git log of PrescientFlow for April-May 2026, you won't see one big "Anthropic Compatibility" pull request. You'll see roughly 280 bug numbers between BUG-340 and BUG-617: sand pebbles, not a beach. Some of the high-leverage moves were:

  • Splitting .claude/ from prescientflow-artifacts/. The workflow plumbing (skills, hooks, scripts, settings) stayed in .claude/, properly scoped under the new permissions. The actual project artifacts (vision documents, work units, reviews, QA reports, planner outputs, graph memory database, sprint history, completed work units, bugs) moved to a top-level prescientflow-artifacts/ directory that doesn't trigger the hardened scrutiny and is gitignored in case the builder doesn't want to commit them to their repos. This was BUG-340 through BUG-345 and a couple dozen siblings, executed over a week.
  • --auto mode on every skill that participates in the orchestrator pipeline. Vision had no --auto mode. Planner's approval gate fired before the orchestrator could suppress it. Sprint had a "Ready? Say start" prompt. QA asked "Proceed with QA?". Each one was a separate bug (BUG-346, BUG-347, BUG-348, BUG-358). Each one was a separate fight against the model's tendency to verify-before-acting, especially after Opus 4.7. None of them were hard individually. The hard part was that there were so many of them (and my memory system was not designed right for my workflow project; more about that later).
  • A deployment-time permissions block in settings.json. BUG-353 (April 4): "Project settings.json deploys without permissions block; users prompted for every Bash command." The fix was a dontAsk mode with an explicit allow-list of workflow commands and a deny-list of destructive ones. Unlisted commands are silently denied (secure-by-default), but workflow commands run unattended.
  • A graph-memory uplift. This one I didn't expect. The workflow's graph memory had been built for code - modules, functions, imports, calls. When PrescientFlow's own workflow-dev repo started feeding the planner, the planner kept producing weaker plans for skill-and-script-heavy work units than it did for app-code work units. The reason was simple: the graph schema was code-shaped. So we generalized it - workflow-dev projects now populate skill nodes, hook nodes, ADR nodes, and the planner gets the same richness of context for an MCP-removal sprint that it gets for a Flask MVP. Until that fix landed, a lot of bugs surfaced manually that should have been caught at plan time.
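For readers who haven't seen one, the deployment-time permissions block described above takes roughly this shape. This is a sketch, not the file PrescientFlow ships: the specific allow/deny rule strings are illustrative, and "dontAsk" mirrors the post's own name for the mode; Claude Code permission rules do follow the Tool(pattern) form.

```json
{
  "permissions": {
    "defaultMode": "dontAsk",
    "allow": [
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(python3:*)",
      "Bash(mkdir:*)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(git push --force:*)"
    ]
  }
}
```

The design choice is the default, not the lists: anything not explicitly allowed is denied without a prompt, so the orchestrator never stalls waiting for a human, and anything genuinely destructive is named in the deny-list as a second layer.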

That's three architectural moves and one I-didn't-see-it-coming. Not a complete list. The full list lives in the bug archive.

The other strategic addition during this period was the security-planner.

I'd been testing Dan Miessler's Personal AI Infrastructure alongside my own workflow. Dan is a super-experienced InfoSec pro, and PAI has a built-in pentest capability I could use. I pointed it at my Riskjuggler.ai build. It found quite a few failures (multiple Criticals and Highs!), things that wouldn't look good on a security and IT pro's site. That was the catalyst.

I'd known for a long time that running a security review on every work unit alongside the regular Vision/Scope/Design/Testing/Tattle-Tale gate would be heavy token burn for low ROI on internal-only projects, same as in a human SDLC. So the security-planner was added conditionally. It fires when the vision document has a Threat Model section or when work-unit titles mention auth, payment, session, login, password, PII. Internal-only friend projects: it doesn't fire. Internet-exposed Riskjuggler.ai (coming soon): it fires. The skill loads ADR-SEC-questions.yaml, makes one narrow LLM call per applicable question per WU, and writes findings to walkthrough_findings[] in the planner output. Phase 2.5.5 of the orchestrator routes P0 findings into amended work units; Phase 2.5.6 routes P1 into auto-created follow-ups.
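The trigger logic is simple enough to sketch. This is a hypothetical reconstruction, not PrescientFlow's actual code; the function and constant names are mine:

```python
import re

# Hypothetical reconstruction of the security-planner's conditional
# trigger; names are illustrative, not PrescientFlow's identifiers.
SECURITY_KEYWORDS = {"auth", "payment", "session", "login", "password", "pii"}

def should_run_security_planner(vision_text: str, wu_titles: list[str]) -> bool:
    """Fire when the vision has a Threat Model section, or when any
    work-unit title mentions a security-sensitive area."""
    if re.search(r"^#+\s*Threat Model", vision_text, re.MULTILINE):
        return True
    for title in wu_titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        if words & SECURITY_KEYWORDS:
            return True
    return False
```

The point of gating on cheap string checks is that the expensive part (one narrow LLM call per applicable question per WU) only happens when the project plausibly needs it.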

This was the security gate I built specifically because average Claude-generated code is exactly what you'd expect: average code that isn't secure by default, since security is rarely taught, or at least not taught thoroughly, in CS programs and only gets layered in when devs get feedback from an InfoSec professional. So I built the InfoSec professional scaffold into the workflow.

That gate, it turns out, had a P0 bug.

Act V: The Pentest That Caught My Gate

Yesterday, May 5, 2026, I pointed PAI at a fresh deployment of v3.1-test30, a chat app the workflow had built end-to-end on an HTTPS port. The pentest report came back with 0 Critical, 1 High, 2 Medium, 3 Low, and 22 controls clean.

The High was "Session not invalidated on logout." You log in, capture the session cookie, hit POST /logout, then re-issue the captured cookie against /admin. The server returns HTTP 200 with the admin dashboard, because logout deletes the client-side cookie via Set-Cookie: session=; Max-Age=0 but maintains no server-side revocation. A captured token survives logout for the full 8-hour Max-Age.

It was more than one High if I'm being honest with you about what the model produces by default. The session-replay was the headline. There were also Mediums for prompt injection on the LLM endpoint and for TLS 1.2 offering only a non-AEAD cipher (CBC + HMAC-SHA1, no AES-GCM). Lows for weak password policy, missing HttpOnly on the CSRF cookie, and open registration. The full report is in /Volumes/claude/security-reports/.

My first reaction wasn't "the security-planner missed this". It was "the security-planner should have caught this. Let me trace it."

Long story short: I just had to add a better question and fix a bug that was causing the planner not to read the entire security-planner guidance.

The bigger lesson is the one I keep coming back to. The security-planner gate, when it works, asks most of the right questions. But the gate working exposes the next problem: even prompted to think hard, even with the planner's culture of outcome-based titles and lean prompts and atomic transitions and a Tattle-Tale reviewer that synthesizes the other four perspectives, Claude isn't reinforcement-learned to threat-model thoroughly without scaffolding. When asked "can a captured token be replayed after logout?" via a structured walkthrough, the model gives you a usable answer. When asked "build a chat app with auth", the model gives you a chat app with auth that doesn't invalidate sessions on the server side. The reviewer skills (Vision: right problem? Scope: atomic, in-bounds? Design: patterns, no regressions? Testing: falsifiable coverage? Tattle-Tale: synthesis) won't catch it on their own. They weren't designed to.

The continuous human-curated security checklist isn't a feature of the workflow. It's the deal.

The Deal

Anthropic's permission tightening was right. My workflow's autonomy promise was the wrong abstraction. Three months of recovery taught me that scaffolding the model isn't optional: even with a security gate, an Opus 4.7 RL'd to think hard, and a planner that values atomic specs and outcome-based titles, the model produces average code that isn't secure by default. The pentest that exposed BUG-607 was the proof. Continuous human-curated security checklists aren't a feature. They're the deal.

Two commitments:

The security checklist gets published and maintained. ADR-SEC-questions.yaml started with 30 questions across four ADRs. It's going to grow every time a pentest finds something the existing question bank should have caught, or when Claude enables a better model or skill we can call to take care of this for us. The WU-2196/2197/2198 follow-ups added Session Lifecycle bullets to the vision template (Token storage, Storage trade-off rationale, Constant-time comparison, Transport (TLS) coupling) because SEC-002-Q1 and SEC-002-Q8 came back addressed=no in the walkthrough and we didn't have the language in the template to fix them at the source. That ratchet is going to keep tightening. I'll post the question bank publicly as it evolves.

Riskjuggler.ai gets built next. The reason I'm writing this post is that PrescientFlow is finally, finally, back to being able to do what it did on April 6 in 37 minutes, except now with the security-planner findings actually persisted; the threat model, security planner, and red-team spot-checks; the UAT-lead's auth-flow attack-pattern TC bank that auto-injects logout-replay and session-fixation tests whenever a vision exposes auth routes; and the worktree merger that auto-fixes mechanical archive drift.

All of which is to say: I'm not trying to change the world. I'm trying to enable vibe coders (folks who can describe what they want but can't yet build it themselves) to deliver more complete and more production-ready demos than they could before, at a cost that's at least defensible against the alternative of just hiring it out.

What you'll get out of this workflow, if you adopt it, isn't a guarantee of secure code. It's a structured production of work-unit specs, test plans, QA documentation, and architecture artifacts that a skilled production engineer can pick up and reuse, because it's doing what we've always done: managing IT risk through a workflow that supports the devs with perspectives they didn't get on their own, while deterministically validating their output to assure quality and security. If you need it, the product-manager and architect skills will generate common documentation that feeds the SDLC intake on the human side. The QA test plans feed the QA team. The non-technical vibe coder (or the super-busy technical specialist, like me, needing time for other priorities) gets to ship a working demo, and the prod engineer doesn't start from scratch; they start from a spec, a test plan, and code that's already passed five-agent review.

That's the offer. It's smaller than autonomy. It's also more honest about where the model actually is.

37 minutes is back. The thing I built it for is next.

Technical Appendix A: The Recovery in Numbers

| Metric | Pre-storm (v3.1-test5, April 6) | Post-storm (current, May 6) |
| --- | --- | --- |
| Bugs filed since Feb 6 post | – | ~280 (BUG-340 to BUG-617) |
| P0-severity bugs in security-planner persistence path | 0 known | 1 (BUG-607, fixed via WU-2184) |
| Vision Complete cycle time, single-WU sprints | ~37 min (auth + admin + LLM Q&A) | Comparable; security-planner adds ~3 min |
| Reviewer perspectives per WU | 5 (Vision/Scope/Design/Testing/Tattle-Tale) | 5 (unchanged) |
| Conditional gates after Phase 4 (QA) | Smoke test, Red-team, Fix-bugs micro-sprint | Same, with telemetry-emitting dispatcher for 4.6 (BUG-611) |
| Threat-model questions in ADR-SEC bank | Initial 30 | 30 + 3 P1 follow-ups landed (Token storage, Storage trade-off, Constant-time comparison) |
| Red-team runtime checks | 4 (timing, headers, verb-disclosure, CSRF tokens) | 8 (added session-replay, TLS-AEAD, prompt-injection canary, HttpOnly on auth cookies; BUG-608) |

The full PrescientFlow codebase is at https://github.com/Riskjuggler/PrescientFlow. 

Apr 1, 2026

Building Archiva: Nine Months on the RAG Frontier

On January 5, 2026, I ran my first proper evaluation against a 30-query test suite and got back a number that stopped me cold: 33% Top-1 accuracy. The commit message reads “CRITICAL REGRESSION (33% Top-1 vs 80% target).” One in three queries returning the correct answer. For a system I had been building for four months.

That number was the most useful thing the project had produced to that point. Not because it was good. Because it was real. Everything before it had been intuition and demo-quality spot checks. Everything after it would be measured.

Nine months, 4,582 commits, 1,801 work units, and 249 archived bugs later, that 33% became 87%. This is the story of how, and what I got wrong along the way.

Act 1: Before We Knew What We Didn’t Know (June-September 2025)

The idea started taking shape around mid-2025. I spent a couple of months researching RAG architectures, reading papers on hybrid retrieval, experimenting with embedding models. By the time I created the repository in late September, I had convinced myself I understood the problem. I did not. But I had enough momentum to start building, which turned out to matter more.

The first commit landed on September 21, 2025 – fourteen commits in a single day, twelve modules scaffolded between 12:53 PM and 4:55 PM. document_assistant, document_categorizer, document_knowledge_graph, query_classifier, pdf_embedder, docx_embedder, hybrid_search, fts_indexer, a knowledge graph builder, an entity extractor, a training platform.

The idea was straightforward: take a pile of PCI-DSS compliance documents, chunk them, embed them, and let users search with natural language. Vector search plus BM25 keyword matching, combine the scores, return results.
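For readers unfamiliar with hybrid retrieval, one common way to "combine the scores" is reciprocal-rank fusion. This is a sketch of the technique, not Archiva's actual fusion code (the post doesn't say which method it uses), with the conventional k=60 constant:

```python
# Reciprocal-rank fusion: each retriever contributes 1/(k + rank) per
# chunk; k=60 is the conventional default, not Archiva's tuning.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is a list of chunk IDs, best first; returns a
    single fused ranking, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the fact that cosine similarities and BM25 scores live on incomparable scales, which is exactly why naive score addition tends to let one retriever dominate.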

It was a lot of infrastructure for a search box. But nobody told me (and I would not have believed them if they had) that the search box is the simplest part. Everything between the user pressing Enter and seeing results – the query rewriting, the intent classification, the candidate retrieval, the score fusion, the reranking, the chunk expansion – that is where the engineering lives. And that is where the bugs live too.

52 commits by the end of September. An end-to-end integration by the 27th. A client orchestrator backend by the 29th. It felt like progress.

Act 2: The Rewrite (October-November 2025)

October was humbling. Besides Opus 4.5 launching and materially changing my workflow (see The Guardrails I Built to Stop AI From Breaking My Code (And Why I Needed Them)), the month produced 388 commits, nearly all of them architecture analysis and remediation. I ran the codebase through a multi-batch architecture review – eight batches, phases 1 through 40, with three-agent review panels. The results were not kind. Schema inconsistencies between modules. The embedder format needed migration to a unified schema. The CLI needed significant remediation. That was the price of building with raw Claude while not wanting to put a ton of effort into PRDs to guide it, because I was on a quest (multiple quests, actually) to add an abstraction layer that would help “vibe coders” produce better outcomes.

This is when “Archiva” got its name. October 24, buried in an OSS readiness report, the project stopped being “document-processing-modules” intended to accelerate my most common workflows in compliance and risk assessment and became something with an identity. This is also when the first work units appeared – not yet the numbered format that would come later, but named identifiers: WU-AUDIT-001, WU-CLI-005 through WU-CLI-013. My Claude workflow was finding its shape.

November continued the restructuring. 436 more commits. document_assistant became output_orchestrator. hybrid_search moved to retrieval_orchestrator. The prompt management system got centralised. The OutputFormatter infrastructure landed. None of this was visible to users. All of it was necessary.

This is the phase that separates projects that ship from projects that demo well once. And I will be honest: it was tedious. Two months of renaming modules and migrating schemas does not make for exciting commit messages. But a system you are going to iterate on for six more months needs consistent naming and centralised configuration. The embedding mismatch bug proved that (more on this shortly).

The work units were still using named IDs – WU-SEARCH-001, WU-PROMPT-xxx, WU-014B-xxx. The formal sprint system did not exist yet. I was managing work through ad-hoc planning and manual tracking. It worked. Until December, when it very much did not. Along the way the workflow began to accelerate (and fail faster) - see From "Got an error. Please investigate" to Building a Production App in 2.5 Hours: My AI Engineering Evolution – and got memory - see  How I Gave My AI Agents a Memory—And Built a Full-Stack App in 1 Hour - the essential workflow component that made completing Archiva, with it’s high complexity and need for matching what goes in to what comes out of a different piece of code, possible.

Act 3: The December Explosion (December 2025)

1,841 commits in 31 days. Nearly 60 per day. Four times the volume of any other month. Most nights after work I sat with my laptop on my lap, watching TV with the family, riding out the much longer autonomous-workflow wait states until it needed me to activate the next stage.

Two things happened simultaneously: the sprint validation system went live on December 15, and work units switched to numeric IDs starting at WU-050. The sprint-validation.jsonl log tells the story – 1,534 entries spanning December 15 through January 26, capturing every work unit execution from WU-076 to WU-724.

The early entries are full of “No commit hash found in agent output” warnings and “Expected file missing: TBD” messages. The validation system was bootstrapping itself while validating real work. By late January, those same entries show model metadata, execution durations, fallback chains, and quality scores. The workflow self-learned to run by walking.

The Embedding Mismatch

One of the earliest bugs was also one of the most instructive about a weakness that has to be resolved either by an experienced, diligent human describing the goal in simple terms, or by a workflow that can expand a problem statement into those simple, perfectly sized chunks that make AI decisions efficient, accurate, and less non-deterministic. So what happened? Because each piece was built separately without giving Claude much input, the ingestion pipeline was using nomic-embed-text to create document embeddings and the search pipeline was using snowflake-arctic-embed to create query embeddings. Two different models. Incompatible vector spaces.

The system “worked.” It returned results. They were just wrong. Not spectacularly wrong – wrong enough to feel plausible, which is worse. If your search returns garbage, you know it is broken. If it returns plausible-but-incorrect passages, human biases lead us to trust it, and we end up making bad decisions (which is why an AIGP, or at least effective AI governance, is essential for success, not just for passing compliance audits).

The fix was one line in a config file. The damage was weeks of building on a broken foundation without knowing it. Architecture lesson learned: centralise your model configuration, or prepare to debug phantom failures that look like algorithm problems but are actually config problems. And of course, make sure your AI coder can see all the dependencies.
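The lesson in code form, as a hedged sketch rather than Archiva's actual config: one constant, imported by both pipelines, so the vector spaces cannot diverge silently. Module and model names are illustrative.

```python
# One source of truth for the embedding model, shared by the
# ingestion and query pipelines. Names here are illustrative.
EMBEDDING_MODEL = "nomic-embed-text"

def embed_document(text: str, embed_fn) -> list[float]:
    # Ingestion path.
    return embed_fn(EMBEDDING_MODEL, text)

def embed_query(text: str, embed_fn) -> list[float]:
    # Query path: same constant, so the query vector is guaranteed to
    # live in the same space as the document vectors it is compared to.
    return embed_fn(EMBEDDING_MODEL, text)
```

If the two pipelines had each hard-coded their own model string, nothing in the type system or the runtime would have flagged the mismatch; centralising the constant turns a silent semantic bug into an impossibility.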

Act 4: The 33% Wake-Up Call (January 2026)

January brought 472 commits and a steady progression from WU-500 through WU-724. Threshold calibration, entity extraction, compound query handlers. The kind of work that feels productive.

Then, on January 5, WU-616: the first proper 30-query evaluation. 33% Top-1 accuracy. The commit message includes “CRITICAL REGRESSION” in caps, which is git for screaming.

The number was not a regression from a known good state. It was the first time I had measured the state at all. The “regression” was from my assumption about how well the system worked to the reality of how well it actually worked. Those are different things, and the gap between them was 47 percentage points.

This was the moment the evaluation framework went from “nice to have” to “the only thing that matters.” I had been building features for four months. I should have been building measurements. It is also when I added BUG records to the memory and required the planner to review BUGs as well as the graph DB when building work-unit context, to avoid loops where Claude just keeps experimenting, often repeating itself, trying to work out the real root cause of a complex failure.

The sprint validation log ends on January 26 with WU-724 – “Tier classification confidence thresholds align with snowflake-arctic score ranges.” That was the last work unit under the old workflow system.

By January the workflow had reached version 2.9.4, its eighth major iteration in four months. I had iterated the workflow almost as aggressively as the search system, because it was the only way to get effective, consistent progress toward a complex target.

Act 5: Re-platforming the Factory (February 2026)

Eight Versions in Five Months

A brief archaeology, because the V3.0 story does not make sense without it.

V2.2 (late September 2025) introduced structured work units and seven-agent reviews. V2.3 added hard schema validation. V2.4 added sprint orchestration. V2.6 brought an embedding-based memory system. V2.8 went lean – cut the script count from 45 to 12, trimmed reviews from seven agents to five. V2.9 added graph memory, the three-tier hierarchy (Epic/Story/Task), and a dedicated Planner agent.

By the time V2.9.4 shipped in late December, the workflow had grown into a respectable custom framework: 136 files in the scripts directory (95 Python, plus shell scripts and supporting files) totalling about 2.6MB, a 241KB sprint orchestrator (sprint_runner.py), agent prompt templates stored as markdown files, and – critically – a monolithic bash pre-commit hook of nearly 1,000 lines that served as the primary deterministic enforcement gate.

That pre-commit hook was the keystone. A thousand lines of bash parsing work unit IDs from commit messages, looking up work unit files to determine their tier, checking for review files in .claude/agent-reviews/, validating config version fields, verifying LM Studio availability, and enforcing tier-specific review requirements. Everything ran through git commit. If you could not commit, the gate held. If you could commit, everything was assumed to be fine.

A git-commit-oriented gate only catches problems at commit time. It cannot prevent an LLM from generating bad content in the first place. It cannot validate a plan before implementation begins. It cannot check whether a vision document has file paths before Claude spends an hour implementing it. The gate was at the end of the pipeline, not along the way.

BUG-016 was the proof. A work unit that “fixed” a threading bug had actually reverted a previous fix. The pre-commit hook checked that review files existed – and they did. The reviews said “looks good.” The regression went undetected for 51 days across 19 work units. The gate passed because the gate only checked structure, not semantics. And the reviews passed because the same LLM that wrote the code was reviewing the code, with no separation of duties enforced by the infrastructure.

What V3.0 Actually Changed

V3.0 did not invent work units, reviews, or sprints. Those had existed since V2.2. What V3.0 did was re-platform the enforcement mechanism from custom scripts and a monolithic git hook to Claude Code’s native primitives.

The mapping:

| V2.9 (Custom) | V3.0 (Native) |
| --- | --- |
| 136 scripts (~2.6MB) | ~37 focused Python validators |
| sprint_runner.py (241KB orchestrator) | /sprint skill + Task tool |
| Markdown agent templates in ~/.claude/templates/ | Skills in .claude/skills/ (19 skills) |
| Bash pre-commit hook (~1,000 lines) | Claude hooks + lean git hooks + orchestrate.py |
| Custom memory scripts | MCP graph-memory server |
The philosophical shift was deeper than the tooling change. V2.9’s enforcement was git-centric: “can you commit?” V3.0’s enforcement is validator-centric: “did the deterministic check pass?” Each validator is a focused Python script under 50 lines – validate_vision.py, validate_planner.py, validate_sprint.py – coordinated by orchestrate.py in validate-remediate loops. If validation fails, the orchestrator formats the error and sends it back to Claude for remediation. The LLM never sees the validation logic. The validation logic never trusts the LLM.
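The validate-remediate loop can be sketched in a few lines. This is my illustration of the pattern, not orchestrate.py itself; the function names and the retry cap are assumptions:

```python
# Illustration of the validate-remediate pattern, not orchestrate.py
# itself; names and the retry cap are assumptions.
def validate_remediate(generate, validate, max_attempts: int = 3):
    """generate(feedback) -> artifact; validate(artifact) -> error
    string, or None when the deterministic check passes."""
    feedback = None
    for _ in range(max_attempts):
        artifact = generate(feedback)
        error = validate(artifact)  # plain Python: no LLM, no context window
        if error is None:
            return artifact
        feedback = error  # formatted error goes back for remediation
    raise RuntimeError(f"validation still failing: {feedback}")
```

The separation of duties lives in the signature: generate is the non-deterministic LLM call, validate is a deterministic script, and neither ever sees the other's internals.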

The deeper principle is one I keep coming back to: deterministic validation of non-deterministic outputs. Claude generates code; a Python script validates it. Claude proposes a plan; a schema checker verifies the required fields. Claude writes a commit; a pre-commit hook checks for scope creep and secret exposure. The LLM creates. The script validates. The script does not have a context window. The script does not cut corners at 100K tokens. The script does not get talked out of its rules.

V3.0 also reinstated the full five-agent review gate at all tiers (Vision, Scope, Design, Testing, Tattle-Tale), enforced by policy after BUG-016. V2.9 had optimised reviews down to “builder-only” at the task tier – faster, yes, but that optimisation is what let the 51-day regression through. Bureaucracy must exist to manage chaotic good/neutral/evil actors in any system, right?

The Migration

The V3.0 workflow was built in its own project – improvingclaudeworkflow-v3.0 – across 60 Claude Code sessions. Meanwhile, the tools/llm-caller-cli project consumed another 26 sessions for the shared LLM calling infrastructure. 177 Claude Code sessions total across all related projects. (Yes, I counted.)

The gap in Archiva’s work unit IDs tells the story of where my attention went: WU-724 (January 26) jumps to WU-1406 (February 16). Those 680 missing IDs were consumed by the workflow system’s own development. Building the factory before resuming production.

On February 16, the commit message reads: “Save work before V3.0 migration.” Then the new infrastructure starts landing: workflow upgrades from 3.0.1 to 3.0.26, graph memory rebuilt with 2,620 nodes and 1,614 edges. The old .claude/ directory – all 97 scripts, the bash hooks, the agent prompt templates – was backed up to .claude-bak/ and replaced wholesale.

Act 6: The Long Climb (March 2026)

808 commits. This is where Archiva went from “demo that sometimes works” to “system I would let an auditor use.”

The Scoreboard

Two different eval scales matter here, and I am going to be explicit about which is which (because mixing them up would make me look better than the data supports).

30-query evals (the early, fast checks):

| Sprint | Date  | ET1   | Semantic | Compound | Precision |
|--------|-------|-------|----------|----------|-----------|
| 11     | Mar 6 | 60.0% | 25.0%    | 54.5%    | 81.8%     |
| 17     | Mar 8 | 56.7% | –        | –        | –         |
| 17b    | Mar 8 | 60.0% | 25.0%    | 54.5%    | 81.8%     |
| 22     | Mar 9 | 66.7% | 62.5%    | 54.5%    | 81.8%     |

100-query evals (the real measure, available from Sprint 22 onward):

| Sprint     | Date   | ET1   | Semantic | Compound | Precision | Key Event                          |
|------------|--------|-------|----------|----------|-----------|------------------------------------|
| 22         | Mar 9  | 68.0% | 64.1%    | 69.0%    | 78.1%     | ms-marco reranker swap             |
| 30         | Mar 13 | ~62%  | 59.0%    | 58.6%    | 71.9%     | Baseline crash (re-ingestion)      |
| 42         | Mar 17 | 67.0% | 61.5%    | 58.6%    | 81.2%     | BUG-PRECISION-001 fix              |
| Post-fixes | Mar 21 | 85.0% | 82.1%    | 79.3%    | 93.8%     | 5 cumulative bug fixes             |
| Post-perf  | Mar 24 | 87.0% | 82.1%    | 79.3%    | 96.9%     | Entity extraction skip (8x faster) |

That climb from 33% (January) through 68% (Sprint 22) to 87% (post-performance) took three months and roughly 30 sprints. Each percentage point was a specific bug found, diagnosed, and fixed one at a time, to make sure each fix was the right one. There were no magic bullets – just grinding through it, while the workflow kept improving, with longer-running autonomous sessions producing well-validated functionality and steadily reducing the hands-on work required from me.

The Sprint 30 Crash: When Better Chunking Made Things Worse

Sprint 30 is the story I almost did not want to tell, because the mistake was so obvious in hindsight.

I improved the chunking strategy. Adaptive chunk sizing, 100-500 tokens, preserving section boundaries instead of cutting at arbitrary character counts. Objectively better. I re-ingested the entire corpus with the new chunking.

ET1 dropped from 68% to 62%. Six percentage points, gone.

The chunking was better. The ground truth was wrong. Every query in the eval suite had been mapped to specific chunks under the old layout. New chunks had different boundaries, different IDs, different content splits. A query that used to match chunk 47 now needed to match chunk 312, but the ground truth still said “chunk 47.” The eval was measuring whether my system returned chunks that no longer existed.

So we built a ground-truth regeneration script that uses word-boundary regex matching to find each expected passage in the new chunk layout. It auto-fixed 270 of the 300 queries. Of the rest, 22 needed manual review – their expected content had been split across chunk boundaries in the new layout, and no single chunk contained the full answer.

Your ground truth is coupled to your chunking strategy. Change one, update the other, or your eval results are measuring a fiction. I lost a week to this. The fix was not in the search algorithm. The fix was in the test harness.

The Sprint 17 Regression: Hubris, Meet Data

Sprint 17 is my favourite failure because it happened and was fixed in a single day (March 8), and because we should have seen it coming.

The reranker (BAAI/bge-reranker-base) was applying a double-sigmoid normalization that compressed all scores into a narrow range. We removed it. Mathematically correct – the reranker scores should flow through raw, preserving signal fidelity.

Accuracy dropped 3.3 percentage points immediately. Same-day eval confirmed it.

What we had not accounted for: the blend weights downstream were calibrated to the compressed score distribution. Remove the compression, and the reranker’s raw scores – which were already near-uniform around 0.50 – got amplified instead of dampened. The “better signal” was actually more noise, fed into a system tuned for the old noise profile.

The fix (also March 8, Sprint 17b) was not to restore the double-sigmoid. It was to reduce the reranker’s influence in the blend weights. Semantic blend went from 0.35/0.65 to 0.55/0.45. Less weight on a weak signal is better than normalizing a weak signal to look strong.
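The shape of that fix is just a weighted linear blend. The 0.55/0.45 split comes from the text above; the function itself is a hypothetical sketch, not Archiva's code:

```python
# Sketch: down-weighting a weak reranker signal instead of re-normalizing it.
# Weights mirror the 0.55/0.45 split described above; names are illustrative.

def blend(semantic_score: float, rerank_score: float,
          w_semantic: float = 0.55, w_rerank: float = 0.45) -> float:
    """Linear blend of semantic similarity and raw reranker score."""
    return w_semantic * semantic_score + w_rerank * rerank_score

# A near-uniform reranker score (~0.50) now nudges the ranking
# rather than dominating it:
assert abs(blend(0.8, 0.5) - 0.665) < 1e-9
```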

In a fusion pipeline, you cannot change one component’s output distribution without recalibrating everything downstream. Every stage trusts the statistical properties of the stage before it. Change those properties, and “improvement” becomes regression. We knew this in theory. Sprint 17 made sure I knew it in practice.

Five Bugs That Made 85% Possible

The jump from 67% (Sprint 42) to 85% was not one fix. It was five fixes applied in sequence over four days, each building on the last:

  1. BUG-EVAL-072: PCI-DSS acronyms in queries triggered precision/detail intent classification, which blocked the semantic rewriter from running at all. Semantic queries about PCI-DSS topics were being treated as exact-match lookups. The fix: exclude standard-name-only patterns from the semantic rewrite blocker.

  2. BUG-EVAL-071: The semantic rewriter was replacing query keywords with generic corpus vocabulary. “Key management” became “cryptographic operations.” The fix: preserve original query terms in the rewritten output, always.

  3. BUG-EVAL-074: Ground truth for one query was mapped to the wrong chunk. The eval said we were failing, but we were actually returning the right answer. (Even the eval can be wrong. Especially the eval can be wrong.)

  4. BUG-SEARCH-004: Domain-agnostic concept synonyms. “Hardening” should also match “configuration standards.” “Authentication” should also match “identity verification.” A synonym expansion table, queried at search time.

  5. BUG-SEARCH-005: FTS5 queries were crashing on commas and colons in user input. Not returning wrong results – crashing silently and falling back to vector-only search. Twenty-two queries were affected. The fix: sanitize punctuation before constructing FTS5 query strings. The lesson: don't fail open when it really matters.
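The FTS5 sanitization from fix 5 can be sketched as a small pre-processing step. The exact character policy here (keep word characters, whitespace, and dots; replace everything else) is an assumption, not Archiva's actual code:

```python
import re

# Sketch of FTS5 query sanitization (BUG-SEARCH-005 style): strip the
# punctuation that FTS5 MATCH treats as query syntax before building the
# query string, so user input cannot crash the keyword search path.

def sanitize_fts5(user_query: str) -> str:
    """Replace non-word punctuation with spaces, then collapse whitespace."""
    cleaned = re.sub(r"[^\w\s.]", " ", user_query)
    return " ".join(cleaned.split())

assert sanitize_fts5("Requirement 3.2.1: keys, certs (and HSMs)") == \
    "Requirement 3.2.1 keys certs and HSMs"
```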

None of these were glamorous. None would make a good conference talk. All of them are mistakes a more experienced engineer would probably have known to prevent. But together they moved the needle 18 percentage points.

Lesson learned: RAG accuracy is not about the algorithm or model you choose. It is about the plumbing. Natural language is a source-dependent construct, so your plumbing has to enrich queries and avoid hidden assumptions when your model lacks the domain experience to compensate.

The 300-Query Reality Check

After hitting 85% on the 100-query eval, I ran the full 300-query suite. 82.6% ET1, 92.6% Top-3, 96.0% Top-5, 98.7% Top-10.

The drop from 87% (100q) to the low-80s (300q) is expected – larger eval suites include harder queries and more edge cases. But here is the thing I have to be honest about: the 300-query ET1 has hovered between 80.9% and 82.6% across 11 runs. It does not move in a meaningful direction. Whether that is a plateau I can climb past or a ceiling imposed by the architecture depends on whether you think a better embedding model would change the physics of the problem.

Four queries persistently MISS. pci_q126 asks about “authentication” and retrieves “Sensitive Authentication Data” (PCI Requirement 3) instead of “User Authentication” (Requirement 8) – the same word means different things in different parts of the standard. pci_q172 flips between TOP-10 and MISS across runs because of LLM rewrite non-determinism. About 12 queries show this kind of tier-boundary flipping: ET1 varies 81.3-82.6% for the exact same 300 queries depending on how the semantic rewriter phrases things. That variance is a fundamental property of any pipeline that includes an LLM, and I do not yet have a good answer for it, except maybe reaching out to the Council to suggest wording improvements, on the premise that AI will always struggle if your input data isn’t clean and well organized.

Performance: The Embarrassing Fix

For most of March, the 300-query eval took 6 to 16 hours to complete. Not because the queries were slow, but because I was spawning a new Python subprocess for every LLM call. Every query rewrite, every entity extraction, every answer generation: fork a process, load Python, import the modules, make the call, exit. My initial premise when I designed my tools was that modularizing common components would be essential to building more complex solutions. Lesson learned: as any better-trained engineer or solution architect would have told me, abstraction = tradeoffs.

The fix was exactly what you think it was. Stop forking. Import the module directly. Call the function.

  • Before: p50 latency ~10,000ms per query

  • After: p50 latency 1,251ms per query

8x faster. The 300-query eval now completes in 30 minutes. And accuracy went up 2 percentage points. The entity extraction step – the one adding 5-15 seconds per query via subprocess – was not just slow. It was adding noise to the retrieval signal. Removing it made the system both faster and more accurate.

If an expensive preprocessing stage does not measurably improve retrieval, it is not enrichment. It is waste. Measure before you assume.
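The shape of the fix can be sketched in a few lines. The module name `llm_caller` and its `complete` function are illustrative stand-ins, not the real CLI:

```python
import importlib

# Before (the slow path): fork a Python subprocess per LLM call, e.g.
#   subprocess.run(["python", "llm_caller.py", "--prompt", prompt], ...)
# paying interpreter startup + module imports on every single query.
#
# After (the fix): import once, then call the function in-process.

_llm = None

def call_llm(prompt: str):
    """Lazy-import the caller module once, then reuse it for every call."""
    global _llm
    if _llm is None:
        _llm = importlib.import_module("llm_caller")  # one-time import cost
    return _llm.complete(prompt)
```

The modular boundary survives (everything still goes through one caller); only the process boundary is gone.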

Act 7: What I Learned About Claude

The search system is half the story. The other half is the thing building the search system.

The Context Window Is a Pressure Gauge

I have written about this before, but it bears repeating because I watched it happen repeatedly during Archiva development. Around Sprint 8, I noticed that builder agents were producing thinner test coverage toward the end of long sessions. Reviews got less thorough. Commit messages got vaguer. The quality degradation was gradual enough to miss if you were not tracking it.

The V3.0 workflow system’s response was architectural: keep each work unit small (one task, one commit, one measurable outcome) and move validation into deterministic scripts that do not have context windows. The pre-commit hook does not care how long the conversation has been. It checks the same rules at token 5,000 and token 150,000.
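A deterministic pre-commit check of that kind is small. The patterns below are illustrative examples of the idea, not the project's actual hook:

```python
import re

# Sketch of a deterministic pre-commit secret check: the same rules apply
# whether the session is at token 5,000 or token 150,000.

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id shape
    re.compile(r"-----BEGIN (RSA )?PRIVATE KEY-----"),  # PEM private key header
]

def check_diff(diff_text: str) -> list[str]:
    """Return the patterns matched in a staged diff; empty list means pass."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(diff_text)]

# A hook would read the staged diff and exit non-zero on any match.
assert check_diff("key = 'AKIAABCDEFGHIJKLMNOP'") == ["AKIA[0-9A-Z]{16}"]
```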

The broader principle is not really about Claude specifically. It is about any system where quality is a function of resource pressure. Humans get tired. LLMs get compressed. Both need guardrails that are independent of their current capacity.

The Two-Commit Pattern

One of the most painful bugs in the project’s history was WU-656. Implementation committed at 11:04 AM. Plan review generated at 11:14 AM. The review caught P0 issues – nine minutes after the code was already archived.

The fix was the two-commit pattern. Commit 1: work unit definition and plan reviews (before implementation starts). Commit 2: implementation code and output reviews (after implementation, only if reviews pass). This creates an unforgeable timeline in git. You can prove that reviews ran before code was committed.

Is this overhead? Yes. Essential bureaucracy is a necessary cost of good business. But WU-656 proved that post-hoc reviews are security theater. If the review cannot block the commit, the review is a report, not a gate.
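Because the timeline lives in git, it can be audited mechanically. A sketch of that audit, assuming hypothetical "plan" / "impl" markers in commit messages (the real workflow's commit format may differ):

```python
import subprocess

# Sketch: verifying the two-commit timeline straight from git history.
# The "<WU> plan" / "<WU> impl" message markers are assumptions for
# illustration, not the workflow's actual commit convention.

def commit_time(grep: str) -> int:
    """Unix timestamp of the newest commit whose message matches grep."""
    out = subprocess.run(
        ["git", "log", "-1", "--grep", grep, "--format=%ct"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def reviews_preceded_code(wu_id: str) -> bool:
    """True if the plan/review commit landed before the implementation."""
    return commit_time(f"{wu_id} plan") < commit_time(f"{wu_id} impl")
```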

Act 8: What I Learned About RAG

The Retrieval Pyramid

After nine months, here is my mental model for RAG accuracy, from most to least impactful:

  1. Chunking strategy (how you split documents determines what can be found)

  2. Evaluation framework (you cannot improve what you cannot measure)

  3. Query understanding (intent classification, synonym expansion, query rewriting)

  4. Fusion and ranking (how you combine and order candidates)

  5. Embedding model quality (necessary but not sufficient)

  6. Answer generation (the LLM at the end matters least for retrieval accuracy)

Most RAG tutorials I’ve seen start at the bottom of this list, because simple RAG is just a vector DB. They focus on which LLM to use for generation, which embedding model to pick, maybe which vector database. Those things matter, but they are not where the bugs are. The bugs are in chunking, intent classification, and FTS5 punctuation handling. Always the plumbing. Vectors alone cannot give you high accuracy on natural-language queries. Accuracy comes from enriching the context so the system “understands” how to ask for the best chunks.
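One piece of that plumbing, the BUG-SEARCH-004 synonym table queried at search time, can be sketched like this. The two entries come from the examples in the text; the full Archiva table is larger, and the function shape is an assumption:

```python
# Sketch of domain-agnostic synonym expansion (BUG-SEARCH-004 style):
# append synonym phrases so keyword search can match either wording.

SYNONYMS = {
    "hardening": ["configuration standards"],
    "authentication": ["identity verification"],
}

def expand(query: str) -> str:
    """Append synonyms for any table term found in the query."""
    extras = [syn for term, syns in SYNONYMS.items()
              if term in query.lower() for syn in syns]
    return query if not extras else query + " " + " ".join(extras)

assert expand("server hardening checklist") == \
    "server hardening checklist configuration standards"
```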

Intent Classification Was the Biggest Single Win

The jump from the 30-query era (semantic at 25%) to Sprint 22 (semantic at 64.1% on the 100-query eval) came from two changes: swapping the reranker to ms-marco-MiniLM-L-6-v2 and enabling intent-aware pool sizing.

A “precision” query (“What does Requirement 8.3.1 say?”) needs a small candidate pool and exact-match boosting. A “semantic” query (“How does PCI-DSS address encryption at rest?”) needs a large candidate pool and semantic similarity scoring. Without intent classification, every query gets the same treatment. With it, each query type gets the retrieval strategy it needs.

On the 100-query eval: precision went from 78.1% (Sprint 22) to 93.8% (post-fixes). Semantic went from 64.1% to 82.1%. Same documents. Same embeddings. Different retrieval strategy per query type.
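The mechanism is a small dispatch table. The pool sizes echo Appendix C; the classifier below is a toy stand-in for illustration, not Archiva's actual intent model:

```python
import re

# Sketch of intent-aware pool sizing: each query type gets its own
# retrieval strategy. The heuristic classifier is a toy stand-in.

STRATEGIES = {
    "precision": {"pool_size": 50, "exact_match_boost": True},
    "semantic":  {"pool_size": 75, "exact_match_boost": False},
}

def classify_intent(query: str) -> str:
    """Toy heuristic: numbered-requirement lookups are precision queries."""
    return "precision" if re.search(r"\b\d+(\.\d+)+\b", query) else "semantic"

def strategy_for(query: str) -> dict:
    return STRATEGIES[classify_intent(query)]

assert strategy_for("What does Requirement 8.3.1 say?")["pool_size"] == 50
assert strategy_for("How does PCI-DSS address encryption at rest?")["pool_size"] == 75
```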

Fusion Pipelines Are Coupled Systems

Reciprocal rank fusion is supposed to be elegant. Combine ranks from vector search and keyword search with a simple formula, done. In practice, every parameter is coupled to every other parameter. Change the reranker model, the blend weights are wrong. Change the pool size, the fusion scores shift. Remove a normalization step, everything downstream breaks. Sprint 17 proved this (remove one sigmoid, lose 3.3 percentage points), and BUG-EVAL-071 proved the corollary: even the query rewriter’s choice of synonyms propagates through the entire pipeline. “Key management” and “cryptographic operations” retrieve different candidate sets, which produce different fusion scores, which produce different rankings. One word substitution in the query, different answer at the end. In a fusion pipeline, nothing is local.
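The "simple formula" itself really is elegant. A sketch, using the conventional k=60 constant (an assumption here, not necessarily Archiva's value):

```python
from collections import defaultdict

# Sketch of reciprocal rank fusion: combine ranked candidate lists from
# vector search and keyword search. Each list contributes 1/(k + rank).

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked candidate lists; highest fused score first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector = ["c47", "c12", "c9"]
bm25   = ["c12", "c47", "c88"]
assert rrf([vector, bm25])[0] in ("c47", "c12")  # the two lists' top picks tie
```

The elegance is also the trap: every fused score depends on every input ranking, so shifting one retriever's score distribution reshuffles the whole output.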

The Current Scorecard

Archiva v3.0.26. March 2026.

| Metric                  | Value                            |
|-------------------------|----------------------------------|
| Effective Top-1 (100q)  | 87.0%                            |
| Effective Top-1 (300q)  | 80.9%-82.6% (across 11 runs)     |
| Top-3 (300q)            | 92.6%                            |
| Top-10 (300q)           | 98.7%                            |
| p50 latency             | 1,251ms                          |
| p95 latency             | 7,099ms                          |
| Total commits           | 4,582                            |
| Work units executed     | 1,801                            |
| Bugs archived           | 249                              |
| Sprint plans completed  | 64                               |
| Claude Code sessions    | 177 (across all related projects) |
| Persistent MISS queries | 4 out of 300                     |

Four queries MISS. One is a genuine homograph problem (the word “authentication” means different things in different PCI requirements). One flips with LLM non-determinism. Two are at the embedding model’s semantic boundary. They are not bugs to fix. They are the current limits of the architecture.

The 82.6% number on 300 queries has not moved in 11 runs. I do not know yet whether that is a plateau I can climb past with a better embedding model, or a ceiling imposed by the fundamental approach. Acknowledging what you do not know is (I fortunately learned long ago during my Philosophy days) more useful than pretending you have a plan or answer for everything.

What Comes Next

Archiva is preparing for open-source release. Multi-provider support is in place (LM Studio, OpenAI, Anthropic). The migration script works. The OSS contributor docs are written. The UAT bugs are fixed.

The 300-query plateau is the open question. The hypothesis is that a larger embedding model – something that can distinguish “system hardening” from “configuration standards” without a synonym table – would move the ceiling. But hypotheses are cheap. Evals are what matter. And I have learned (at some cost) not to trust hypotheses that have not survived a 300-query run.

The workflow system – the V3.0 multi-agent orchestration with five-perspective reviews and deterministic validation, now called PrescientFlow – has been the quieter success. 1,801 work units executed with mechanical quality gates. It turned a solo dreamer with plenty of SDLC experience, who just wanted to vibe-code with an AI assistant, into something that operates as a small engineering team, complete with the bureaucracy that necessarily implies. We all have our biases and blind spots. No one entity can know about and manage all the risks…yet.

But here is what I keep coming back to, nine months in: the most powerful thing about building with AI is not the speed. It is the ability to enforce discipline that no human team could sustain. Five independent reviews on every code change, every time. No reviewer fatigue. No social pressure to approve. No Friday afternoon shortcuts. The irony is that the most useful application of artificial intelligence in this project has been artificial discipline.

Whether that is worth 4,582 commits and 177 Claude Code sessions to arrive at – ask me again after the open-source release.

Technical Appendix A: Commit Velocity by Month

| Month    | Commits | Avg/Day | Key Theme                                                |
|----------|---------|---------|----------------------------------------------------------|
| Sep 2025 | 52      | 5.2     | Project scaffolding, 12-module MVP                       |
| Oct 2025 | 388     | 12.5    | Architecture analysis, CLI audit, “Archiva” named        |
| Nov 2025 | 436     | 14.5    | Module restructuring, prompt management                  |
| Dec 2025 | 1,841   | 59.4    | Sprint automation peak, WU-050 to WU-500+                |
| Jan 2026 | 472     | 15.2    | Search quality, threshold calibration, WU-500 to WU-724  |
| Feb 2026 | 585     | 20.9    | V3.0 migration, graph memory rebuild, WU-1406+ begins    |
| Mar 2026 | 808     | 26.1    | Eval maturation, 33%→87% accuracy, performance, OSS prep |

Technical Appendix B: Evaluation History

Pre-V3.0 Era

| Date    | Scale | ET1  | Notes                                                          |
|---------|-------|------|----------------------------------------------------------------|
| Jan 5   | 30q   | 33%  | WU-616: first measured baseline (“CRITICAL REGRESSION”)        |
| ~Jan 26 | 30q   | ~60% | Pre-V3.0 improvements (threshold calibration, entity extraction) |

30-Query Evals (V3.0 Era)

| Sprint | Date  | ET1   | Semantic | Compound | Precision |
|--------|-------|-------|----------|----------|-----------|
| 11     | Mar 6 | 60.0% | 25.0%    | 54.5%    | 81.8%     |
| 14     | Mar 6 | 60.0% | 37.5%    | 54.5%    | 81.8%     |
| 17     | Mar 8 | 56.7% | –        | –        | –         |
| 17b    | Mar 8 | 60.0% | 25.0%    | 54.5%    | 81.8%     |
| 22     | Mar 9 | 66.7% | 62.5%    | 54.5%    | 81.8%     |

100-Query Evals (V3.0 Era)

| Sprint          | Date   | ET1   | Semantic | Compound | Precision | MISS | Key Change                    |
|-----------------|--------|-------|----------|----------|-----------|------|-------------------------------|
| 22              | Mar 9  | 68.0% | 64.1%    | 69.0%    | 78.1%     | ~14  | ms-marco reranker swap        |
| 30              | Mar 13 | ~62%  | 59.0%    | 58.6%    | 71.9%     | ~23  | Re-ingestion baseline crash   |
| 35              | Mar 15 | 65.0% | 61.5%    | 55.2%    | 78.1%     | ~18  | Semantic prototype expansion  |
| 42              | Mar 17 | 67.0% | 61.5%    | 58.6%    | 81.2%     | 17   | BUG-PRECISION-001 fix         |
| Post-fixes      | Mar 21 | 85.0% | 82.1%    | 79.3%    | 93.8%     | 0    | 5 cumulative bug fixes        |
| Post-asymmetric | Mar 22 | 86.0% | 82.1%    | 79.3%    | 96.9%     | 0    | Asymmetric query handling     |
| Post-perf       | Mar 24 | 87.0% | 82.1%    | 79.3%    | 96.9%     | 0    | Entity extraction skip        |

300-Query Evals

| Run | Date   | ET1   | Top-3 | Top-10 | MISS | Errors | p50   |
|-----|--------|-------|-------|--------|------|--------|-------|
| 1   | Mar 19 | 81.8% | 91.3% | 98.6%  | 4    | 14     | 9.3s  |
| 2   | Mar 19 | 81.9% | 91.9% | 98.3%  | 5    | 2      | 9.7s  |
| 5   | Mar 20 | 82.6% | 92.6% | 98.7%  | 4    | 2      | –     |
| 6   | Mar 21 | 81.6% | 92.6% | 99.0%  | 3    | 1      | –     |
| 11  | Mar 24 | 82.6% | 92.6% | 98.7%  | 4    | 1      | 1.25s |

Note: ET1 varies 81.3-82.6% across runs for identical queries due to LLM rewrite non-determinism (~12 queries show tier-boundary flipping).

Technical Appendix C: Architecture

Retrieval Pipeline

Query → Intent Classification → [Vector Search + BM25 Search] → Score Fusion (RRF) → Cross-Encoder Reranking → Results

Query-side preprocessing along the way: synonym expansion, query sanitization, template rewriting.

Key Parameters

| Parameter                  | Value                     | Why                              |
|----------------------------|---------------------------|----------------------------------|
| Embedding model            | snowflake-arctic-embed    | Consistency across ingest/search |
| Reranker                   | ms-marco-MiniLM-L-6-v2    | Fast, good recall                |
| Fusion method              | Reciprocal Rank Fusion    | Scale-invariant                  |
| Chunk size                 | Adaptive (100-500 tokens) | Preserves section boundaries     |
| Candidate pool (semantic)  | 75                        | Larger pool for broad queries    |
| Candidate pool (precision) | 50                        | Smaller pool, exact-match boosted |

Technical Appendix D: Session & Work Unit Metrics

| Metric                        | Value                              |
|-------------------------------|------------------------------------|
| Total commits                 | 4,582                              |
| Work units completed          | 1,801                              |
| Highest WU ID                 | WU-1764                            |
| Bugs archived / open          | 249 / 20                           |
| Sprint plans (V3.0)           | 64                                 |
| Review agents per WU          | 5                                  |
| Claude Code sessions (total)  | 177                                |
| Archiva sessions              | 90                                 |
| Workflow dev sessions         | 60                                 |
| LLM caller sessions           | 26                                 |
| Sprint validation log entries | 1,534 (Dec 15 2025 - Jan 26 2026)  |
| Pre-V3.0 WU range             | WU-076 to WU-724                   |
| V3.0 WU range                 | WU-1406 to WU-1764                 |

Validation Report

Methodology: All quantitative claims sourced from git log, sprint-validation.jsonl, Claude Code session files, work unit frontmatter, and evaluation run outputs. The 33% baseline is from WU-616 (git commit 587f5cbc, January 5, 2026). Monthly commit counts verified via git log --since/--until. V2.9 infrastructure sizes verified via wc -l, wc -c, and find against .claude-bak/. Eval scale (30q vs 100q vs 300q) is explicitly labeled for every figure. 300-query ET1 reported as range (80.9%-82.6%), not single value.

Sources:

  • Git history: git log --oneline | wc -l = 4,582 commits

  • Sprint validation log: .claude-bak/logs/sprint-validation.jsonl (1,534 entries)

  • Claude Code sessions: ~/.claude/projects/ (177 sessions across 4 projects)

  • Work units: .claude/work-units/completed/ (1,801 files)

  • Bug archive: .claude/bugs/archive/ (249 files)

  • Eval data: memory/eval_history.md, memory/eval_history_300q.md

  • 33% baseline: commit 587f5cbc (WU-616, Jan 5, 2026)