Apr 1, 2026

Building Archiva: Nine Months on the RAG Frontier

On January 5, 2026, I ran my first proper evaluation against a 30-query test suite and got back a number that stopped me cold: 33% Top-1 accuracy. The commit message reads “CRITICAL REGRESSION (33% Top-1 vs 80% target).” One in three queries returning the correct answer. For a system I had been building for four months.

That number was the most useful thing the project had produced to that point. Not because it was good. Because it was real. Everything before it had been intuition and demo-quality spot checks. Everything after it would be measured.

Nine months, 4,582 commits, 1,801 work units, and 249 archived bugs later, that 33% became 87%. This is the story of how, and what I got wrong along the way.

Act 1: Before We Knew What We Didn’t Know (June-September 2025)

The idea started taking shape around mid-2025. I spent a couple of months researching RAG architectures, reading papers on hybrid retrieval, experimenting with embedding models. By the time I created the repository in late September, I had convinced myself I understood the problem. I did not. But I had enough momentum to start building, which turned out to matter more.

The first commit landed on September 21, 2025 – fourteen commits in a single day, twelve modules scaffolded between 12:53 PM and 4:55 PM. document_assistant, document_categorizer, document_knowledge_graph, query_classifier, pdf_embedder, docx_embedder, hybrid_search, fts_indexer, a knowledge graph builder, an entity extractor, a training platform.

The idea was straightforward: take a pile of PCI-DSS compliance documents, chunk them, embed them, and let users search with natural language. Vector search plus BM25 keyword matching, combine the scores, return results.

It was a lot of infrastructure for a search box. But nobody told me (and I would not have believed them if they had) that the search box is the simplest part. Everything between the user pressing Enter and seeing results – the query rewriting, the intent classification, the candidate retrieval, the score fusion, the reranking, the chunk expansion – that is where the engineering lives. And that is where the bugs live too.

52 commits by the end of September. An end-to-end integration by the 27th. A client orchestrator backend by the 29th. It felt like progress.

Act 2: The Rewrite (October-November 2025)

October was humbling. Besides Opus 4.5 launching and materially changing how I worked (see The Guardrails I Built to Stop AI From Breaking My Code (And Why I Needed Them)), the month produced 388 commits, nearly all of them architecture analysis and remediation. I ran the codebase through a multi-batch architecture review – eight batches, phases 1 through 40, with three-agent review panels. The results were not kind. Schema inconsistencies between modules. The embedder format needed migration to a unified schema. The CLI needed significant remediation. That was the price of building with raw Claude while skipping the PRDs that would have guided it – I was on a quest (multiple quests, actually) to build an abstraction layer that would help “vibe coders” produce better outcomes.

This is when “Archiva” got its name. October 24, buried in an OSS readiness report, the project stopped being “document-processing-modules” intended to accelerate my most common workflows in compliance and risk assessment and became something with an identity. This is also when the first work units appeared – not yet the numbered format that would come later, but named identifiers: WU-AUDIT-001, WU-CLI-005 through WU-CLI-013. My Claude workflow was finding its shape.

November continued the restructuring. 436 more commits. document_assistant became output_orchestrator. hybrid_search moved to retrieval_orchestrator. The prompt management system got centralised. The OutputFormatter infrastructure landed. None of this was visible to users. All of it was necessary.

This is the phase that separates projects that ship from projects that demo well once. And I will be honest: it was tedious. Two months of renaming modules and migrating schemas does not make for exciting commit messages. But a system you are going to iterate on for six more months needs consistent naming and centralised configuration. The embedding mismatch bug proved that (more on this shortly).

The work units were still using named IDs – WU-SEARCH-001, WU-PROMPT-xxx, WU-014B-xxx. The formal sprint system did not exist yet. I was managing work through ad-hoc planning and manual tracking. It worked. Until December, when it very much did not. Along the way the workflow began to accelerate (and fail faster) – see From "Got an error. Please investigate" to Building a Production App in 2.5 Hours: My AI Engineering Evolution – and gained memory – see How I Gave My AI Agents a Memory—And Built a Full-Stack App in 1 Hour – the essential workflow component that made completing Archiva possible, given its high complexity and its need to keep what goes into one piece of code matched with what comes out of another.

Act 3: The December Explosion (December 2025)

1,841 commits in 31 days. Nearly 60 per day. Four times the volume of any other month. Most nights I worked after hours, laptop on my lap, watching TV with the family, sitting through the long autonomous-workflow wait states until the system needed me to kick off the next stage.

Two things happened simultaneously: the sprint validation system went live on December 15, and work units switched to numeric IDs starting at WU-050. The sprint-validation.jsonl log tells the story – 1,534 entries spanning December 15 through January 26, capturing every work unit execution from WU-076 to WU-724.

The early entries are full of “No commit hash found in agent output” warnings and “Expected file missing: TBD” messages. The validation system was bootstrapping itself while validating real work. By late January, those same entries show model metadata, execution durations, fallback chains, and quality scores. The workflow learned to run while it walked.

The Embedding Mismatch

One of the earliest bugs was also one of the most instructive. It exposed a weakness that has to be resolved in one of two ways: an experienced, diligent human describing the goal in simple terms, or a workflow that can expand a problem statement into the simple, well-sized chunks that make AI decisions efficient, accurate, and less non-deterministic. So what happened? Because each piece was built separately, without giving Claude much shared context, the ingestion pipeline was using nomic-embed-text to create document embeddings while the search pipeline was using snowflake-arctic-embed to create query embeddings. Two different models. Incompatible vector spaces.

The system “worked.” It returned results. They were just wrong. Not spectacularly wrong – wrong enough to feel plausible, which is worse. If your search returns garbage, you know it is broken. If it returns plausible-but-incorrect passages, human biases lead us to trust it, and bad decisions follow (which is why an AIGP – or at least effective AI governance – is essential for success, not just for passing compliance audits).

The fix was one line in a config file. The damage was weeks of building on a broken foundation without knowing it. Architecture lesson learned: centralise your model configuration, or prepare to debug phantom failures that look like algorithm problems but are actually config problems. And of course, make sure your AI coder can see all the dependencies.
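The lesson is small enough to show in code. A minimal sketch of centralised model configuration – the class, function names, and dimension value below are illustrative assumptions, not Archiva's actual layout:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    model_name: str
    dimensions: int  # illustrative value below, not the model's actual size

# Single source of truth, imported by BOTH the ingest and search pipelines.
EMBEDDING = EmbeddingConfig(model_name="snowflake-arctic-embed", dimensions=768)

def _call_model(model: str, text: str) -> list[float]:
    # Placeholder: a real implementation would call the embedding server.
    return [0.0] * EMBEDDING.dimensions

def embed_for_ingest(text: str) -> list[float]:
    return _call_model(EMBEDDING.model_name, text)

def embed_for_query(text: str) -> list[float]:
    # Same config object: the two vector spaces cannot diverge silently.
    return _call_model(EMBEDDING.model_name, text)
```

With both code paths reading one constant, the nomic-vs-snowflake mismatch becomes impossible to introduce by accident; swapping models is a one-line change that applies everywhere at once.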

Act 4: The 33% Wake-Up Call (January 2026)

January brought 472 commits and a steady progression from WU-500 through WU-724. Threshold calibration, entity extraction, compound query handlers. The kind of work that feels productive.

Then, on January 5, WU-616: the first proper 30-query evaluation. 33% Top-1 accuracy. The commit message includes “CRITICAL REGRESSION” in caps, which is git for screaming.

The number was not a regression from a known good state. It was the first time I had measured the state at all. The “regression” was from my assumption about how well the system worked to the reality of how well it actually worked. Those are different things, and the gap between them was 47 percentage points.

This was the moment the evaluation framework went from “nice to have” to “the only thing that matters.” I had been building features for four months. I should have been building measurements. It is also when I added BUG records to the memory and required the planner to consult BUGs as well as the graph DB when building work unit context – otherwise Claude falls into loops, experimenting and often repeating itself while trying to work out the real root cause of a complex failure.

The sprint validation log ends on January 26 with WU-724 – “Tier classification confidence thresholds align with snowflake-arctic score ranges.” That was the last work unit under the old workflow system.

By January the workflow had reached version 2.9.4, its eighth major iteration in four months. I had iterated the workflow almost as aggressively as the search system, because it was the only way to get effective, consistent progress toward a complex target.

Act 5: Re-platforming the Factory (February 2026)

Eight Versions in Five Months

A brief archaeology, because the V3.0 story does not make sense without it.

V2.2 (late September 2025) introduced structured work units and seven-agent reviews. V2.3 added hard schema validation. V2.4 added sprint orchestration. V2.6 brought an embedding-based memory system. V2.8 went lean – cut the script count from 45 to 12, trimmed reviews from seven agents to five. V2.9 added graph memory, the three-tier hierarchy (Epic/Story/Task), and a dedicated Planner agent.

By the time V2.9.4 shipped in late December, the workflow had grown into a respectable custom framework: 136 files in the scripts directory (95 Python, plus shell scripts and supporting files) totalling about 2.6MB, a 241KB sprint orchestrator (sprint_runner.py), agent prompt templates stored as markdown files, and – critically – a monolithic bash pre-commit hook of nearly 1,000 lines that served as the primary deterministic enforcement gate.

That pre-commit hook was the keystone. A thousand lines of bash parsing work unit IDs from commit messages, looking up work unit files to determine their tier, checking for review files in .claude/agent-reviews/, validating config version fields, verifying LM Studio availability, and enforcing tier-specific review requirements. Everything ran through git commit. If you could not commit, the gate held. If you could commit, everything was assumed to be fine.

A git-commit-oriented gate only catches problems at commit time. It cannot prevent an LLM from generating bad content in the first place. It cannot validate a plan before implementation begins. It cannot check whether a vision document has file paths before Claude spends an hour implementing it. The gate was at the end of the pipeline, not along the way.

BUG-016 was the proof. A work unit that “fixed” a threading bug had actually reverted a previous fix. The pre-commit hook checked that review files existed – and they did. The reviews said “looks good.” The regression went undetected for 51 days across 19 work units. The gate passed because the gate only checked structure, not semantics. And the reviews passed because the same LLM that wrote the code was reviewing the code, with no separation of duties enforced by the infrastructure.

What V3.0 Actually Changed

V3.0 did not invent work units, reviews, or sprints. Those had existed since V2.2. What V3.0 did was re-platform the enforcement mechanism from custom scripts and a monolithic git hook to Claude Code’s native primitives.

The mapping:

| V2.9 (Custom) | V3.0 (Native) |
| --- | --- |
| 136 scripts (~2.6MB) | ~37 focused Python validators |
| sprint_runner.py (241KB orchestrator) | /sprint skill + Task tool |
| Markdown agent templates in ~/.claude/templates/ | Skills in .claude/skills/ (19 skills) |
| Bash pre-commit hook (~1,000 lines) | Claude hooks + lean git hooks + orchestrate.py |
| Custom memory scripts | MCP graph-memory server |

The philosophical shift was deeper than the tooling change. V2.9’s enforcement was git-centric: “can you commit?” V3.0’s enforcement is validator-centric: “did the deterministic check pass?” Each validator is a focused Python script under 50 lines – validate_vision.py, validate_planner.py, validate_sprint.py – coordinated by orchestrate.py in validate-remediate loops. If validation fails, the orchestrator formats the error and sends it back to Claude for remediation. The LLM never sees the validation logic. The validation logic never trusts the LLM.

The deeper principle is one I keep coming back to: deterministic validation of non-deterministic outputs. Claude generates code; a Python script validates it. Claude proposes a plan; a schema checker verifies the required fields. Claude writes a commit; a pre-commit hook checks for scope creep and secret exposure. The LLM creates. The script validates. The script does not have a context window. The script does not cut corners at 100K tokens. The script does not get talked out of its rules.
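That validate-remediate loop is simple enough to sketch. The structure below is illustrative – the function names, artifact shape, and example validator are assumptions, not orchestrate.py's real API:

```python
from typing import Callable

# A validator takes an artifact and returns (passed, error message).
Validator = Callable[[dict], tuple[bool, str]]

def validate_remediate(artifact: dict, validators: list[Validator],
                       remediate: Callable[[dict, list[str]], dict],
                       max_rounds: int = 3) -> bool:
    """Loop until every deterministic check passes or rounds run out."""
    for _ in range(max_rounds):
        errors = [msg for check in validators
                  for ok, msg in [check(artifact)] if not ok]
        if not errors:
            return True                         # all checks green: gate opens
        artifact = remediate(artifact, errors)  # LLM gets another attempt
    return False                                # gate stays shut

# Example validator: a plan must name concrete file paths.
def has_file_paths(plan: dict) -> tuple[bool, str]:
    ok = bool(plan.get("files"))
    return ok, "" if ok else "plan lists no file paths"
```

The key property: the validators are plain deterministic functions, so the LLM never sees their logic and cannot talk its way past them.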

V3.0 also reinstated the full five-agent review gate at all tiers (Vision, Scope, Design, Testing, Tattle-Tale), enforced by policy after BUG-016. V2.9 had optimised reviews down to “builder-only” at the task tier – faster, yes, but that optimisation is what let the 51-day regression through. Bureaucracy must exist to manage chaotic good/neutral/evil actors in any system, right?

The Migration

The V3.0 workflow was built in its own project – improvingclaudeworkflow-v3.0 – across 60 Claude Code sessions. Meanwhile, the tools/llm-caller-cli project consumed another 26 sessions for the shared LLM calling infrastructure. 177 Claude Code sessions total across all related projects. (Yes, I counted.)

The gap in Archiva’s work unit IDs tells the story of where my attention went: WU-724 (January 26) jumps to WU-1406 (February 16). Those 680 missing IDs were consumed by the workflow system’s own development. Building the factory before resuming production.

On February 16, the commit message reads: “Save work before V3.0 migration.” Then the new infrastructure starts landing: workflow upgrades from 3.0.1 to 3.0.26, graph memory rebuilt with 2,620 nodes and 1,614 edges. The old .claude/ directory – all 97 scripts, the bash hooks, the agent prompt templates – was backed up to .claude-bak/ and replaced wholesale.

Act 6: The Long Climb (March 2026)

808 commits. This is where Archiva went from “demo that sometimes works” to “system I would let an auditor use.”

The Scoreboard

Two different eval scales matter here, and I am going to be explicit about which is which (because mixing them up would make me look better than the data supports).

30-query evals (the early, fast checks):

| Sprint | Date | ET1 | Semantic | Compound | Precision |
| --- | --- | --- | --- | --- | --- |
| 11 | Mar 6 | 60.0% | 25.0% | 54.5% | 81.8% |
| 17 | Mar 8 | 56.7% | | | |
| 17b | Mar 8 | 60.0% | 25.0% | 54.5% | 81.8% |
| 22 | Mar 9 | 66.7% | 62.5% | 54.5% | 81.8% |

100-query evals (the real measure, available from Sprint 22 onward):

| Sprint | Date | ET1 | Semantic | Compound | Precision | Key Event |
| --- | --- | --- | --- | --- | --- | --- |
| 22 | Mar 9 | 68.0% | 64.1% | 69.0% | 78.1% | ms-marco reranker swap |
| 30 | Mar 13 | ~62% | 59.0% | 58.6% | 71.9% | Baseline crash (re-ingestion) |
| 42 | Mar 17 | 67.0% | 61.5% | 58.6% | 81.2% | BUG-PRECISION-001 fix |
| Post-fixes | Mar 21 | 85.0% | 82.1% | 79.3% | 93.8% | 5 cumulative bug fixes |
| Post-perf | Mar 24 | 87.0% | 82.1% | 79.3% | 96.9% | Entity extraction skip (8x faster) |

That climb from 33% (January) through 68% (Sprint 22) to 87% (post-performance) took three months and roughly 30 sprints. Each percentage point was a specific bug found, diagnosed, and fixed one at a time, to make sure each fix was the right fix. There were no magic bullets – just grinding through, while the workflow kept improving and reducing the hands-on effort required from me, with longer-running autonomous sessions producing well-validated functionality.

The Sprint 30 Crash: When Better Chunking Made Things Worse

Sprint 30 is the story I almost did not want to tell, because the mistake was so obvious in hindsight.

I improved the chunking strategy. Adaptive chunk sizing, 100-500 tokens, preserving section boundaries instead of cutting at arbitrary character counts. Objectively better. I re-ingested the entire corpus with the new chunking.

ET1 dropped from 68% to 62%. Six percentage points, gone.

The chunking was better. The ground truth was wrong. Every query in the eval suite had been mapped to specific chunks under the old layout. New chunks had different boundaries, different IDs, different content splits. A query that used to match chunk 47 now needed to match chunk 312, but the ground truth still said “chunk 47.” The eval was measuring whether my system returned chunks that no longer existed.

So we built a ground truth regeneration script that uses word-boundary regex matching to find each expected passage in the new chunk layout. It auto-fixed 270 of the 300 queries; 22 more needed manual review – their expected content had been split across chunk boundaries in the new layout, and no single chunk contained the full answer.

Your ground truth is coupled to your chunking strategy. Change one, update the other, or your eval results are measuring a fiction. I lost a week to this. The fix was not in the search algorithm. The fix was in the test harness.
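The regeneration step is, at its core, a word-boundary search over the new chunk layout. A minimal sketch with an illustrative interface, not the real script's:

```python
import re
from typing import Optional

def find_new_chunk(expected_passage: str,
                   chunks: dict[int, str]) -> Optional[int]:
    """Return the id of the first chunk containing the passage, else None."""
    # \b anchors keep "key" from matching inside "monkey".
    pattern = re.compile(r"\b" + re.escape(expected_passage) + r"\b",
                         re.IGNORECASE)
    for chunk_id, text in chunks.items():
        if pattern.search(text):
            return chunk_id
    # No single chunk holds the passage: flag for manual review.
    return None
```

Queries that come back None are exactly the ones whose expected content was split across the new chunk boundaries – the cases that needed manual review.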

The Sprint 17 Regression: Hubris, Meet Data

Sprint 17 is my favourite failure because it happened and was fixed in a single day (March 8), and because we should have seen it coming.

The reranker (BAAI/bge-reranker-base) was applying a double-sigmoid normalization that compressed all scores into a narrow range. We removed it. Mathematically correct – the reranker scores should flow through raw, preserving signal fidelity.

Accuracy dropped 3.3 percentage points immediately. Same-day eval confirmed it.

What we had not accounted for: the blend weights downstream were calibrated to the compressed score distribution. Remove the compression, and the reranker’s raw scores – which were already near-uniform around 0.50 – got amplified instead of dampened. The “better signal” was actually more noise, fed into a system tuned for the old noise profile.

The fix (also March 8, Sprint 17b) was not to restore the double-sigmoid. It was to reduce the reranker’s influence in the blend weights. Semantic blend went from 0.35/0.65 to 0.55/0.45. Less weight on a weak signal is better than normalizing a weak signal to look strong.

In a fusion pipeline, you cannot change one component’s output distribution without recalibrating everything downstream. Every stage trusts the statistical properties of the stage before it. Change those properties, and “improvement” becomes regression. We knew this in theory. Sprint 17 made sure I knew it in practice.

Five Bugs That Made 85% Possible

The jump from 67% (Sprint 42) to 85% was not one fix. It was five fixes applied in sequence over four days, each building on the last:

  1. BUG-EVAL-072: PCI-DSS acronyms in queries triggered precision/detail intent classification, which blocked the semantic rewriter from running at all. Semantic queries about PCI-DSS topics were being treated as exact-match lookups. The fix: exclude standard-name-only patterns from the semantic rewrite blocker.

  2. BUG-EVAL-071: The semantic rewriter was replacing query keywords with generic corpus vocabulary. “Key management” became “cryptographic operations.” The fix: preserve original query terms in the rewritten output, always.

  3. BUG-EVAL-074: Ground truth for one query was mapped to the wrong chunk. The eval said we were failing, but we were actually returning the right answer. (Even the eval can be wrong. Especially the eval can be wrong.)

  4. BUG-SEARCH-004: Domain-agnostic concept synonyms. “Hardening” should also match “configuration standards.” “Authentication” should also match “identity verification.” A synonym expansion table, queried at search time.

  5. BUG-SEARCH-005: FTS5 queries were crashing on commas and colons in user input. Not returning wrong results – crashing silently and falling back to vector-only search. Twenty-two queries were affected. The fix: sanitize punctuation before constructing FTS5 query strings. A second lesson learned: do not fail open when it really matters.
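The BUG-SEARCH-005 fix reduces to sanitizing the MATCH string before it reaches FTS5. A minimal sketch – the token rules here are simplified relative to real FTS5 tokenizers:

```python
import re

def sanitize_fts5_query(user_query: str) -> str:
    """Build a safe FTS5 MATCH string from raw user input."""
    # Keep word characters and hyphens; drop commas, colons, quotes, etc.,
    # which FTS5 would otherwise parse as query syntax.
    tokens = re.findall(r"[\w-]+", user_query)
    # Double-quote each token so terms like PCI-DSS are treated as literals.
    return " ".join(f'"{t}"' for t in tokens)
```

A query like "encryption, key management: scope" becomes four quoted literal tokens instead of a string FTS5 rejects, so the keyword leg of the hybrid search never silently drops out.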

None of these were glamorous. None of them would make a good conference talk. All of them are bugs a more experienced engineer probably would have known to prevent. But together they moved the needle 18 percentage points.

Lesson learned: RAG accuracy is not about the algorithm or model you choose. It is about the plumbing. Natural language is a source-dependent construct, so your plumbing has to enrich queries and avoid assumptions when your model does not have the experience to compensate.

The 300-Query Reality Check

After hitting 85% on the 100-query eval, I ran the full 300-query suite. 82.6% ET1, 92.6% Top-3, 96.0% Top-5, 98.7% Top-10.

The drop from 87% (100q) to the low-80s (300q) is expected – larger eval suites include harder queries and more edge cases. But here is the thing I have to be honest about: the 300-query ET1 has hovered between 80.9% and 82.6% across 11 runs. It does not move in a meaningful direction. Whether that is a plateau I can climb past or a ceiling imposed by the architecture depends on whether you think a better embedding model would change the physics of the problem.

Four queries persistently MISS. pci_q126 asks about “authentication” and retrieves “Sensitive Authentication Data” (PCI Requirement 3) instead of “User Authentication” (Requirement 8) – the same word means different things in different parts of the standard (WT_, am I right?). pci_q172 flips between TOP-10 and MISS across runs because of LLM rewrite non-determinism. About 12 queries show this kind of tier-boundary flipping: ET1 varies 81.3-82.6% for the exact same 300 queries depending on how the semantic rewriter phrases things. That variance is a fundamental property of any pipeline that includes an LLM, and I do not yet have a good answer for it – except maybe reaching out to the Council to suggest wording improvements, on the premise that AI will always struggle if your input data is not clean and well organized.

Performance: The Embarrassing Fix

For most of March, the 300-query eval took 6 to 16 hours to complete. Not because the queries were slow. Because I was spawning a new Python subprocess for every LLM call. Every query rewrite, every entity extraction, every answer generation: fork a process, load Python, import the modules, make the call, exit. My initial premise as I designed my tools was that modularizing common components would be essential to building more complex solutions. Lesson learned: as any better-trained engineer or solution architect would have told me, abstraction = tradeoffs.

The fix was exactly what you think it was. Stop forking. Import the module directly. Call the function.

  • Before: p50 latency ~10,000ms per query

  • After: p50 latency 1,251ms per query

8x faster. The 300-query eval now completes in 30 minutes. And accuracy went up 2 percentage points. The entity extraction step – the one adding 5-15 seconds per query via subprocess – was not just slow. It was adding noise to the retrieval signal. Removing it made the system both faster and more accurate.
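The shape of the fix, reduced to a toy. This is a sketch, not the real LLM-caller module – the work is simulated with an inline string transform:

```python
import subprocess
import sys

def call_via_subprocess(prompt: str) -> str:
    # The old path: fork a fresh interpreter per call, paying startup and
    # import cost every single time (simulated here with an inline -c script).
    proc = subprocess.run(
        [sys.executable, "-c",
         "import sys; print(sys.argv[1].upper())", prompt],
        capture_output=True, text=True,
    )
    return proc.stdout.strip()

def call_in_process(prompt: str) -> str:
    # The fix: same work, no fork. The 8x win came almost entirely from
    # eliminating process startup, not from faster model calls.
    return prompt.upper()
```

Both functions return identical results; only the per-call overhead differs, which is exactly why the eval numbers did not have to change for the latency to collapse.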

If an expensive preprocessing stage does not measurably improve retrieval, it is not enrichment. It is waste. Measure before you assume.

Act 7: What I Learned About Claude

The search system is half the story. The other half is the thing building the search system.

The Context Window Is a Pressure Gauge

I have written about this before, but it bears repeating because I watched it happen repeatedly during Archiva development. Around Sprint 8, I noticed that builder agents were producing thinner test coverage toward the end of long sessions. Reviews got less thorough. Commit messages got vaguer. The quality degradation was gradual enough to miss if you were not tracking it.

The V3.0 workflow system’s response was architectural: keep each work unit small (one task, one commit, one measurable outcome) and move validation into deterministic scripts that do not have context windows. The pre-commit hook does not care how long the conversation has been. It checks the same rules at token 5,000 and token 150,000.

The broader principle is not really about Claude specifically. It is about any system where quality is a function of resource pressure. Humans get tired. LLMs get compressed. Both need guardrails that are independent of their current capacity.

The Two-Commit Pattern

One of the most painful bugs in the project’s history was WU-656. Implementation committed at 11:04 AM. Plan review generated at 11:14 AM. The review caught P0 issues – nine minutes after the code was already archived.

The fix was the two-commit pattern. Commit 1: work unit definition and plan reviews (before implementation starts). Commit 2: implementation code and output reviews (after implementation, only if reviews pass). This creates an unforgeable timeline in git. You can prove that reviews ran before code was committed.

Is this overhead? Yes. Essential bureaucracy is a necessary cost of good business. But WU-656 proved that post-hoc reviews are security theater. If the review cannot block the commit, the review is a report, not a gate.

Act 8: What I Learned About RAG

The Retrieval Pyramid

After nine months, here is my mental model for RAG accuracy, from most to least impactful:

  1. Chunking strategy (how you split documents determines what can be found)

  2. Evaluation framework (you cannot improve what you cannot measure)

  3. Query understanding (intent classification, synonym expansion, query rewriting)

  4. Fusion and ranking (how you combine and order candidates)

  5. Embedding model quality (necessary but not sufficient)

  6. Answer generation (the LLM at the end matters least for retrieval accuracy)

Most RAG tutorials I’ve seen start at the bottom of this list, because simple RAG is really just a vector DB. They focus on which LLM to use for generation, which embedding model to pick, maybe which vector database. Those things matter, but they are not where the bugs are. The bugs are in chunking, intent classification, and FTS5 punctuation handling. Always the plumbing. Vectors alone cannot give you high accuracy on natural-language queries. Accuracy comes from enriching the context so the system “understands” how to ask for the best chunks.

Intent Classification Was the Biggest Single Win

The jump from the 30-query era (semantic at 25%) to Sprint 22 (semantic at 64.1% on the 100-query eval) came from two changes: swapping the reranker to ms-marco-MiniLM-L-6-v2 and enabling intent-aware pool sizing.

A “precision” query (“What does Requirement 8.3.1 say?”) needs a small candidate pool and exact-match boosting. A “semantic” query (“How does PCI-DSS address encryption at rest?”) needs a large candidate pool and semantic similarity scoring. Without intent classification, every query gets the same treatment. With it, each query type gets the retrieval strategy it needs.

On the 100-query eval: precision went from 78.1% (Sprint 22) to 93.8% (post-fixes). Semantic went from 64.1% to 82.1%. Same documents. Same embeddings. Different retrieval strategy per query type.
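The dispatch itself is almost trivially small, which is part of the point. A sketch using the pool sizes from Appendix C – the strategy fields and fallback behaviour are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalStrategy:
    pool_size: int
    exact_match_boost: bool

STRATEGIES = {
    # "What does Requirement 8.3.1 say?" -> small pool, exact-match boosting
    "precision": RetrievalStrategy(pool_size=50, exact_match_boost=True),
    # "How does PCI-DSS address encryption at rest?" -> wider semantic net
    "semantic": RetrievalStrategy(pool_size=75, exact_match_boost=False),
}

def strategy_for(intent: str) -> RetrievalStrategy:
    # Unknown intents fall back to the conservative precision profile.
    return STRATEGIES.get(intent, STRATEGIES["precision"])
```

All the hard-won accuracy lives upstream, in classifying the intent correctly; once the label exists, routing each query type to its retrieval strategy is a dictionary lookup.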

Fusion Pipelines Are Coupled Systems

Reciprocal rank fusion is supposed to be elegant. Combine ranks from vector search and keyword search with a simple formula, done. In practice, every parameter is coupled to every other parameter. Change the reranker model, the blend weights are wrong. Change the pool size, the fusion scores shift. Remove a normalization step, everything downstream breaks. Sprint 17 proved this (remove one sigmoid, lose 3.3 percentage points), and BUG-EVAL-071 proved the corollary: even the query rewriter’s choice of synonyms propagates through the entire pipeline. “Key management” and “cryptographic operations” retrieve different candidate sets, which produce different fusion scores, which produce different rankings. One word substitution in the query, different answer at the end. In a fusion pipeline, nothing is local.
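For reference, the textbook RRF formula the paragraph leans on: score(d) is the sum over result lists of 1/(k + rank of d), with k=60 as the commonly cited default. Archiva's exact constants are not shown here; this is the generic form:

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one ordering."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1/(k + rank); top ranks dominate,
            # but k damps the gap between rank 1 and rank 2.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The elegance is real, but so is the coupling: every downstream blend weight is implicitly calibrated to the score distribution this function emits, which is exactly what Sprint 17 demonstrated.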

The Current Scorecard

Archiva v3.0.26. March 2026.

| Metric | Value |
| --- | --- |
| Effective Top-1 (100q) | 87.0% |
| Effective Top-1 (300q) | 80.9%-82.6% (across 11 runs) |
| Top-3 (300q) | 92.6% |
| Top-10 (300q) | 98.7% |
| p50 latency | 1,251ms |
| p95 latency | 7,099ms |
| Total commits | 4,582 |
| Work units executed | 1,801 |
| Bugs archived | 249 |
| Sprint plans completed | 64 |
| Claude Code sessions | 177 (across all related projects) |
| Persistent MISS queries | 4 out of 300 |

Four queries MISS. One is a genuine homograph problem (the word “authentication” means different things in different PCI requirements). One flips with LLM non-determinism. Two are at the embedding model’s semantic boundary. They are not bugs to fix. They are the current limits of the architecture.

The 82.6% number on 300 queries has not moved in 11 runs. I do not know yet whether that is a plateau I can climb past with a better embedding model, or a ceiling imposed by the fundamental approach. Acknowledging what you do not know is (I fortunately learned long ago during my Philosophy days) more useful than pretending you have a plan or answer for everything.

What Comes Next

Archiva is preparing for open-source release. Multi-provider support is in place (LM Studio, OpenAI, Anthropic). The migration script works. The OSS contributor docs are written. The UAT bugs are fixed.

The 300-query plateau is the open question. The hypothesis is that a larger embedding model – something that can distinguish “system hardening” from “configuration standards” without a synonym table – would move the ceiling. But hypotheses are cheap. Evals are what matter. And I have learned (at some cost) not to trust hypotheses that have not survived a 300-query run.

The workflow system – the V3.0 multi-agent orchestration with five-perspective reviews and deterministic validation now called PrescientFlow – has been the quieter success. 1,801 work units executed with mechanical quality gates. It turned a solo dreamer with plenty of SDLC experience just wanting to vibe code with an AI assistant into something that operates as a small engineering team, complete with the bureaucracy that necessarily implies. We all have our biases and blind spots. No one entity can know about and manage all the risks…yet.

But here is what I keep coming back to, nine months in: the most powerful thing about building with AI is not the speed. It is the ability to enforce discipline that no human team could sustain. Five independent reviews on every code change, every time. No reviewer fatigue. No social pressure to approve. No Friday afternoon shortcuts. The irony is that the most useful application of artificial intelligence in this project has been artificial discipline.

Whether that is worth 4,582 commits and 177 Claude Code sessions to arrive at – ask me again after the open-source release.

Technical Appendix A: Commit Velocity by Month

| Month | Commits | Avg/Day | Key Theme |
| --- | --- | --- | --- |
| Sep 2025 | 52 | 5.2 | Project scaffolding, 12-module MVP |
| Oct 2025 | 388 | 12.5 | Architecture analysis, CLI audit, “Archiva” named |
| Nov 2025 | 436 | 14.5 | Module restructuring, prompt management |
| Dec 2025 | 1,841 | 59.4 | Sprint automation peak, WU-050 to WU-500+ |
| Jan 2026 | 472 | 15.2 | Search quality, threshold calibration, WU-500 to WU-724 |
| Feb 2026 | 585 | 20.9 | V3.0 migration, graph memory rebuild, WU-1406+ begins |
| Mar 2026 | 808 | 26.1 | Eval maturation, 33%→87% accuracy, performance, OSS prep |

Technical Appendix B: Evaluation History

Pre-V3.0 Era

| Date | Scale | ET1 | Notes |
| --- | --- | --- | --- |
| Jan 5 | 30q | 33% | WU-616: first measured baseline (“CRITICAL REGRESSION”) |
| ~Jan 26 | 30q | ~60% | Pre-V3.0 improvements (threshold calibration, entity extraction) |

30-Query Evals (V3.0 Era)

| Sprint | Date | ET1 | Semantic | Compound | Precision |
| --- | --- | --- | --- | --- | --- |
| 11 | Mar 6 | 60.0% | 25.0% | 54.5% | 81.8% |
| 14 | Mar 6 | 60.0% | 37.5% | 54.5% | 81.8% |
| 17 | Mar 8 | 56.7% | | | |
| 17b | Mar 8 | 60.0% | 25.0% | 54.5% | 81.8% |
| 22 | Mar 9 | 66.7% | 62.5% | 54.5% | 81.8% |

100-Query Evals (V3.0 Era)

| Sprint | Date | ET1 | Semantic | Compound | Precision | MISS | Key Change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 22 | Mar 9 | 68.0% | 64.1% | 69.0% | 78.1% | ~14 | ms-marco reranker swap |
| 30 | Mar 13 | ~62% | 59.0% | 58.6% | 71.9% | ~23 | Re-ingestion baseline crash |
| 35 | Mar 15 | 65.0% | 61.5% | 55.2% | 78.1% | ~18 | Semantic prototype expansion |
| 42 | Mar 17 | 67.0% | 61.5% | 58.6% | 81.2% | 17 | BUG-PRECISION-001 fix |
| Post-fixes | Mar 21 | 85.0% | 82.1% | 79.3% | 93.8% | 0 | 5 cumulative bug fixes |
| Post-asymmetric | Mar 22 | 86.0% | 82.1% | 79.3% | 96.9% | 0 | Asymmetric query handling |
| Post-perf | Mar 24 | 87.0% | 82.1% | 79.3% | 96.9% | 0 | Entity extraction skip |

300-Query Evals

| Run | Date | ET1 | Top-3 | Top-10 | MISS | Errors | p50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Mar 19 | 81.8% | 91.3% | 98.6% | 4 | 14 | 9.3s |
| 2 | Mar 19 | 81.9% | 91.9% | 98.3% | 5 | 2 | 9.7s |
| 5 | Mar 20 | 82.6% | 92.6% | 98.7% | 4 | 2 | |
| 6 | Mar 21 | 81.6% | 92.6% | 99.0% | 3 | 1 | |
| 11 | Mar 24 | 82.6% | 92.6% | 98.7% | 4 | 1 | 1.25s |

Note: ET1 varies 81.3-82.6% across runs for identical queries due to LLM rewrite non-determinism (~12 queries show tier-boundary flipping).

Technical Appendix C: Architecture

Retrieval Pipeline

Query → Intent Classification → [Vector Search + BM25 Search] → Score Fusion (RRF) → Cross-Encoder Reranking → Results

Supporting stages feeding the pipeline: synonym expansion, query sanitization, and template rewriting.

Key Parameters

| Parameter | Value | Why |
| --- | --- | --- |
| Embedding model | snowflake-arctic-embed | Consistency across ingest/search |
| Reranker | ms-marco-MiniLM-L-6-v2 | Fast, good recall |
| Fusion method | Reciprocal Rank Fusion | Scale-invariant |
| Chunk size | Adaptive (100-500 tokens) | Preserves section boundaries |
| Candidate pool (semantic) | 75 | Larger pool for broad queries |
| Candidate pool (precision) | 50 | Smaller pool, exact-match boosted |

Technical Appendix D: Session & Work Unit Metrics

| Metric | Value |
| --- | --- |
| Total commits | 4,582 |
| Work units completed | 1,801 |
| Highest WU ID | WU-1764 |
| Bugs archived / open | 249 / 20 |
| Sprint plans (V3.0) | 64 |
| Review agents per WU | 5 |
| Claude Code sessions (total) | 177 |
| Archiva sessions | 90 |
| Workflow dev sessions | 60 |
| LLM caller sessions | 26 |
| Sprint validation log entries | 1,534 (Dec 15 2025 - Jan 26 2026) |
| Pre-V3.0 WU range | WU-076 to WU-724 |
| V3.0 WU range | WU-1406 to WU-1764 |

Validation Report

Methodology: All quantitative claims sourced from git log, sprint-validation.jsonl, Claude Code session files, work unit frontmatter, and evaluation run outputs. The 33% baseline is from WU-616 (git commit 587f5cbc, January 5, 2026). Monthly commit counts verified via git log --since/--until. V2.9 infrastructure sizes verified via wc -l, wc -c, and find against .claude-bak/. Eval scale (30q vs 100q vs 300q) is explicitly labeled for every figure. 300-query ET1 reported as range (80.9%-82.6%), not single value.

Sources:

  • Git history: git log --oneline | wc -l = 4,582 commits

  • Sprint validation log: .claude-bak/logs/sprint-validation.jsonl (1,534 entries)

  • Claude Code sessions: ~/.claude/projects/ (177 sessions across 4 projects)

  • Work units: .claude/work-units/completed/ (1,801 files)

  • Bug archive: .claude/bugs/archive/ (249 files)

  • Eval data: memory/eval_history.md, memory/eval_history_300q.md

  • 33% baseline: commit 587f5cbc (WU-616, Jan 5, 2026)