Juggling Risk for Fun and Profit: Evaluation and metrics are crucial for AI workflow evolution

This is going to be a short post today, with just some raw results from my Claude workflow evolution.

Note: Using Claude alone won't get you these details.

Top 10 Longest Autonomous Sessions by Duration

Project: Archiva

Analysis Period: September 20, 2025 - December 9, 2025

Methodology: A session is a run of consecutive commits, each less than 30 minutes after the previous one

Summary Table

| Rank | Version | Duration | Commits | Density | Date | What Was Done |
|---|---|---|---|---|---|---|
| 1 | V2.2 | 144 min | 16 | 0.11/min | Oct 12 | P4-13 → P5-8 documentation validation |
| 2 | V2.6.1 | 132 min | 39 | 0.30/min | Nov 9 | Test fix sprint (WU-TEST-FIX series) |
| 3 | V2.3.2 | 115 min | 9 | 0.08/min | Oct 26 | CLI remediation Phase 1 |
| 4 | V2.2 | 105 min | 11 | 0.11/min | Sep 28 | Module reorganization |
| 5 | V2.7 | 88 min | 10 | 0.11/min | Nov 19 | PAA OutputFormatter |
| 6 | V2.6.1 | 80 min | 10 | 0.12/min | Nov 8 | Test integrity fixes |
| 7 | V2.6.1 | 79 min | 18 | 0.23/min | Nov 7 | V2.6.1 workflow deployment |
| 8 | V2.4.1 | 78 min | 6 | 0.08/min | Oct 30 | CLI testing (338 tests passing) |
| 9 | V2.4.1 | 77 min | 4 | 0.05/min | Nov 1 | Search validation + prompt fixes |
| 10 | V2.7 | 69 min | 7 | 0.10/min | Nov 12 | Circuit breaker + RRF fusion |

The V2.9 Difference

V2.9's longest session is only 64 minutes, but look at the commit density:

| V2.9 Session | Duration | Commits | Density |
|---|---|---|---|
| #1 | 64 min | 39 | 0.61/min |
| #3 | 40 min | 48 | 1.20/min |
| #6 | 21 min | 29 | 1.37/min |
| #8 | 17 min | 25 | 1.51/min |

Key Insight

V2.9 sessions produce roughly 5-15x as many commits per minute as the ~0.11/min typical of earlier versions:

- The **40-minute V2.9 session** (#3) produced **48 commits**
- The **144-minute V2.2 session** (#1) produced only **16 commits**

That V2.9 session completed 3x as many commits in 28% of the time.

Version Distribution

| Version | Sessions in Top 10 | Avg Duration | Avg Density |
|---|---|---|---|
| V2.2 | 2 | 125 min | 0.11/min |
| V2.3.2 | 1 | 115 min | 0.08/min |
| V2.4.1 | 2 | 78 min | 0.07/min |
| V2.6.1 | 3 | 97 min | 0.22/min |
| V2.7 | 2 | 79 min | 0.11/min |

V2.9 does not appear in top 10 by duration because its architecture favors shorter, denser sessions.
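As a rough check, the per-version averages can be recomputed from the ten rows of the summary table. This is a minimal sketch, assuming "Avg Density" means the mean of the per-session densities (the post does not say; total commits over total minutes would be another reading). The names `TOP_10` and `summarize` are made up for illustration.

```python
from collections import defaultdict

# (version, duration in minutes, commits) for the top-10 rows of the summary table.
TOP_10 = [
    ("V2.2", 144, 16), ("V2.6.1", 132, 39), ("V2.3.2", 115, 9),
    ("V2.2", 105, 11), ("V2.7", 88, 10), ("V2.6.1", 80, 10),
    ("V2.6.1", 79, 18), ("V2.4.1", 78, 6), ("V2.4.1", 77, 4),
    ("V2.7", 69, 7),
]


def summarize(sessions):
    """Return {version: (session_count, avg_duration_min, avg_density)}."""
    grouped = defaultdict(list)
    for version, minutes, commits in sessions:
        grouped[version].append((minutes, commits))

    summary = {}
    for version, rows in grouped.items():
        n = len(rows)
        avg_duration = sum(m for m, _ in rows) / n
        # Assumption: average the per-session densities, not total commits / total minutes.
        avg_density = sum(c / m for m, c in rows) / n
        summary[version] = (n, avg_duration, avg_density)
    return summary


summary = summarize(TOP_10)
for version, (n, avg_duration, avg_density) in sorted(summary.items()):
    print(f"{version}: {n} sessions, {avg_duration:.1f} min, {avg_density:.3f}/min")
```

Because this works from unrounded densities, the last digit can differ from the table: V2.4.1 comes out near 0.064/min here, while averaging the already-rounded 0.08 and 0.05 gives the table's 0.07.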

What Enabled Long Sessions by Version

| Version | Key Enabler | Why Sessions Were Long |
|---|---|---|
| V2.2 | Work unit structure | Structured but 7-agent review overhead |
| V2.3.2 | Hook improvements | Better validation but same review overhead |
| V2.4.1 | Define-and-deploy agent | Single-session completion but slow |
| V2.6.1 | Memory system | Context preserved, less human re-explanation |
| V2.7 | Memory upgrade | Better retrieval, sustained context |

Why V2.9 Sessions Are Shorter But Better

| Factor | V2.2-V2.7 | V2.9 |
|---|---|---|
| Reviews per work unit | 7 agents × 2 phases = 14 | Builder only × 2 phases = 2 |
| Human approval level | Per work unit | Per Story/Epic (not per Task) |
| Context system | None → Vector memory | Graph memory |
| Commit overhead | High (review artifacts) | Low (builder only) |

Result: V2.9 completes more work in less time with fewer commits per work unit.

Appendix A: Methodology

A.1 Data Sources

All metrics were computed from:

1. **Git commit history**: `git log --format="%ai|%s" --since=2025-09-20`

2. **Agent review files**: `.claude/agent-reviews/*.md`

3. **Workflow version commits**: Specific commits marking version transitions

A.2 Session Detection Algorithm

Definition: An "autonomous session" is a sequence of commits where
each consecutive pair has a gap of less than 30 minutes.

Algorithm:
1. Sort all commits chronologically (oldest first)
2. Initialize first commit as start of Session 1
3. For each subsequent commit:
   - Calculate gap = (current_timestamp - previous_timestamp)
   - If gap < 30 minutes: add to current session
   - If gap >= 30 minutes: close current session, start new session
4. Calculate session metrics:
   - Duration = last_commit_timestamp - first_commit_timestamp
   - Commit count = number of commits in session
   - Density = commit_count / duration_in_minutes
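The steps above translate fairly directly into code. The following is a minimal sketch, not the script actually used for this post: it assumes the input lines come from `git log --format="%ai|%s"`, and the names `Session`, `parse_timestamps` and `detect_sessions` are hypothetical. Single-commit sessions have zero duration, so this sketch reports their density as 0.0, which is a choice the algorithm above leaves open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GAP_THRESHOLD = timedelta(minutes=30)  # threshold from the session definition


@dataclass
class Session:
    start: datetime
    end: datetime
    commits: int

    @property
    def duration_min(self) -> float:
        return (self.end - self.start).total_seconds() / 60

    @property
    def density(self) -> float:
        # Assumption: a one-commit session (zero duration) gets density 0.0.
        return self.commits / self.duration_min if self.duration_min > 0 else 0.0


def parse_timestamps(lines):
    """Parse 'YYYY-MM-DD HH:MM:SS +ZZZZ|subject' lines (git's %ai|%s output)."""
    stamps = []
    for line in lines:
        raw, _, _subject = line.partition("|")
        stamps.append(datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S %z"))
    return stamps


def detect_sessions(timestamps, threshold=GAP_THRESHOLD):
    """Group commits into sessions: a gap >= threshold starts a new session."""
    sessions = []
    for ts in sorted(timestamps):  # oldest first
        if sessions and ts - sessions[-1].end < threshold:
            sessions[-1].end = ts
            sessions[-1].commits += 1
        else:
            sessions.append(Session(start=ts, end=ts, commits=1))
    return sessions


# Example input; the +0000 offsets and subjects are placeholders.
log_lines = [
    "2025-10-12 12:50:56 +0000|first commit",
    "2025-10-12 13:01:16 +0000|second commit",
    "2025-10-12 13:40:00 +0000|third commit, after a gap of more than 30 minutes",
]
sessions = detect_sessions(parse_timestamps(log_lines))
for s in sessions:
    print(s.commits, round(s.duration_min, 1), round(s.density, 2))
```

Only the timestamp is needed for session detection, so the commit subject is parsed and discarded here.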

Why 30 minutes?

The threshold was chosen because:

- Claude typically commits within 1-15 minutes when working continuously
- Gaps >30 minutes almost always indicate Claude waiting for human input
- This threshold is consistent with the original Archiva autonomous sessions report

A.3 Version Attribution

Each commit was attributed to a workflow version based on its date:

| Date Range | Version | Key Characteristics |
|---|---|---|
| Sep 20 - Sep 25 | Pre-workflow | No structured workflow |
| Sep 25 - Oct 26 | V2.2 | Work units, 7-agent reviews |
| Oct 26 - Oct 30 | V2.3.2 | Hook improvements |
| Oct 30 - Nov 6 | V2.4.1 | Define-and-deploy agent |
| Nov 6 - Nov 12 | V2.6.1 | Memory system added |
| Nov 12 - Nov 27 | V2.7 | Memory upgrade |
| Nov 27 - Dec 3 | V2.8 | Lean workflow (45→12 scripts) |
| Dec 3 - present | V2.9 | Graph memory, builder-only reviews |

Source: Version transition commits in git history
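Date-based attribution can be sketched as a simple lookup. The table's ranges share their boundary days, so this sketch assumes a boundary day belongs to the newer version, which matches the summary table (for example, the Oct 26 session is listed as V2.3.2). The real attribution may instead use the exact time of each transition commit; `VERSION_STARTS` and `version_for` are illustrative names.

```python
from datetime import date

# (first day of the version, label), taken from the date-range table.
VERSION_STARTS = [
    (date(2025, 9, 20), "Pre-workflow"),
    (date(2025, 9, 25), "V2.2"),
    (date(2025, 10, 26), "V2.3.2"),
    (date(2025, 10, 30), "V2.4.1"),
    (date(2025, 11, 6), "V2.6.1"),
    (date(2025, 11, 12), "V2.7"),
    (date(2025, 11, 27), "V2.8"),
    (date(2025, 12, 3), "V2.9"),
]


def version_for(day):
    """Return the workflow version in effect on `day`, or None before Sep 20."""
    label = None
    for start, name in VERSION_STARTS:  # list is in chronological order
        if day >= start:
            label = name  # assumption: a boundary day counts as the newer version
    return label


print(version_for(date(2025, 10, 12)))  # V2.2
print(version_for(date(2025, 12, 8)))   # V2.9
```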

A.4 Commit Density Calculation

Density = Total Commits in Session / Duration in Minutes

Example (V2.2, Oct 12):
- 16 commits over 144 minutes
- Density = 16 / 144 = 0.11 commits/min

Example (V2.9, Dec 8):
- 48 commits over 40 minutes
- Density = 48 / 40 = 1.20 commits/min
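The same two examples as a small helper, mainly to make the ratio between them explicit. This is only an illustration of the formula; the function name `density` is made up here.

```python
def density(commits: int, duration_min: float) -> float:
    """Commits per minute for one session."""
    if duration_min <= 0:
        raise ValueError("duration must be positive")
    return commits / duration_min


v22 = density(16, 144)  # V2.2, Oct 12
v29 = density(48, 40)   # V2.9, Dec 8
print(f"{v22:.2f} {v29:.2f} {v29 / v22:.1f}x")  # 0.11 1.20 10.8x
```

Using the unrounded V2.2 density gives 10.8x; dividing by the rounded 0.11 gives 10.9x. Both round to the "11x" quoted elsewhere in this post.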

A.5 Examples from Each Workflow Version

V2.2 Example (Oct 12, 2025 - Session #1)

Git log excerpt:

2025-10-12 12:50:56 | [Work Unit] P4-13: Create validate_setup.py for shared Module
2025-10-12 13:01:16 | docs(P4-13): Rename validate_module.py to validate_setup.py
↑ 10 minute gap - includes 7-agent review cycle

Agent reviews generated: 35 review files from Oct 12 session

- Pattern: `vision-2025-10-12-*.md`, `scope-2025-10-12-*.md`, etc.
- Each work unit required 7 agents × 2 phases = 14 reviews

Why 10 minutes between commits?

- 7 agents ran reviews on the work unit plan
- Claude read and synthesized review findings
- Implementation took only ~2 minutes
- Remaining time was review overhead

Citation (commit fcfe6ae):

Vision: ALIGNED - Solves real usability issue
Scope: APPROPRIATE - 2 files, clear boundaries
Design: NEEDS IMPROVEMENT - Recommends rename instead of new file
Simplicity: TOO COMPLEX (WORK NOT NEEDED) - Existing file already works
Testing: GAPS IDENTIFIED - No tests for validator
Validation: ADEQUATE - All criteria testable
Tattle-Tale: REJECT - Work not needed, just rename existing file

Claude followed Tattle-Tale's recommendation and renamed the existing file instead of creating a new one.

V2.6.1 Example (Nov 9, 2025 - Session #2)

Git log excerpt:

2025-11-09 13:20:30 | [Work Unit] WU-TEST-FIX-CATEGORIZER-001 - Fix CategorizerConfig
2025-11-09 13:21:52 | [Investigation Complete] - Bug Description Incorrect
2025-11-09 13:22:51 | [Work Unit Complete] - Investigation Complete
2025-11-09 13:28:25 | [Work Unit] WU-TEST-FIX-PDF-CONFIG-001 - Fix Type Annotations
2025-11-09 13:31:11 | [Bug Fix] WU-TEST-FIX-LLM-CLI-002 - Fix Import Paths

Density improvement: 0.30 commits/min (roughly 3x the 0.11/min of the V2.2 session)

Why faster?

- Memory system (added in V2.6.1) preserved context
- Less time re-explaining project structure to Claude
- Same 7-agent reviews, but faster context gathering

Citation (V2.6.1 migration commit a8da398):

Scripts Added (6):
- background_test_runner.py (with parallel execution prevention)
- query_memory.py (semantic search over work units)
- generate_embeddings.py (index work units/reviews)
- vector_db.py (SQLite vector database)
...
New Features:
- Memory system (optional - activate with generate_embeddings.py)

V2.9 Example (Dec 8, 2025 - 40 min, 48 commits)

Git log excerpt:

2025-12-08 13:40:10 | [WU-009-01] Plan: Add _detect_compound_query() method
2025-12-08 13:40:38 | [WU-005-01] Add MAX_ALLOWED_THRESHOLD = 0.7 constant
2025-12-08 13:41:18 | [WU-005-01] Archive work unit
2025-12-08 13:41:27 | [WU-009-01] Archive work unit
2025-12-08 13:42:12 | [WU-007-02] Add --reset-patterns CLI option
2025-12-08 13:42:14 | [WU-005-02] Archive work unit
2025-12-08 13:43:20 | [WU-007-02] Archive work unit
2025-12-08 13:44:00 | [WU-006-01] Fix warning message consistency

Density: 1.20 commits/min (about 11x the 0.11/min of the V2.2 session)

Why dramatically faster?

1. **Builder-only reviews** (60 builder reviews on Dec 8 vs 35 multi-agent reviews on Oct 12)

2. **Graph memory** provides instant dependency awareness

3. **Sprint runner** automates work unit sequencing

4. **Three-tier hierarchy** - a human approves the Story, then Tasks run autonomously

Citation (V2.9 addition commit 214e72b):

- Add planner agent scripts (planner_core.py, planner_output.py, etc.)
- Add sprint runner integration
- Add work unit documentation from completed sprint
- Add graph memory database
- Update workflow config

Citation (V2.8 lean workflow commit bbcb7fe):

### Agent Consolidation (7 → 5)
- Vision: Right problem, project alignment (50 lines)
- Scope: 1-5 files, clear boundaries (50 lines)
- Design: Patterns + simplicity merged (50 lines)
- Testing: Strategy + validation merged (50 lines)
- Tattle-Tale: Cross-review synthesis (80 lines)

V2.9 further reduced to builder-only at Task level.

A.6 Why Density Differs by Version

| Version | Reviews/WU | Commits/WU | Est. Time/WU | Density Factor |
|---|---|---|---|---|
| V2.2 | 14 (7×2) | ~2 | ~18 min | 1.0x (baseline) |
| V2.6.1 | 14 (7×2) | ~3 | ~10 min | 1.8x |
| V2.8 | 10 (5×2) | ~2 | ~8 min | 2.3x |
| V2.9 | 2 (1×2) | ~2 | ~1.5 min | 12x |
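The "Density Factor" column appears to be the V2.2 time per work unit divided by each version's time per work unit. The post does not state that formula, so treat this as an assumption that happens to reproduce the column. `EST_MIN_PER_WU` and `density_factor` are illustrative names.

```python
BASELINE_MIN_PER_WU = 18  # V2.2 estimate from the table

# Estimated minutes per work unit, copied from the table.
EST_MIN_PER_WU = {"V2.2": 18, "V2.6.1": 10, "V2.8": 8, "V2.9": 1.5}


def density_factor(minutes_per_wu, baseline=BASELINE_MIN_PER_WU):
    """Assumed formula: how many times faster a work unit completes than in V2.2."""
    return baseline / minutes_per_wu


factors = {version: density_factor(m) for version, m in EST_MIN_PER_WU.items()}
for version, factor in factors.items():
    print(version, factor)
```

The V2.8 value comes out as 2.25, which the table shows as 2.3x.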

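Key factors:

1. **Review overhead reduction** - Reviews per work unit dropped from 14 (V2.2) to 10 (V2.8) to 2 (V2.9)

2. **Context preservation** - The memory system added in V2.6.1, and graph memory in V2.9, cut the time spent re-gathering context

3. **Human approval level** - Approval moved from every work unit to the Story/Epic level in V2.9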

A.7 Verification

To reproduce this analysis:

# Get all commits with timestamps
cd /Users/user/archiva
git log --format="%ai|%s" --since=2025-09-20 > commits.txt

# Count commits by version period
git log --oneline --since=2025-09-25 --until=2025-10-26 | wc -l # V2.2
git log --oneline --since=2025-12-03 | wc -l # V2.9

# Count agent reviews by type
ls .claude/agent-reviews/ | grep "^builder-" | wc -l
ls .claude/agent-reviews/ | grep "^vision-" | wc -l

# Find version transition commits
git log --oneline --grep="V2\.[0-9]"

A.8 Limitations

1. **30-minute threshold is arbitrary** - Different thresholds would yield different session counts

2. **Commit ≠ work** - Some commits are small (archival), others are large (implementation)

3. **Human involvement not directly measured** - Inferred from gaps and commit patterns

4. **Agent review time not directly timestamped** - Inferred from commit gaps

5. **Version boundaries are approximate** - Workflow changes may have been gradual
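The first limitation is easy to probe by re-running session detection with several thresholds. The sketch below uses synthetic timestamps purely for illustration (they are not from the Archiva history), and `count_sessions` is a hypothetical helper rather than part of the original analysis.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, threshold_min):
    """Count sessions when a gap >= threshold_min minutes starts a new one."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    threshold = timedelta(minutes=threshold_min)
    breaks = sum(
        1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev >= threshold
    )
    return breaks + 1


# Synthetic commit times for illustration only: gaps of 5, 20, 25, 45 and 5 minutes.
base = datetime(2025, 1, 1, 9, 0)
offsets_min = [0, 5, 25, 50, 95, 100]
commits = [base + timedelta(minutes=m) for m in offsets_min]

for threshold_min in (15, 30, 60):
    print(threshold_min, count_sessions(commits, threshold_min))
```

With these made-up gaps, the 15, 30 and 60 minute thresholds give 4, 2 and 1 sessions, so the same history can look like several short sessions or one long one.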

Generated: December 9, 2025
