Dec 9, 2025

Evaluation and metrics are crucial for AI workflow evolution

This is a short post with some raw results from my Claude workflow evolution.

Note: Using Claude alone won't get you these details.

Top 10 Longest Autonomous Sessions by Duration

Project: Archiva

Analysis Period: September 20, 2025 - December 9, 2025

Methodology: Sessions defined as commits <30 minutes apart

 

Summary Table

| Rank | Version | Duration | Commits | Density | Date | What Was Done |
|---|---|---|---|---|---|---|
| 1 | V2.2 | 144 min | 16 | 0.11/min | Oct 12 | P4-13 P5-8 documentation validation |
| 2 | V2.6.1 | 132 min | 39 | 0.30/min | Nov 9 | Test fix sprint (WU-TEST-FIX series) |
| 3 | V2.3.2 | 115 min | 9 | 0.08/min | Oct 26 | CLI remediation Phase 1 |
| 4 | V2.2 | 105 min | 11 | 0.11/min | Sep 28 | Module reorganization |
| 5 | V2.7 | 88 min | 10 | 0.11/min | Nov 19 | PAA OutputFormatter |
| 6 | V2.6.1 | 80 min | 10 | 0.12/min | Nov 8 | Test integrity fixes |
| 7 | V2.6.1 | 79 min | 18 | 0.23/min | Nov 7 | V2.6.1 workflow deployment |
| 8 | V2.4.1 | 78 min | 6 | 0.08/min | Oct 30 | CLI testing (338 tests passing) |
| 9 | V2.4.1 | 77 min | 4 | 0.05/min | Nov 1 | Search validation + prompt fixes |
| 10 | V2.7 | 69 min | 7 | 0.10/min | Nov 12 | Circuit breaker + RRF fusion |


 

The V2.9 Difference

V2.9's longest session is only 64 minutes, but look at the commit density:

 

| V2.9 Session | Duration | Commits | Density |
|---|---|---|---|
| #1 | 64 min | 39 | 0.61/min |
| #3 | 40 min | 48 | 1.20/min |
| #6 | 21 min | 29 | 1.37/min |
| #8 | 17 min | 25 | 1.51/min |


Key Insight

Measured by commit density, V2.9 sessions run at roughly 5-15x the V2.2 baseline of 0.11 commits/min:

 

  • The **40-minute V2.9 session** (#3) produced **48 commits**
  • The **144-minute V2.2 session** (#1) produced only **16 commits**

 

That V2.9 session produced 3x as many commits (48 vs 16) in 28% of the time (40 vs 144 minutes).

 

Version Distribution

| Version | Sessions in Top 10 | Avg Duration | Avg Density |
|---|---|---|---|
| V2.2 | 2 | 125 min | 0.11/min |
| V2.3.2 | 1 | 115 min | 0.08/min |
| V2.4.1 | 2 | 78 min | 0.07/min |
| V2.6.1 | 3 | 97 min | 0.22/min |
| V2.7 | 2 | 79 min | 0.11/min |


V2.9 does not appear in the top 10 by duration: its longest session (64 min) falls just short of the 69-minute cutoff, because its architecture favors shorter, denser sessions.

 

What Enabled Long Sessions by Version

| Version | Key Enabler | Why Sessions Were Long |
|---|---|---|
| V2.2 | Work unit structure | Structured but 7-agent review overhead |
| V2.3.2 | Hook improvements | Better validation but same review overhead |
| V2.4.1 | Define-and-deploy agent | Single-session completion but slow |
| V2.6.1 | Memory system | Context preserved, less human re-explanation |
| V2.7 | Memory upgrade | Better retrieval, sustained context |


 

Why V2.9 Sessions Are Shorter But Better

| Factor | V2.2-V2.7 | V2.9 |
|---|---|---|
| Reviews per work unit | 7 agents × 2 phases = 14 | Builder only × 2 phases = 2 |
| Human approval level | Per work unit | Per Story/Epic (not per Task) |
| Context system | None, then vector memory (from V2.6.1) | Graph memory |
| Commit overhead | High (review artifacts) | Low (builder only) |


Result: V2.9 completes more work in less time, with far less review overhead per work unit.

  

Appendix A: Methodology

A.1 Data Sources

All metrics were computed from:

 

1. **Git commit history**: `git log --format="%ai|%s" --since=2025-09-20`

2. **Agent review files**: `.claude/agent-reviews/*.md`

3. **Workflow version commits**: Specific commits marking version transitions

 

A.2 Session Detection Algorithm

Definition: An "autonomous session" is a sequence of commits where
            each consecutive pair has a gap of less than 30 minutes.
 
Algorithm:
1. Sort all commits chronologically (oldest first)
2. Initialize first commit as start of Session 1
3. For each subsequent commit:
   - Calculate gap = (current_timestamp - previous_timestamp)
   - If gap < 30 minutes: add to current session
   - If gap >= 30 minutes: close current session, start new session
4. Calculate session metrics:
   - Duration = last_commit_timestamp - first_commit_timestamp
   - Commit count = number of commits in session
   - Density = commit_count / duration_in_minutes
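The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the script used for this report: the function names (`parse_commits`, `detect_sessions`, `session_metrics`) are made up for the example, and it assumes input lines in the `%ai|%s` format from A.1, with a timezone offset at the end of each timestamp. A session with a single commit has zero duration, so its density is returned as `None` instead of dividing by zero.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # session threshold from the definition above


def parse_commits(lines):
    """Parse 'timestamp|subject' lines (git log --format="%ai|%s")."""
    commits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp, _, subject = line.partition("|")
        # %ai timestamps look like "2025-10-12 12:50:56 -0400"
        when = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S %z")
        commits.append((when, subject))
    # Step 1: oldest first
    return sorted(commits, key=lambda c: c[0])


def detect_sessions(commits, gap=GAP):
    """Steps 2-3: split sorted commits wherever the gap is >= the threshold."""
    sessions = []
    for when, subject in commits:
        if sessions and when - sessions[-1][-1][0] < gap:
            sessions[-1].append((when, subject))    # continue current session
        else:
            sessions.append([(when, subject)])      # start a new session
    return sessions


def session_metrics(session):
    """Step 4: duration, commit count, and density for one session."""
    minutes = (session[-1][0] - session[0][0]).total_seconds() / 60
    count = len(session)
    density = count / minutes if minutes > 0 else None
    return {"minutes": minutes, "commits": count, "density": density}


# Hypothetical sample data, not taken from the Archiva history.
sample = [
    "2025-01-01 12:00:00 +0000|first commit",
    "2025-01-01 12:10:00 +0000|second commit",
    "2025-01-01 12:25:00 +0000|third commit",
    "2025-01-01 13:00:00 +0000|after a 35 minute gap",
    "2025-01-01 13:30:00 +0000|after exactly 30 minutes",
]
sessions = detect_sessions(parse_commits(sample))
for s in sessions:
    print(session_metrics(s))
```

In the sample, the gap of exactly 30 minutes starts a new session, which matches the `>= 30 minutes` rule in step 3.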

 

Why 30 minutes?

 

The threshold was chosen because:

  • Claude typically commits within 1-15 minutes when working continuously
  • Gaps >30 minutes almost always indicate Claude waiting for human input
  • This threshold is consistent with the original Archiva autonomous sessions report 

A.3 Version Attribution

Each commit was attributed to a workflow version based on its date:

 

| Date Range | Version | Key Characteristics |
|---|---|---|
| Sep 20 - Sep 25 | Pre-workflow | No structured workflow |
| Sep 25 - Oct 26 | V2.2 | Work units, 7-agent reviews |
| Oct 26 - Oct 30 | V2.3.2 | Hook improvements |
| Oct 30 - Nov 6 | V2.4.1 | Define-and-deploy agent |
| Nov 6 - Nov 12 | V2.6.1 | Memory system added |
| Nov 12 - Nov 27 | V2.7 | Memory upgrade |
| Nov 27 - Dec 3 | V2.8 | Lean workflow (4512 scripts) |
| Dec 3 - present | V2.9 | Graph memory, builder-only reviews |


Source: Version transition commits in git history
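The date ranges in the table share their boundary days (Oct 26 ends V2.2 and starts V2.3.2, for example), so any date-based lookup has to pick a convention. The sketch below assumes each version begins on its listed start date, which agrees with the summary table rows for Oct 26 (V2.3.2), Oct 30 (V2.4.1), and Nov 12 (V2.7). The real transitions happened at specific commits, so treat this as an approximation; `VERSION_STARTS` and `version_for` are names invented for the example.

```python
from bisect import bisect_right
from datetime import date

# (start date, version) pairs from the table above.
# Assumption: a boundary day belongs to the newer version.
VERSION_STARTS = [
    (date(2025, 9, 20), "Pre-workflow"),
    (date(2025, 9, 25), "V2.2"),
    (date(2025, 10, 26), "V2.3.2"),
    (date(2025, 10, 30), "V2.4.1"),
    (date(2025, 11, 6), "V2.6.1"),
    (date(2025, 11, 12), "V2.7"),
    (date(2025, 11, 27), "V2.8"),
    (date(2025, 12, 3), "V2.9"),
]
_STARTS = [start for start, _ in VERSION_STARTS]


def version_for(day):
    """Return the workflow version in effect on the given date."""
    idx = bisect_right(_STARTS, day) - 1  # last start date <= day
    if idx < 0:
        raise ValueError(f"{day} is before the analysis period")
    return VERSION_STARTS[idx][1]


print(version_for(date(2025, 10, 12)))  # V2.2
print(version_for(date(2025, 10, 26)))  # V2.3.2
print(version_for(date(2025, 12, 8)))   # V2.9
```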

  

A.4 Commit Density Calculation

Density = Total Commits in Session / Duration in Minutes
 
Example (V2.2, Oct 12):
  - 16 commits over 144 minutes
  - Density = 16 / 144 = 0.11 commits/min
 
Example (V2.9, Dec 8):
  - 48 commits over 40 minutes
  - Density = 48 / 40 = 1.20 commits/min

 

A.5 Examples from Each Workflow Version

V2.2 Example (Oct 12, 2025 - Session #1)

Git log excerpt:

2025-10-12 12:50:56 | [Work Unit] P4-13: Create validate_setup.py for shared Module
2025-10-12 13:01:16 | docs(P4-13): Rename validate_module.py to validate_setup.py
                      ↑ 10 minute gap - includes 7-agent review cycle


Agent reviews generated: 35 review files from Oct 12 session

  • Pattern: `vision-2025-10-12-*.md`, `scope-2025-10-12-*.md`, etc.
  • Each work unit required 7 agents × 2 phases = 14 reviews


Why 10 minutes between commits?

  • 7 agents ran reviews on the work unit plan
  • Claude read and synthesized review findings
  • Implementation took only ~2 minutes
  • Remaining time was review overhead

 

Citation (commit fcfe6ae):

Vision: ALIGNED - Solves real usability issue
Scope: APPROPRIATE - 2 files, clear boundaries
Design: NEEDS IMPROVEMENT - Recommends rename instead of new file
Simplicity: TOO COMPLEX (WORK NOT NEEDED) - Existing file already works
Testing: GAPS IDENTIFIED - No tests for validator
Validation: ADEQUATE - All criteria testable
Tattle-Tale: REJECT - Work not needed, just rename existing file

 

Claude followed Tattle-Tale's recommendation and renamed the existing file instead of creating a new one.

 

V2.6.1 Example (Nov 9, 2025 - Session #2)

Git log excerpt:

2025-11-09 13:20:30 | [Work Unit] WU-TEST-FIX-CATEGORIZER-001 - Fix CategorizerConfig
2025-11-09 13:21:52 | [Investigation Complete] - Bug Description Incorrect
2025-11-09 13:22:51 | [Work Unit Complete] - Investigation Complete
2025-11-09 13:28:25 | [Work Unit] WU-TEST-FIX-PDF-CONFIG-001 - Fix Type Annotations
2025-11-09 13:31:11 | [Bug Fix] WU-TEST-FIX-LLM-CLI-002 - Fix Import Paths

 

Density improvement: 0.30 commits/min (roughly 3x the V2.2 density of 0.11)

 

Why faster?

  • Memory system (added in V2.6.1) preserved context
  • Less time re-explaining project structure to Claude
  • Same 7-agent reviews, but faster context gathering

 

Citation (V2.6.1 migration commit a8da398):

Scripts Added (6):
- background_test_runner.py (with parallel execution prevention)
- query_memory.py (semantic search over work units)
- generate_embeddings.py (index work units/reviews)
- vector_db.py (SQLite vector database)
...
New Features:
- Memory system (optional - activate with generate_embeddings.py)

 

V2.9 Example (Dec 8, 2025 - 40 min, 48 commits)

Git log excerpt:

2025-12-08 13:40:10 | [WU-009-01] Plan: Add _detect_compound_query() method
2025-12-08 13:40:38 | [WU-005-01] Add MAX_ALLOWED_THRESHOLD = 0.7 constant
2025-12-08 13:41:18 | [WU-005-01] Archive work unit
2025-12-08 13:41:27 | [WU-009-01] Archive work unit
2025-12-08 13:42:12 | [WU-007-02] Add --reset-patterns CLI option
2025-12-08 13:42:14 | [WU-005-02] Archive work unit
2025-12-08 13:43:20 | [WU-007-02] Archive work unit
2025-12-08 13:44:00 | [WU-006-01] Fix warning message consistency

 

Density: 1.20 commits/min (about 11x the V2.2 density of 0.11)

 

Why dramatically faster?

 

1. **Builder-only reviews** (60 builder reviews on Dec 8 vs 35 multi-agent reviews on Oct 12)

2. **Graph memory** provides instant dependency awareness

3. **Sprint runner** automates work unit sequencing

4. **Three-tier hierarchy** - human approved Story, Tasks run autonomously

 

Citation (V2.9 addition commit 214e72b):

- Add planner agent scripts (planner_core.py, planner_output.py, etc.)
- Add sprint runner integration
- Add work unit documentation from completed sprint
- Add graph memory database
- Update workflow config

 

Citation (V2.8 lean workflow commit bbcb7fe):

### Agent Consolidation (7 → 5)
- Vision: Right problem, project alignment (50 lines)
- Scope: 1-5 files, clear boundaries (50 lines)
- Design: Patterns + simplicity merged (50 lines)
- Testing: Strategy + validation merged (50 lines)
- Tattle-Tale: Cross-review synthesis (80 lines)

 

V2.9 further reduced to builder-only at Task level.

  

A.6 Why Density Differs by Version

| Version | Reviews/WU | Commits/WU | Est. Time/WU | Density Factor |
|---|---|---|---|---|
| V2.2 | 14 (7×2) | ~2 | ~18 min | 1.0x (baseline) |
| V2.6.1 | 14 (7×2) | ~3 | ~10 min | 1.8x |
| V2.8 | 10 (5×2) | ~2 | ~8 min | 2.3x |
| V2.9 | 2 (1×2) | ~2 | ~1.5 min | 12x |
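The Density Factor column is consistent with dividing the V2.2 baseline time per work unit by each version's estimated time per work unit. That reading is inferred from the numbers rather than stated in the table, so the sketch below is a check on the arithmetic under that assumption, not a description of how the column was produced. The names `EST_MINUTES_PER_WU` and `density_factor` are illustrative.

```python
BASELINE = "V2.2"

# Estimated minutes per work unit, copied from the table above.
EST_MINUTES_PER_WU = {"V2.2": 18, "V2.6.1": 10, "V2.8": 8, "V2.9": 1.5}


def density_factor(version):
    """Speed-up relative to the baseline, as baseline time / version time."""
    return EST_MINUTES_PER_WU[BASELINE] / EST_MINUTES_PER_WU[version]


for version in EST_MINUTES_PER_WU:
    print(f"{version}: {density_factor(version):.2f}x")
```

Under this reading V2.8 comes out at 2.25x, which the table shows rounded to 2.3x. The factor ignores the Commits/WU column, so V2.6.1's extra commit per work unit is not reflected in it.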

Key factors:

 

1. **Review overhead reduction**

2. **Context preservation**

3. **Human approval level**

 

A.7 Verification

To reproduce this analysis:

 

# Get all commits with timestamps
cd /Users/user/archiva
git log --format="%ai|%s" --since=2025-09-20 > commits.txt
 
# Count commits by version period
git log --oneline --since=2025-09-25 --until=2025-10-26 | wc -l  # V2.2
git log --oneline --since=2025-12-03 | wc -l                      # V2.9
 
# Count agent reviews by type
ls .claude/agent-reviews/ | grep "^builder-" | wc -l
ls .claude/agent-reviews/ | grep "^vision-" | wc -l
 
# Find version transition commits
git log --oneline --grep="V2\.[0-9]"

 


A.8 Limitations

1. **30-minute threshold is arbitrary** - Different thresholds would yield different session counts

2. **Commit ≠ work** - Some commits are small (archival), others are large (implementation)

3. **Human involvement not directly measured** - Inferred from gaps and commit patterns

4. **Agent review time not directly timestamped** - Inferred from commit gaps

5. **Version boundaries are approximate** - Workflow changes may have been gradual
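One way to gauge the first limitation is to recount sessions under several thresholds. The sketch below does that on made-up timestamps; `count_sessions` is a hypothetical helper, and the sample offsets are not taken from the Archiva history. It uses the same rule as A.2: a gap at or above the threshold starts a new session.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, threshold_minutes):
    """Count sessions when a gap >= threshold_minutes starts a new one."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    gap = timedelta(minutes=threshold_minutes)
    breaks = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a >= gap)
    return breaks + 1


# Hypothetical commit times: minutes after an arbitrary start.
base = datetime(2025, 1, 1, 9, 0)
offsets = [0, 5, 25, 50, 95, 100]  # gaps of 5, 20, 25, 45, 5 minutes
stamps = [base + timedelta(minutes=m) for m in offsets]

for threshold in (15, 30, 45, 60):
    print(threshold, count_sessions(stamps, threshold))
```

If the session count stays stable across nearby thresholds on the real commit history, the choice of 30 minutes matters less; large swings would suggest the rankings are sensitive to it.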

 

Generated: December 9, 2025
