This is a short post with some raw results from the evolution of my Claude workflow.
Note: Using Claude alone won't get you these details.
Top 10 Longest Autonomous Sessions by Duration
Project: Archiva
Analysis Period: September 20, 2025 - December 9, 2025
Methodology: Sessions defined as commits <30 minutes apart
Summary Table
| Rank | Version | Duration | Commits | Density | Date | What Was Done |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | V2.2 | 144 min | 16 | 0.11/min | Oct 12 | P4-13 → P5-8 documentation validation |
| 2 | V2.6.1 | 132 min | 39 | 0.30/min | Nov 9 | Test fix sprint (WU-TEST-FIX series) |
| 3 | V2.3.2 | 115 min | 9 | 0.08/min | Oct 26 | CLI remediation Phase 1 |
| 4 | V2.2 | 105 min | 11 | 0.11/min | Sep 28 | Module reorganization |
| 5 | V2.7 | 88 min | 10 | 0.11/min | Nov 19 | PAA OutputFormatter |
| 6 | V2.6.1 | 80 min | 10 | 0.12/min | Nov 8 | Test integrity fixes |
| 7 | V2.6.1 | 79 min | 18 | 0.23/min | Nov 7 | V2.6.1 workflow deployment |
| 8 | V2.4.1 | 78 min | 6 | 0.08/min | Oct 30 | CLI testing (338 tests passing) |
| 9 | V2.4.1 | 77 min | 4 | 0.05/min | Nov 1 | Search validation + prompt fixes |
| 10 | V2.7 | 69 min | 7 | 0.10/min | Nov 12 | Circuit breaker + RRF fusion |
The V2.9 Difference
V2.9's longest session is only 64 minutes, but look at the commit density:
| V2.9 Session | Duration | Commits | Density |
| --- | --- | --- | --- |
| #1 | 64 min | 39 | 0.61/min |
| #3 | 40 min | 48 | 1.20/min |
| #6 | 21 min | 29 | 1.37/min |
| #8 | 17 min | 25 | 1.51/min |
Key Insight
Measured by commit density, V2.9 sessions are roughly 5-15x more productive per minute than typical earlier sessions (about 0.11 commits/min):
- The **40-minute V2.9 session** (#3) produced **48 commits**
- The **144-minute V2.2 session** (#1) produced only **16 commits**
That is 3x as many commits (48 vs 16) in 28% of the time (40 of 144 minutes).
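The arithmetic behind this comparison can be checked in a few lines. This is a minimal sketch using only the figures from the two sessions quoted above; the helper name `compare_sessions` is hypothetical and not part of the Archiva tooling.

```python
def compare_sessions(commits_a, minutes_a, commits_b, minutes_b):
    """Compare session A against session B (illustrative helper)."""
    return {
        # How many times more commits A produced than B.
        "commit_ratio": commits_a / commits_b,
        # Fraction of B's wall-clock time that A needed.
        "time_fraction": minutes_a / minutes_b,
        # Ratio of commits-per-minute between the two sessions.
        "density_ratio": (commits_a / minutes_a) / (commits_b / minutes_b),
    }


# V2.9 session #3 (48 commits, 40 min) vs V2.2 session #1 (16 commits, 144 min).
result = compare_sessions(48, 40, 16, 144)
print(f"{result['commit_ratio']:.1f}x commits in {result['time_fraction']:.0%} of the time")
print(f"{result['density_ratio']:.1f}x commit density")
```

The density ratio for this pair comes out at about 10.8x, which sits inside the 5-15x range quoted above.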
Version Distribution
| Version | Sessions in Top 10 | Avg Duration | Avg Density |
| --- | --- | --- | --- |
| V2.2 | 2 | 125 min | 0.11/min |
| V2.3.2 | 1 | 115 min | 0.08/min |
| V2.4.1 | 2 | 78 min | 0.07/min |
| V2.6.1 | 3 | 97 min | 0.22/min |
| V2.7 | 2 | 79 min | 0.11/min |
V2.9 does not appear in the top 10 by duration because its architecture favors shorter, denser sessions.
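The per-version averages can be recomputed from the Top 10 rows. The sketch below assumes the averages are simple means over each version's sessions; `TOP_10` and `summarize_by_version` are illustrative names, and the rows are copied from the summary table.

```python
from collections import defaultdict

# (version, duration in minutes, commits), copied from the Top 10 table.
TOP_10 = [
    ("V2.2", 144, 16), ("V2.6.1", 132, 39), ("V2.3.2", 115, 9),
    ("V2.2", 105, 11), ("V2.7", 88, 10), ("V2.6.1", 80, 10),
    ("V2.6.1", 79, 18), ("V2.4.1", 78, 6), ("V2.4.1", 77, 4),
    ("V2.7", 69, 7),
]


def summarize_by_version(rows):
    """Group sessions by version and average their duration and density."""
    grouped = defaultdict(list)
    for version, minutes, commits in rows:
        grouped[version].append((minutes, commits))
    summary = {}
    for version, sessions in grouped.items():
        n = len(sessions)
        summary[version] = {
            "sessions": n,
            "avg_duration": sum(m for m, _ in sessions) / n,
            # Mean of per-session densities, not total commits / total minutes.
            "avg_density": sum(c / m for m, c in sessions) / n,
        }
    return summary


summary = summarize_by_version(TOP_10)
for version in sorted(summary):
    s = summary[version]
    print(f"{version}: {s['sessions']} sessions, "
          f"{s['avg_duration']:.1f} min avg, {s['avg_density']:.2f}/min avg")
```

Because the durations in the table are rounded to whole minutes, recomputed densities can differ from the published ones in the last digit.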
What Enabled Long Sessions by Version
| Version | Key Enabler | Why Sessions Were Long |
| --- | --- | --- |
| V2.2 | Work unit structure | Structured but 7-agent review overhead |
| V2.3.2 | Hook improvements | Better validation but same review overhead |
| V2.4.1 | Define-and-deploy agent | Single-session completion but slow |
| V2.6.1 | Memory system | Context preserved, less human re-explanation |
| V2.7 | Memory upgrade | Better retrieval, sustained context |
Why V2.9 Sessions Are Shorter But Better
| Factor | V2.2-V2.7 | V2.9 |
| --- | --- | --- |
| Reviews per work unit | 7 agents × 2 phases = 14 | Builder only × 2 phases = 2 |
| Human approval level | Per work unit | Per Story/Epic (not per Task) |
| Context system | None → Vector memory | Graph memory |
| Commit overhead | High (review artifacts) | Low (builder only) |
Result: V2.9 completes more work in less time, with far less review overhead per work unit.
Appendix A: Methodology
A.1 Data Sources
All metrics were computed from:
1. **Git commit history**: `git log --format="%ai|%s" --since=2025-09-20`
2. **Agent review files**: `.claude/agent-reviews/*.md`
3. **Workflow version commits**: Specific commits marking version transitions
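As a rough illustration of how the git log output from item 1 might be loaded for analysis, here is a minimal parsing sketch. The function name `parse_git_log` is hypothetical, and the `-0700` offset in the sample lines is an assumption, since the excerpts in this post omit the timezone that `%ai` normally prints.

```python
from datetime import datetime


def parse_git_log(text):
    """Parse `git log --format="%ai|%s"` output into (timestamp, subject) pairs."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first "|" only, since a subject may itself contain "|".
        stamp, _, subject = line.partition("|")
        # %ai prints dates like "2025-10-12 12:50:56 -0700".
        when = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S %z")
        commits.append((when, subject.strip()))
    # git log prints newest first; sort oldest first for session detection.
    commits.sort(key=lambda c: c[0])
    return commits


# Sample lines in the same shape as the log; the UTC offset is assumed.
sample = """2025-10-12 13:01:16 -0700|docs(P4-13): Rename validate_module.py to validate_setup.py
2025-10-12 12:50:56 -0700|[Work Unit] P4-13: Create validate_setup.py for shared Module"""

for when, subject in parse_git_log(sample):
    print(when.isoformat(), subject)
```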
A.2 Session Detection Algorithm
Definition: An "autonomous session" is a sequence of commits where
each consecutive pair has a gap of less than 30 minutes.
Algorithm:
1. Sort all commits chronologically (oldest first)
2. Initialize first commit as start of Session 1
3. For each subsequent commit:
   - Calculate gap = (current_timestamp - previous_timestamp)
   - If gap < 30 minutes: add to current session
   - If gap >= 30 minutes: close current session, start new session
4. Calculate session metrics:
   - Duration = last_commit_timestamp - first_commit_timestamp
   - Commit count = number of commits in session
   - Density = commit_count / duration_in_minutes
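The steps above can be sketched in Python as follows. This is one possible reading of the algorithm, not the script actually used for the analysis. The names `detect_sessions` and `session_metrics` are hypothetical, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def detect_sessions(timestamps, threshold=timedelta(minutes=30)):
    """Group timestamps into sessions where each consecutive gap is < threshold."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < threshold:
            sessions[-1].append(ts)   # gap below threshold: same session
        else:
            sessions.append([ts])     # first commit, or gap >= threshold
    return sessions


def session_metrics(session):
    """Compute duration (minutes), commit count, and density for one session."""
    duration_min = (session[-1] - session[0]).total_seconds() / 60
    commits = len(session)
    # A single-commit session has zero duration, so density is left undefined.
    density = commits / duration_min if duration_min > 0 else None
    return {"duration_min": duration_min, "commits": commits, "density": density}


# Illustrative timestamps (not from the Archiva history): gaps of 10, 15, 45, 5 min.
base = datetime(2025, 1, 1, 9, 0)
stamps = [base + timedelta(minutes=m) for m in (0, 10, 25, 70, 75)]
for s in detect_sessions(stamps):
    print(session_metrics(s))
```

One detail the written algorithm leaves open is what to do with a session that contains a single commit. Its duration is zero, so this sketch reports the density as `None` rather than dividing by zero.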
Why 30 minutes?
The threshold was chosen because:
- Claude typically commits within 1-15 minutes when working continuously
- Gaps of 30 minutes or more almost always indicate Claude waiting for human input
- This threshold is consistent with the original Archiva autonomous sessions report
A.3 Version Attribution
Each commit was attributed to a workflow version based on its date:
| Date Range | Version | Key Characteristics |
| --- | --- | --- |
| Sep 20 - Sep 25 | Pre-workflow | No structured workflow |
| Sep 25 - Oct 26 | V2.2 | Work units, 7-agent reviews |
| Oct 26 - Oct 30 | V2.3.2 | Hook improvements |
| Oct 30 - Nov 6 | V2.4.1 | Define-and-deploy agent |
| Nov 6 - Nov 12 | V2.6.1 | Memory system added |
| Nov 12 - Nov 27 | V2.7 | Memory upgrade |
| Nov 27 - Dec 3 | V2.8 | Lean workflow (45→12 scripts) |
| Dec 3 - present | V2.9 | Graph memory, builder-only reviews |
Source: Version transition commits in git history
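Date-based attribution like this can be sketched as a sorted lookup. Adjacent ranges in the table share a boundary date, so the sketch assumes each version starts on its first date (inclusive) and ends just before the next version's start. That assumption matches the Top 10 table, where the Oct 26 session is labeled V2.3.2 and the Nov 12 session V2.7. The names `VERSION_STARTS` and `version_for` are illustrative.

```python
from bisect import bisect_right
from datetime import date

# Start dates from the attribution table. Assumption: each version runs from
# its start date (inclusive) to the next version's start date (exclusive).
VERSION_STARTS = [
    (date(2025, 9, 20), "Pre-workflow"),
    (date(2025, 9, 25), "V2.2"),
    (date(2025, 10, 26), "V2.3.2"),
    (date(2025, 10, 30), "V2.4.1"),
    (date(2025, 11, 6), "V2.6.1"),
    (date(2025, 11, 12), "V2.7"),
    (date(2025, 11, 27), "V2.8"),
    (date(2025, 12, 3), "V2.9"),
]
_STARTS = [start for start, _ in VERSION_STARTS]


def version_for(day):
    """Return the workflow version in effect on the given date."""
    index = bisect_right(_STARTS, day) - 1
    if index < 0:
        raise ValueError(f"{day} is before the analysis period")
    return VERSION_STARTS[index][1]


print(version_for(date(2025, 10, 12)))  # V2.2
print(version_for(date(2025, 10, 26)))  # V2.3.2
print(version_for(date(2025, 12, 8)))   # V2.9
```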
A.4 Commit Density Calculation
Density = Total Commits in Session / Duration in Minutes
Example (V2.2, Oct 12):
- 16 commits over 144 minutes
- Density = 16 / 144 = 0.11 commits/min
Example (V2.9, Dec 8):
- 48 commits over 40 minutes
- Density = 48 / 40 = 1.20 commits/min
A.5 Examples from Each Workflow Version
V2.2 Example (Oct 12, 2025 - Session #1)
Git log excerpt:
2025-10-12 12:50:56 | [Work Unit] P4-13: Create validate_setup.py for shared Module
2025-10-12 13:01:16 | docs(P4-13): Rename validate_module.py to validate_setup.py
↑ 10 minute gap - includes 7-agent review cycle
Agent reviews generated: 35 review files from Oct 12 session
- Pattern: `vision-2025-10-12-*.md`, `scope-2025-10-12-*.md`, etc.
- Each work unit required 7 agents × 2 phases = 14 reviews
Why 10 minutes between commits?
- 7 agents ran reviews on the work unit plan
- Claude read and synthesized review findings
- Implementation took only ~2 minutes
- Remaining time was review overhead
Citation (commit fcfe6ae):
Vision: ALIGNED - Solves real usability issue
Scope: APPROPRIATE - 2 files, clear boundaries
Design: NEEDS IMPROVEMENT - Recommends rename instead of new file
Simplicity: TOO COMPLEX (WORK NOT NEEDED) - Existing file already works
Testing: GAPS IDENTIFIED - No tests for validator
Validation: ADEQUATE - All criteria testable
Tattle-Tale: REJECT - Work not needed, just rename existing file
Claude followed Tattle-Tale's recommendation and renamed the existing file instead of creating a new one.
V2.6.1 Example (Nov 9, 2025 - Session #2)
Git log excerpt:
2025-11-09 13:20:30 | [Work Unit] WU-TEST-FIX-CATEGORIZER-001 - Fix CategorizerConfig
2025-11-09 13:21:52 | [Investigation Complete] - Bug Description Incorrect
2025-11-09 13:22:51 | [Work Unit Complete] - Investigation Complete
2025-11-09 13:28:25 | [Work Unit] WU-TEST-FIX-PDF-CONFIG-001 - Fix Type Annotations
2025-11-09 13:31:11 | [Bug Fix] WU-TEST-FIX-LLM-CLI-002 - Fix Import Paths
Density improvement: 0.30 commits/min (roughly 3x V2.2's 0.11 commits/min)
Why faster?
- Memory system (added in V2.6.1) preserved context
- Less time re-explaining project structure to Claude
- Same 7-agent reviews, but faster context gathering
Citation (V2.6.1 migration commit a8da398):
Scripts Added (6):
- background_test_runner.py (with parallel execution prevention)
- query_memory.py (semantic search over work units)
- generate_embeddings.py (index work units/reviews)
- vector_db.py (SQLite vector database)
...
New Features:
- Memory system (optional - activate with generate_embeddings.py)
V2.9 Example (Dec 8, 2025 - 40 min, 48 commits)
Git log excerpt:
2025-12-08 13:40:10 | [WU-009-01] Plan: Add _detect_compound_query() method
2025-12-08 13:40:38 | [WU-005-01] Add MAX_ALLOWED_THRESHOLD = 0.7 constant
2025-12-08 13:41:18 | [WU-005-01] Archive work unit
2025-12-08 13:41:27 | [WU-009-01] Archive work unit
2025-12-08 13:42:12 | [WU-007-02] Add --reset-patterns CLI option
2025-12-08 13:42:14 | [WU-005-02] Archive work unit
2025-12-08 13:43:20 | [WU-007-02] Archive work unit
2025-12-08 13:44:00 | [WU-006-01] Fix warning message consistency
Density: 1.20 commits/min (11x faster than V2.2)
Why dramatically faster?
1. **Builder-only reviews** (60 builder reviews on Dec 8 vs 35 multi-agent reviews on Oct 12)
2. **Graph memory** provides instant dependency awareness
3. **Sprint runner** automates work unit sequencing
4. **Three-tier hierarchy** - the human approves the Story, then Tasks run autonomously
Citation (V2.9 addition commit 214e72b):
- Add planner agent scripts (planner_core.py, planner_output.py, etc.)
- Add sprint runner integration
- Add work unit documentation from completed sprint
- Add graph memory database
- Update workflow config
Citation (V2.8 lean workflow commit bbcb7fe):
### Agent Consolidation (7 → 5)
- Vision: Right problem, project alignment (50 lines)
- Scope: 1-5 files, clear boundaries (50 lines)
- Design: Patterns + simplicity merged (50 lines)
- Testing: Strategy + validation merged (50 lines)
- Tattle-Tale: Cross-review synthesis (80 lines)
V2.9 further reduced this to builder-only review at the Task level.
A.6 Why Density Differs by Version
| Version | Reviews/WU | Commits/WU | Est. Time/WU | Density Factor |
| --- | --- | --- | --- | --- |
| V2.2 | 14 (7×2) | ~2 | ~18 min | 1.0x (baseline) |
| V2.6.1 | 14 (7×2) | ~3 | ~10 min | 1.8x |
| V2.8 | 10 (5×2) | ~2 | ~8 min | 2.3x |
| V2.9 | 2 (1×2) | ~2 | ~1.5 min | 12x |
Key factors:
1. **Review overhead reduction**
2. **Context preservation**
3. **Human approval level**
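The Density Factor column in the table above appears to be the V2.2 baseline time per work unit divided by each version's estimated time (for example 18 / 10 = 1.8 and 18 / 1.5 = 12). That reading is inferred from the numbers rather than stated outright, so treat the sketch below as an assumption; `density_factors` is a hypothetical helper.

```python
# Estimated minutes per work unit, copied from the table above.
EST_MINUTES_PER_WU = {"V2.2": 18, "V2.6.1": 10, "V2.8": 8, "V2.9": 1.5}


def density_factors(minutes_per_wu, baseline="V2.2"):
    """Speed-up of each version relative to the baseline's time per work unit."""
    base = minutes_per_wu[baseline]
    return {version: base / minutes for version, minutes in minutes_per_wu.items()}


factors = density_factors(EST_MINUTES_PER_WU)
for version, factor in factors.items():
    print(f"{version}: {factor:.2f}x")
```

Under this reading V2.8 comes out at 2.25x, which the table rounds to 2.3x.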
A.7 Verification
To reproduce this analysis:
# Get all commits with timestamps
cd /Users/user/archiva
git log --format="%ai|%s" --since=2025-09-20 > commits.txt
# Count commits by version period
git log --oneline --since=2025-09-25 --until=2025-10-26 | wc -l # V2.2
git log --oneline --since=2025-12-03 | wc -l # V2.9
# Count agent reviews by type
ls .claude/agent-reviews/ | grep "^builder-" | wc -l
ls .claude/agent-reviews/ | grep "^vision-" | wc -l
# Find version transition commits
git log --oneline --grep="V2\.[0-9]"
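The review counts can also be gathered in one pass from Python instead of one `grep` per agent. This sketch assumes review files are named with the agent first, followed by a hyphen, as in the `vision-2025-10-12-*.md` pattern quoted earlier; `count_reviews_by_agent` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_reviews_by_agent(review_dir):
    """Count .md review files by the agent name before the first hyphen."""
    counts = Counter()
    for path in Path(review_dir).glob("*.md"):
        # Assumption: names look like "<agent>-<date>-<suffix>.md". An agent
        # whose own name contains a hyphen would be cut at its first hyphen.
        agent = path.name.split("-", 1)[0]
        counts[agent] += 1
    return counts


if __name__ == "__main__":
    for agent, n in count_reviews_by_agent(".claude/agent-reviews").most_common():
        print(f"{agent}: {n}")
```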
A.8 Limitations
1. **30-minute threshold is arbitrary** - Different thresholds would yield different session counts
2. **Commit ≠ work** - Some commits are small (archival), others are large (implementation)
3. **Human involvement not directly measured** - Inferred from gaps and commit patterns
4. **Agent review time not directly timestamped** - Inferred from commit gaps
5. **Version boundaries are approximate** - Workflow changes may have been gradual
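The first limitation can be probed by recounting sessions under several thresholds. The sketch below uses made-up timestamps, not Archiva data, and a hypothetical `count_sessions` helper. It shows only the shape of such a sensitivity check.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, threshold_minutes):
    """Count sessions when a gap >= threshold_minutes starts a new session."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    threshold = timedelta(minutes=threshold_minutes)
    # Every gap at or above the threshold marks a session boundary.
    breaks = sum(1 for prev, cur in zip(ordered, ordered[1:])
                 if cur - prev >= threshold)
    return breaks + 1


# Illustrative timestamps with gaps of 10, 20, 40, and 5 minutes.
base = datetime(2025, 1, 1, 9, 0)
stamps = [base + timedelta(minutes=m) for m in (0, 10, 30, 70, 75)]
for minutes in (15, 30, 45, 60):
    print(f"threshold {minutes} min -> {count_sessions(stamps, minutes)} sessions")
```

On this toy data the session count drops from 3 to 2 to 1 as the threshold grows, which is the kind of shift a different threshold could cause in the real history.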
Generated: December 9, 2025