# Experiment Results
We ran three comparative experiments to benchmark different approaches to agentic development. Each experiment tested multiple strategies on controlled tasks.
## Experiment 1: Prompting Approaches

Task: Create a TypeScript utility function that validates and parses ISO 8601 date strings with timezone support.
### Approaches Tested

| Approach | Description |
|---|---|
| Minimal | "Create a TypeScript function to validate and parse ISO 8601 dates" |
| Context-Rich | Detailed requirements, edge cases, return type, pattern references |
| Spec-Driven + TDD | Write tests first, confirm failure, implement to pass, refactor |
### Results

| Metric | Minimal | Context-Rich | Spec+TDD |
|---|---|---|---|
| Completeness | 4/10 | 7/10 | 9/10 |
| Edge Case Handling | 2/10 | 6/10 | 9/10 |
| Test Coverage | 1/10 | 5/10 | 9/10 |
| Code Quality | 5/10 | 7/10 | 8/10 |
| Est. Token Cost | ~2,000 | ~5,000 | ~12,000 |
### Key Findings

- **Minimal** prompts produce superficially correct code that misses edge cases, has no tests, and assumes simple inputs.
- **Context-rich** prompts significantly improve completeness and quality but still leave gaps in edge cases and testing unless those are specifically requested.
- **Spec-driven + TDD** produces the most robust output. The test-first approach forces comprehensive coverage, and the spec provides unambiguous requirements. Token cost is higher, but the output is production-ready.
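To make the task concrete, here is a minimal sketch of the kind of utility being benchmarked, assuming a Node/TypeScript setup. The function name `parseIsoDate`, the regex, and the returned shape are illustrative, not output from any of the three runs:

```typescript
// Illustrative sketch of the Experiment 1 task: validate and parse an
// ISO 8601 datetime string with timezone support.
type ParsedIso = { date: Date; offsetMinutes: number };

const ISO_RE =
  /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$/;

function parseIsoDate(input: string): ParsedIso | null {
  const m = ISO_RE.exec(input);
  if (m === null) return null; // wrong shape: date-only, missing offset, etc.

  const date = new Date(input);
  if (Number.isNaN(date.getTime())) return null; // e.g. impossible calendar dates

  const tz = m[8];
  const offsetMinutes =
    tz === "Z"
      ? 0
      : (tz.startsWith("-") ? -1 : 1) *
        (Number(tz.slice(1, 3)) * 60 + Number(tz.slice(4, 6)));
  return { date, offsetMinutes };
}
```

A spec-driven run would pin down decisions this sketch only hints at: whether date-only strings are accepted, how many fractional-second digits are allowed, and whether a missing offset should default to UTC or be rejected.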
## Experiment 2: Context Management Strategies

Simulated scenario: working in a 500+ file codebase across a multi-step implementation.
### Strategies Tested

| Strategy | Description |
|---|---|
| Monolithic | Everything in one 300+ line agent configuration file |
| Hierarchical | Root (~50 lines) + domain configuration files + skills |
| Progressive + FIC | Minimal root (~30 lines) + phases with compaction + sub-agents |
### Results

| Metric | Monolithic | Hierarchical | Progressive+FIC |
|---|---|---|---|
| Instruction Adherence | 71% | 89% | 94% |
| Context Efficiency | Low (70-90% fill) | Medium (50-70%) | High (35-55%) |
| Error Rate | High | Low | Lowest |
| Setup Complexity | Low | Medium | High |
| Maintenance | High (one big file) | Medium | Low (modular) |
### Key Findings

- **Monolithic** degrades rapidly beyond 200 lines. Important rules get "lost" in the noise. Only viable for small projects.
- **Hierarchical** is the best balance for most teams. Auto-loading domain-specific configuration files provides relevant context without bloating every session.
- **Progressive + FIC** achieves the highest quality but requires discipline. The three-phase workflow with compaction between phases keeps context consistently clean.
Recommendation: Start with Hierarchical. Adopt FIC practices (sub-agents for research, compaction between phases) as your team matures. For tool-specific setup of hierarchical context structures, see the Tool Configuration Reference.
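The auto-loading behind the hierarchical strategy can be sketched as a pure helper that computes a root-first load order for a file being edited. The `AGENTS.md` file name is an illustrative convention, not a requirement of any particular tool:

```typescript
// Sketch of hierarchical context loading: for a given source file, load the
// root configuration first, then each domain configuration on the directory
// path down to that file.
function configLoadOrder(filePath: string, existing: Set<string>): string[] {
  const dirs = filePath.split("/").slice(0, -1); // drop the file name
  const order: string[] = [];
  for (let depth = 1; depth <= dirs.length; depth++) {
    const candidate = [...dirs.slice(0, depth), "AGENTS.md"].join("/");
    if (existing.has(candidate)) order.push(candidate); // root-first
  }
  return order;
}
```

Editing `repo/src/payments/charge.ts` would then load `repo/AGENTS.md` before `repo/src/payments/AGENTS.md`, so the session carries only the root rules plus the one domain file that is actually relevant.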
## Experiment 3: Multi-Agent Orchestration

Task: Implement a payment processing feature requiring research, planning, implementation, and review.
### Patterns Tested

| Pattern | Description |
|---|---|
| Single Agent | One agent handles everything |
| Hierarchical | Lead agent + research/implementation sub-agents |
| Pipeline | Sequential specialized agents with file-based handoff |
### Results

| Metric | Single | Hierarchical | Pipeline |
|---|---|---|---|
| Token Efficiency | 1x (baseline) | 0.7x (-30%) | 0.5x (-50%) |
| Output Quality | Degrades | High, consistent | Highest |
| Context Purity | Low | High | Maximum |
| Wall-Clock Time | Baseline | ~0.7x (faster) | ~1.2x (slower) |
| Coordination Overhead | None | Low | Medium |
| Information Preservation | Full | Good (some summary loss) | Moderate (lossy) |
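The Hierarchical column's wall-clock advantage comes from fanning research out to concurrent sub-agents. A sketch of that fan-out, where `runSubAgent` is an illustrative callback rather than a real SDK call:

```typescript
// Sketch of the Hierarchical pattern: a lead agent dispatches research
// topics to sub-agents in parallel; only their compact summaries (not
// their full working contexts) flow back into the lead agent's context.
async function leadAgent(
  topics: string[],
  runSubAgent: (topic: string) => Promise<string>,
): Promise<string> {
  const summaries = await Promise.all(topics.map(runSubAgent)); // concurrent
  return summaries.join("\n"); // the "Good (some summary loss)" handoff
}
```

This is also where the table's "some summary loss" comes from: the lead agent never sees what a sub-agent chose to omit from its summary.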
### Key Findings

- **Single Agent** works well for simple tasks but degrades on complex features. Context pollution from research and failed approaches reduces implementation quality.
- **Hierarchical** is the best default pattern. The 30% token savings and consistent quality justify the minimal coordination overhead. Parallel research sub-agents significantly reduce wall-clock time.
- **Pipeline** achieves the highest quality through maximum context purity but is slower due to sequential execution. Best for quality-critical implementations where correctness matters more than speed.
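The Pipeline pattern's sequential handoff can be sketched in a few lines; the `Stage` type and `run` callback are illustrative stand-ins for invoking a specialized agent, and the artifact string stands in for a handoff file on disk:

```typescript
// Sketch of the Pipeline pattern: each specialized stage starts from a
// clean context, reads only the previous stage's artifact, and writes a
// new artifact for the next stage.
type Stage = { name: string; run: (artifact: string) => string };

function runPipeline(stages: Stage[], initialTask: string): string {
  let artifact = initialTask;
  for (const stage of stages) {
    // Sequential by design: maximum context purity, higher wall-clock time.
    artifact = stage.run(artifact);
  }
  return artifact;
}
```

Because each stage sees only the previous artifact, quality depends entirely on how complete that artifact is, which is why the table rates information preservation as "Moderate (lossy)".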
## Architecture Decision Matrix

| Project Size | Task Complexity | Recommended Pattern |
|---|---|---|
| Small (under 100 files) | Simple | Single Agent |
| Small | Complex | Hierarchical |
| Medium (100-500 files) | Any | Hierarchical |
| Large (500+ files) | Simple | Hierarchical |
| Large | Complex | Pipeline or Hybrid |
| Any | Quality-critical | Pipeline |
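The matrix reduces to a small selection helper. This sketch simply transcribes the table above, using its 100- and 500-file size bands as thresholds:

```typescript
// Transcription of the decision matrix: quality-critical work always goes
// to Pipeline; otherwise size and complexity pick the pattern.
type Pattern = "Single Agent" | "Hierarchical" | "Pipeline or Hybrid" | "Pipeline";

function recommendPattern(
  fileCount: number,
  complex: boolean,
  qualityCritical: boolean,
): Pattern {
  if (qualityCritical) return "Pipeline";
  if (fileCount >= 500) return complex ? "Pipeline or Hybrid" : "Hierarchical";
  if (fileCount >= 100) return "Hierarchical"; // medium projects, any complexity
  return complex ? "Hierarchical" : "Single Agent";
}
```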
## Methodology Notes

- Experiments were conducted using AI coding agents
- Each approach was tested with the same base task and evaluated on consistent criteria
- Scores are relative comparisons, not absolute quality measurements
- Token costs are estimates based on typical patterns, not exact measurements
- Results should be validated against your specific project context and tooling