Benchmarks

This page collects measured performance data from the framework's meta-evaluation harness. Numbers come from specific runs identified by ID. Methodology and the metrics the harness optimizes against are documented below.

What the meta-evaluation harness measures

Every run writes an eval_report.json containing the following fields, recorded in EvalMetrics. The framework's optimization loop uses these as objective functions; the user-facing surfaces of the framework treat them as the primary signal for whether a change helped or hurt.

Correctness metrics

  • tests_pass: Fraction of declared acceptance criteria covered by a passing test in the generated project's pytest (or per-language equivalent) suite. Range: [0, 1].
  • architecture_violations: Dependency-rule, SOLID, and granularity violations against the generated source. Range: ≥ 0 (lower is better).
  • cross_module_dependency_violations: Cross-module references not declared in the target's EXPORTS. Range: ≥ 0.
  • http_convention_violations: HTTP convention breaches such as headers typed as the wrong shape (caught and retried automatically). Range: ≥ 0.
  • compile_errors: Per-language compile or typecheck errors against the generated source. Range: ≥ 0.
  • secret_leaks_detected: Hits from the secret-scan integration step. Range: ≥ 0.
  • sast_failed: True when SAST (bandit) flags any HIGH-severity / HIGH-confidence finding. Type: bool.

Efficiency metrics

  • total_tokens_input / total_tokens_output: Cumulative across every LLM call in the run.
  • cache_creation_input_tokens / cache_read_input_tokens: Anthropic prompt-cache visibility (separate from the framework's response cache).
  • total_wall_clock_ms: Real wall time, including parallelism.
  • parallelism_limit / peak_parallelism: Configured cap and observed peak concurrent agent calls.
  • estimated_cost_usd: Sum across all calls at the published Anthropic rates at the time of the run.

Decomposition-quality metrics

  • avg_file_line_count / max_file_line_count: Per-file size against the framework's 80-line cap.
  • classes_per_module: Distribution of class counts across modules.
  • orphan_files: Files not imported by anything.

Reliability metrics

  • agent_retries: Count of agents that had to redo work.
  • agent_hangs / llm_timeouts: Agents exceeding their timeout.
  • hallucinations: References to non-existent types or imports caught at validation.
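
A minimal sketch of reading these fields back out of a run's eval_report.json for ad-hoc analysis. It assumes the fields above appear as top-level JSON keys; the helper itself is illustrative, not part of the framework's API.

    import json
    from pathlib import Path

    def load_eval_report(run_dir: str) -> dict:
        """Read the eval_report.json that every meta-evaluation run writes."""
        return json.loads(Path(run_dir, "eval_report.json").read_text())

    report = load_eval_report("meta-evaluation-results/meta-evaluation_<id>_<timestamp>")
    # A run only counts as clean if the gating correctness metrics are zero.
    gates_clean = (report["architecture_violations"] == 0
                   and report["cross_module_dependency_violations"] == 0)
    print(report["tests_pass"], gates_clean, report["estimated_cost_usd"])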

What the framework optimizes for

The optimization loop, in priority order:

  1. tests_pass is the primary signal. A spec change that improves architectural elegance but reduces tests_pass is reverted unless the drop stays within stochastic-noise bounds (less than 2σ across replicates, per the RegressionDetector).
  2. architecture_violations and cross_module_dependency_violations must be zero. These metrics gate the run: any non-zero count marks the run as failed regardless of tests_pass.
  3. Minimization of estimated_cost_usd and total_wall_clock_ms, subject to (1) and (2) holding. Cost reductions that hurt tests_pass are reverted.
  4. Maximization of the cache-hit ratio (cache_read_input_tokens / cache_creation_input_tokens), because cache reads lower per-call cost without changing model output.

Replicate variance is tracked: before claiming a regression or improvement, every change should be validated against N ≥ 3 replicates (run via the --replicates N CLI flag) and reported as mean ± 2σ.
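
The RegressionDetector's internals are not documented on this page; the sketch below is a back-of-the-envelope version of the 2σ rule, under the assumption that it compares replicate means. The function name and signature are illustrative.

    from statistics import mean, stdev

    def is_regression(baseline: list[float], candidate: list[float]) -> bool:
        """Treat a drop as a regression only if it exceeds 2 sigma of the baseline.

        Both lists hold N >= 3 replicate values of the same metric (e.g. tests_pass).
        """
        return (mean(baseline) - mean(candidate)) > 2 * stdev(baseline)

    print(is_regression([0.88, 0.90, 0.86], [0.70, 0.72, 0.69]))  # True: outside noise
    print(is_regression([0.88, 0.90, 0.86], [0.87, 0.89, 0.85]))  # False: within 2 sigma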

Architectural Complexity Score (ACS)

Comparing runs across heterogeneous problems (P0 Calculator vs. a Twitter clone vs. a Kafka event producer) on raw cost or wall-clock time is misleading: a 100-line Calculator and a 100-line Kafka adapter are nominally the same size, but the latter requires far more architectural decisions. ACS is a complexity denominator that captures the architectural decisions made, not the amount of text produced.

ACS combines three weighted, independent dimensions and is normalized so that the P0 Calculator scores 1.0.

Dimension S — Structural complexity (~50% weight)

Pre-codegen, derived from ArchitectureSpec:

S = α₁·M + α₂·C + α₃·D + α₄·X + α₅·I

where
  M = module count
  C = class count
  D = dependency edges in the architecture graph (intra-module + cross-module)
  X = cross-module exports (sum of EXPORTS list lengths)
  I = invariant count (across all modules + classes)

defaults: α = [1.0, 0.5, 0.3, 0.4, 0.2]
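
S is a plain weighted sum over counts taken from the ArchitectureSpec. A sketch using the default α weights; the function and the example counts are illustrative.

    # Default alpha weights, in M, C, D, X, I order.
    ALPHA = (1.0, 0.5, 0.3, 0.4, 0.2)

    def structural_complexity(m: int, c: int, d: int, x: int, i: int) -> float:
        """S = a1*M + a2*C + a3*D + a4*X + a5*I."""
        return sum(w * v for w, v in zip(ALPHA, (m, c, d, x, i)))

    # Hypothetical spec: 4 modules, 10 classes, 12 edges, 6 exports, 8 invariants.
    print(structural_complexity(4, 10, 12, 6, 8))  # 16.6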

Dimension G — Codegen complexity (~30% weight)

Post-codegen, derived from parsing each generated source file's AST:

G = β₁·H + β₂·N + β₃·V + β₄·E

where
  H = total cyclomatic complexity (sum of McCabe scores)
  N = total AST node count (Halstead-like volume proxy)
  V = total vocabulary count (distinct identifiers)
  E = external SDK API surface (count of distinct symbols imported from non-stdlib)

defaults: β = [0.4, 0.001, 0.05, 0.5]
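
For Python targets the G inputs can be approximated with the standard-library ast module. This is a rough sketch (a crude McCabe proxy; sys.stdlib_module_names needs Python 3.10+), not the framework's actual analyzer.

    import ast
    import sys

    BETA = (0.4, 0.001, 0.05, 0.5)  # weights for H, N, V, E
    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp, ast.comprehension)

    def codegen_complexity(source: str) -> float:
        nodes = list(ast.walk(ast.parse(source)))
        h = 1 + sum(isinstance(n, BRANCH_NODES) for n in nodes)        # cyclomatic proxy
        n = len(nodes)                                                  # AST node count
        v = len({x.id for x in nodes if isinstance(x, ast.Name)})       # distinct identifiers
        ext = set()                                                     # non-stdlib imported symbols
        for node in nodes:
            if isinstance(node, ast.ImportFrom) and node.module:
                if node.module.split(".")[0] not in sys.stdlib_module_names:
                    ext.update(a.name for a in node.names)
            elif isinstance(node, ast.Import):
                ext.update(a.name for a in node.names
                           if a.name.split(".")[0] not in sys.stdlib_module_names)
        return sum(w * x for w, x in zip(BETA, (h, n, v, len(ext))))

    # Sum codegen_complexity over every generated source file to get G for a run.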

Dimension P — Constraint complexity (~20% weight)

Pre-codegen, derived from ProblemSpec:

P = γ₁·A + γ₂·CC + γ₃·DC + γ₄·IC

where
  A  = acceptance criteria count
  CC = cross-service contracts (produces + consumes)
  DC = data classifications declared
  IC = infrastructure_choices count

defaults: γ = [1.0, 2.0, 1.0, 1.5]
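
P follows the same weighted-sum pattern, over counts taken from the ProblemSpec. A short illustrative sketch with the default γ weights:

    # Default gamma weights, in A, CC, DC, IC order.
    GAMMA = (1.0, 2.0, 1.0, 1.5)

    def constraint_complexity(a: int, cc: int, dc: int, ic: int) -> float:
        """P = g1*A + g2*CC + g3*DC + g4*IC."""
        return sum(w * v for w, v in zip(GAMMA, (a, cc, dc, ic)))

    # Hypothetical spec: 12 acceptance criteria, 2 contracts, 3 classifications, 4 infra choices.
    print(constraint_complexity(12, 2, 3, 4))  # 25.0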

Composite

ACS = 0.5·S + 0.3·G + 0.2·P
ACS_normalized = ACS / ACS(P0 Calculator)

P0 Calculator → 1.0 by construction. All other problems score relative to that baseline.
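
Combining the three dimensions is then a fixed-weight sum followed by division by the P0 Calculator's own score. A sketch; acs_p0 has to come from running the same formulas against the P0 Calculator spec.

    def acs(s: float, g: float, p: float) -> float:
        """Composite score: 0.5*S + 0.3*G + 0.2*P."""
        return 0.5 * s + 0.3 * g + 0.2 * p

    def acs_normalized(s: float, g: float, p: float, acs_p0: float) -> float:
        """Scale so the P0 Calculator baseline lands at exactly 1.0."""
        return acs(s, g, p) / acs_p0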

Normalization metrics derived from ACS

  • ACS-cost = estimated_cost_usd / ACS: dollars per unit of architectural complexity.
  • ACS-velocity = ACS / (total_wall_clock_ms / 1000): complexity units per wall-clock second.
  • ACS-tests-pass = tests_pass / ACS: pass rate normalized for problem difficulty.
  • ACS-tokens = (total_tokens_input + total_tokens_output) / ACS: tokens per complexity unit.
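
Each derived metric is a plain ratio over fields already present in eval_report.json. A sketch, reusing the report dict shown earlier:

    def acs_derived_metrics(report: dict, acs: float) -> dict:
        """Ratios from the table above; `report` is a parsed eval_report.json."""
        return {
            "acs_cost": report["estimated_cost_usd"] / acs,
            "acs_velocity": acs / (report["total_wall_clock_ms"] / 1000),
            "acs_tests_pass": report["tests_pass"] / acs,
            "acs_tokens": (report["total_tokens_input"]
                           + report["total_tokens_output"]) / acs,
        }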

Full methodology, including per-problem ACS estimates and the calibration procedure, is in BENCHMARK_METHODOLOGY.md in the framework repo.

Cross-language event-pipeline sweep

Single-replicate runs across all six supported languages, producer + persister services, performed on 2026-04-28 against the K-milestone exit-criterion benchmark. Each row corresponds to a directory under meta-evaluation-results/meta-evaluation_<id>_<timestamp>/ containing the run manifest, eval report, and generated source.

Run Language Service tests_pass Cost (USD) Tokens Wall ms
314 Python Producer 0.40 0.328 112,465 17,979
315 Python Persister 0.33 0.267 106,550 13,669
316 Java Producer 0.00 0.274 73,072 34,517
317 Java Persister 0.00 0.358 104,698 30,016
318 Go Producer 0.00 0.250 75,469 17,204
319 Go Persister 0.00 0.214 69,351 15,355
320 Rust Producer 0.00 0.279 85,109 27,007
321 Rust Persister 0.00 0.245 76,286 17,809
322 JavaScript Producer 0.00 0.302 87,085 8,700
323 JavaScript Persister 0.00 0.237 72,460 11,616
324 TypeScript Producer 0.00 0.261 78,534 8,996
325 TypeScript Persister 0.00 0.309 100,900 13,731

The Java, Go, Rust, JavaScript, and TypeScript rows report tests_pass = 0.00 because their per-language test runners return zero, rather than crash the pipeline, when the required toolchain is unavailable in the run environment. Build verification with mvn -q compile did succeed against the Java producer in run 316; other compile-only verifications were not part of this sweep.

Reproducing a run

Each run directory contains:

  • manifest.json — model identifiers per tier, framework SHA, spec-library hashes, replicate ID, platform / Python version.
  • eval_report.json — full EvalMetrics.
  • SUMMARY.md — per-tier breakdown.
  • architecture.squib — the architecture the architect emitted.
  • The generated project source under src/, tests/, plus per-language manifest files.

To re-run a problem from a manifest, install the framework at the recorded SHA, set ANTHROPIC_API_KEY, and invoke squeaky generate with the recorded --problem-file and --replicates flags. Every LLM call routes through a content-addressed disk cache keyed on SHA-256(model + system_prompt + user_prompt + temperature + replicate_id). On a populated cache, identical inputs return cached outputs byte-for-byte. On a cold cache, the Anthropic API may produce different outputs across runs at temperature=0; the cache provides replay stability, not first-call determinism.
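
The cache key can be recomputed outside the framework. The sketch below hashes the five inputs named above; the exact field serialization inside the real cache is an assumption, as are the example argument values.

    import hashlib

    def cache_key(model: str, system_prompt: str, user_prompt: str,
                  temperature: float, replicate_id: int) -> str:
        """SHA-256 over the call-defining inputs listed above."""
        payload = "\x00".join(
            [model, system_prompt, user_prompt, str(temperature), str(replicate_id)])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # Identical inputs -> identical key -> byte-for-byte replay from the disk cache.
    print(cache_key("example-model-id", "system prompt", "user prompt", 0.0, 1))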

See also

  • BENCHMARK_METHODOLOGY.md in the framework repository — full ACS methodology, calibration procedure, and per-problem score estimates.
  • The public roadmap for milestone status.