Benchmarks¶
This page collects measured performance data from the framework's meta-evaluation harness. Numbers come from specific runs identified by ID. Methodology and the metrics the harness optimizes against are documented below.
What the meta-evaluation harness measures¶
Every run writes an `eval_report.json` containing the following fields, recorded in `EvalMetrics`. The framework's optimization loop uses these as objective functions; the user-facing surfaces of the framework treat them as the primary signal for whether a change helped or hurt.
Correctness metrics¶
| Field | Meaning | Range |
|---|---|---|
| `tests_pass` | Fraction of declared acceptance criteria covered by a passing test in the generated project's pytest (or per-language equivalent) suite. | [0, 1] |
| `architecture_violations` | Dependency-rule, SOLID, and granularity violations against the generated source. | ≥ 0 (lower better) |
| `cross_module_dependency_violations` | Cross-module references not declared in the target's `EXPORTS`. | ≥ 0 |
| `http_convention_violations` | HTTP headers typed as the wrong shape, etc. (caught and retried automatically). | ≥ 0 |
| `compile_errors` | Per-language compile or typecheck errors against the generated source. | ≥ 0 |
| `secret_leaks_detected` | Hits from the secret-scan integration step. | ≥ 0 |
| `sast_failed` | True when SAST (bandit) flags any HIGH-severity/HIGH-confidence finding. | bool |
Efficiency metrics¶
| Field | Meaning |
|---|---|
| `total_tokens_input` / `total_tokens_output` | Cumulative across every LLM call in the run. |
| `cache_creation_input_tokens` / `cache_read_input_tokens` | Anthropic prompt-cache visibility (separate from the framework's response cache). |
| `total_wall_clock_ms` | Real wall time, including parallelism. |
| `parallelism_limit` / `peak_parallelism` | Configured cap and observed peak concurrent agent calls. |
| `estimated_cost_usd` | Sum across all calls at the published Anthropic rates at the time of the run. |
Decomposition-quality metrics¶
| Field | Meaning |
|---|---|
| `avg_file_line_count` / `max_file_line_count` | Per-file size against the framework's 80-line cap. |
| `classes_per_module` | Distribution of class counts across modules. |
| `orphan_files` | Files not imported by anything. |
Reliability metrics¶
| Field | Meaning |
|---|---|
| `agent_retries` | Count of agents that had to redo work. |
| `agent_hangs` / `llm_timeouts` | Agents exceeding their timeout. |
| `hallucinations` | References to non-existent types or imports caught at validation. |
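For orientation, here is a minimal sketch of consuming one of these reports. The field names come from the tables above; the run-directory path is hypothetical, and the gating rule is the one described under "What the framework optimizes for" below.

```python
import json
from pathlib import Path

def load_metrics(run_dir: str) -> dict:
    """Read the EvalMetrics payload a run writes to eval_report.json."""
    return json.loads(Path(run_dir, "eval_report.json").read_text())

def passes_gates(m: dict) -> bool:
    """Gating rule from the optimization loop below: any architecture
    or cross-module violation fails the run regardless of tests_pass."""
    return (m["architecture_violations"] == 0
            and m["cross_module_dependency_violations"] == 0)

# Hypothetical run directory following this page's naming scheme.
m = load_metrics("meta-evaluation-results/meta-evaluation_314_20260428T120000")
print(m["tests_pass"], m["estimated_cost_usd"], passes_gates(m))
```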
What the framework optimizes for¶
The optimization loop, in priority order:
1. `tests_pass` is the primary signal. A spec change that improves architectural elegance but reduces `tests_pass` is reverted unless the regression falls within stochastic-noise bounds (the `RegressionDetector` treats a change as real only when it reaches ≥ 2σ across replicates).
2. `architecture_violations` and `cross_module_dependency_violations` held at zero. These are gating: a run with non-zero violations is treated as a failed run regardless of `tests_pass`.
3. `estimated_cost_usd` and `total_wall_clock_ms` minimized, subject to (1) and (2) holding. Cost reductions that hurt `tests_pass` are reverted.
4. Cache-hit ratio (`cache_read_input_tokens / cache_creation_input_tokens`) maximized, because cache reads reduce per-call cost without changing model output.
Replicate variance is tracked: every change should be validated against N ≥ 3 replicates (via the `--replicates N` CLI flag), with mean ± 2σ reported, before claiming a regression or improvement.
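The `RegressionDetector`'s interface isn't reproduced here, but a minimal sketch of the 2σ decision it applies might look like this (replicate values are hypothetical):

```python
from statistics import mean, stdev

def is_regression(baseline: list[float], candidate: list[float]) -> bool:
    """A drop in tests_pass counts as real only when the candidate mean
    sits at least 2 baseline standard deviations below the baseline
    mean; smaller drops are treated as stochastic noise."""
    assert len(baseline) >= 3 and len(candidate) >= 3  # N >= 3 replicates
    return mean(candidate) <= mean(baseline) - 2 * stdev(baseline)

# Three replicates each (hypothetical numbers).
baseline = [0.82, 0.78, 0.80]   # mean 0.80, 2σ = 0.04
candidate = [0.74, 0.71, 0.73]  # mean ≈ 0.73, below 0.76 -> regression
print(is_regression(baseline, candidate))  # True
```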
Architectural Complexity Score (ACS)¶
Comparing runs across heterogeneous problems (P0 Calculator vs. Twitter clone vs. Kafka event producer) on raw cost or wall-clock time is misleading: a 100-line Calculator and a 100-line Kafka adapter are nominally equal, but the latter requires far more architectural decisions. ACS is a complexity denominator that captures architectural decisions made, not text size produced.
ACS combines three independent dimensions under fixed weights and is normalized so that the P0 Calculator baseline scores 1.0.
Dimension S — Structural complexity (~50% weight)¶
Pre-codegen, derived from the `ArchitectureSpec`:

S = α₁·M + α₂·C + α₃·D + α₄·X + α₅·I

where

- M = module count
- C = class count
- D = dependency edges in the architecture graph (intra-module + cross-module)
- X = cross-module exports (sum of `EXPORTS` list lengths)
- I = invariant count (across all modules and classes)

Defaults: α = [1.0, 0.5, 0.3, 0.4, 0.2].
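A sketch of the S computation. The `ArchitectureSpec` attribute names (`modules`, `classes`, `dependency_edges`, `exports`, `invariants`) are assumptions; only the counts matter.

```python
ALPHA = (1.0, 0.5, 0.3, 0.4, 0.2)

def structural_complexity(spec) -> float:
    """S = α₁·M + α₂·C + α₃·D + α₄·X + α₅·I."""
    counts = (
        len(spec.modules),                             # M: module count
        sum(len(m.classes) for m in spec.modules),     # C: class count
        len(spec.dependency_edges),                    # D: intra- + cross-module edges
        sum(len(m.exports) for m in spec.modules),     # X: EXPORTS list lengths
        sum(len(m.invariants) for m in spec.modules),  # I: invariants across modules
    )
    return sum(a * v for a, v in zip(ALPHA, counts))
```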
Dimension G — Codegen complexity (~30% weight)¶
Post-codegen, derived from parsing each generated source file's AST:
G = β₁·H + β₂·N + β₃·V + β₄·E

where

- H = total cyclomatic complexity (sum of McCabe scores)
- N = total AST node count (Halstead-like volume proxy)
- V = total vocabulary count (distinct identifiers)
- E = external SDK API surface (count of distinct symbols imported from non-stdlib packages)

Defaults: β = [0.4, 0.001, 0.05, 0.5].
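All four G counts fall out of a single AST walk. A rough Python sketch, with McCabe approximated by counting branch nodes and the import handling simplified; the harness's exact counters aren't specified here.

```python
import ast
import sys  # sys.stdlib_module_names requires Python >= 3.10

BETA = (0.4, 0.001, 0.05, 0.5)

# Rough McCabe proxy: each of these node types adds one decision point.
_DECISIONS = (ast.If, ast.For, ast.While, ast.IfExp,
              ast.ExceptHandler, ast.BoolOp, ast.comprehension)

def codegen_complexity(source: str) -> float:
    """G = β₁·H + β₂·N + β₃·V + β₄·E for one Python source file."""
    nodes = list(ast.walk(ast.parse(source)))
    h = 1 + sum(isinstance(n, _DECISIONS) for n in nodes)      # H: cyclomatic total
    n_count = len(nodes)                                       # N: AST node count
    v = len({x.id for x in nodes if isinstance(x, ast.Name)})  # V: distinct identifiers
    ext: set[str] = set()                                      # E: non-stdlib symbols
    for node in nodes:
        if isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] not in sys.stdlib_module_names:
                ext.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            ext.update(alias.name for alias in node.names
                       if alias.name.split(".")[0] not in sys.stdlib_module_names)
    return sum(b * x for b, x in zip(BETA, (h, n_count, v, len(ext))))
```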
Dimension P — Constraint complexity (~20% weight)¶
Pre-codegen, derived from the `ProblemSpec`:

P = γ₁·A + γ₂·CC + γ₃·DC + γ₄·IC

where

- A = acceptance criteria count
- CC = cross-service contracts (produces + consumes)
- DC = data classifications declared
- IC = `infrastructure_choices` count

Defaults: γ = [1.0, 2.0, 1.0, 1.5].
Composite¶
ACS = 0.5·S + 0.3·G + 0.2·P
ACS_normalized = ACS / ACS(P0 Calculator)
P0 Calculator → 1.0 by construction. All other problems score relative to that baseline.
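A worked example with hypothetical raw dimension scores (the real per-problem values live in BENCHMARK_METHODOLOGY.md):

```python
def acs(s: float, g: float, p: float) -> float:
    """Composite: ACS = 0.5·S + 0.3·G + 0.2·P."""
    return 0.5 * s + 0.3 * g + 0.2 * p

baseline = acs(s=10.0, g=20.0, p=5.0)     # P0 Calculator -> 12.0
candidate = acs(s=85.0, g=140.0, p=31.0)  # e.g. a Kafka producer -> 90.7
print(candidate / baseline)               # ACS_normalized ≈ 7.56
```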
Normalization metrics derived from ACS¶
| Metric | Formula | What it measures |
|---|---|---|
| ACS-cost | `estimated_cost_usd / ACS` | $ per unit of architectural complexity |
| ACS-velocity | `ACS / (total_wall_clock_ms / 1000)` | Complexity-units per wall-second |
| ACS-tests-pass | `tests_pass / ACS` | Pass rate normalized for problem difficulty |
| ACS-tokens | `(total_tokens_input + total_tokens_output) / ACS` | Tokens per complexity unit |
Full methodology, including per-problem ACS estimates and the calibration procedure, is in `BENCHMARK_METHODOLOGY.md` in the framework repo.
Cross-language event-pipeline sweep¶
Single-replicate runs across all six supported languages (producer and persister services), performed on 2026-04-28 against the K-milestone exit-criterion benchmark. Each row corresponds to a directory under `meta-evaluation-results/meta-evaluation_<id>_<timestamp>/` containing the run manifest, eval report, and generated source.
| Run | Language | Service | `tests_pass` | Cost (USD) | Tokens | Wall clock (ms) |
|---|---|---|---|---|---|---|
| 314 | Python | Producer | 0.40 | 0.328 | 112,465 | 17,979 |
| 315 | Python | Persister | 0.33 | 0.267 | 106,550 | 13,669 |
| 316 | Java | Producer | 0.00 | 0.274 | 73,072 | 34,517 |
| 317 | Java | Persister | 0.00 | 0.358 | 104,698 | 30,016 |
| 318 | Go | Producer | 0.00 | 0.250 | 75,469 | 17,204 |
| 319 | Go | Persister | 0.00 | 0.214 | 69,351 | 15,355 |
| 320 | Rust | Producer | 0.00 | 0.279 | 85,109 | 27,007 |
| 321 | Rust | Persister | 0.00 | 0.245 | 76,286 | 17,809 |
| 322 | JavaScript | Producer | 0.00 | 0.302 | 87,085 | 8,700 |
| 323 | JavaScript | Persister | 0.00 | 0.237 | 72,460 | 11,616 |
| 324 | TypeScript | Producer | 0.00 | 0.261 | 78,534 | 8,996 |
| 325 | TypeScript | Persister | 0.00 | 0.309 | 100,900 | 13,731 |
The Java/Go/Rust/JS/TS rows report `tests_pass = 0.00` because their per-language test runners return zero when the toolchain is unavailable in the run environment rather than crashing the pipeline. Build verification with `mvn -q compile` succeeded against the Java producer in run 316; other compile-only verifications were not part of this sweep.
Reproducing a run¶
Each run directory contains:
- `manifest.json` — model identifiers per tier, framework SHA, spec-library hashes, replicate ID, and platform / Python version.
- `eval_report.json` — the full `EvalMetrics`.
- `SUMMARY.md` — per-tier breakdown.
- `architecture.squib` — the architecture the architect emitted.
- The generated project source under `src/` and `tests/`, plus per-language manifest files.
To re-run a problem from a manifest, install the framework at the recorded SHA, set `ANTHROPIC_API_KEY`, and invoke `squeaky generate` with the recorded `--problem-file` and `--replicates` flags. Every LLM call routes through a content-addressed disk cache keyed on `SHA-256(model + system_prompt + user_prompt + temperature + replicate_id)`. On a populated cache, identical inputs return cached outputs byte-for-byte. On a cold cache, the Anthropic API may produce different outputs across runs even at `temperature=0`; the cache provides replay stability, not first-call determinism.
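A minimal sketch of that keying scheme; the field delimiter and serialization are assumptions, since the source specifies only which five inputs are hashed.

```python
import hashlib

def cache_key(model: str, system_prompt: str, user_prompt: str,
              temperature: float, replicate_id: int) -> str:
    """Content-addressed cache key: SHA-256 over the five call inputs.
    The unit-separator delimiter is an assumption for illustration."""
    payload = "\x1f".join(
        [model, system_prompt, user_prompt, str(temperature), str(replicate_id)]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because `replicate_id` participates in the key, replicates stay distinct on a warm cache rather than all replaying the same cached response.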
See also¶
- `BENCHMARK_METHODOLOGY.md` in the framework repository — full ACS methodology, calibration procedure, and per-problem score estimates.
- The public roadmap for milestone status.