Benchmarks

This page collects measured performance data from the framework's meta-evaluation harness. Numbers come from specific runs identified by ID. Methodology and the metrics the harness optimizes against are documented below.

What the meta-evaluation harness measures

Every run writes an eval_report.json containing the following fields, recorded in EvalMetrics. The framework's optimization loop uses these as objective functions; the user-facing surfaces of the framework treat them as the primary signal for whether a change helped or hurt.

Correctness metrics

  • tests_pass: Fraction of declared acceptance criteria covered by a passing test in the generated project's pytest (or per-language equivalent) suite. Range: [0, 1].
  • architecture_violations: Dependency-rule, SOLID, and granularity violations against the generated source. Range: ≥ 0 (lower is better).
  • cross_module_dependency_violations: Cross-module references not declared in the target's EXPORTS. Range: ≥ 0.
  • http_convention_violations: HTTP convention breaches such as headers typed as the wrong shape (caught and retried automatically). Range: ≥ 0.
  • compile_errors: Per-language compile or typecheck errors against the generated source. Range: ≥ 0.
  • secret_leaks_detected: Hits from the secret-scan integration step. Range: ≥ 0.
  • sast_failed: True when SAST (bandit) flags any HIGH-severity / HIGH-confidence finding. Type: bool.

Efficiency metrics

  • total_tokens_input / total_tokens_output: Cumulative across every LLM call in the run.
  • cache_creation_input_tokens / cache_read_input_tokens: Anthropic prompt-cache visibility (separate from the framework's response cache).
  • total_wall_clock_ms: Real wall time, including parallelism.
  • parallelism_limit / peak_parallelism: Configured cap and observed peak concurrent agent calls.
  • estimated_cost_usd: Sum across all calls at the published Anthropic rates at the time of the run.

Decomposition-quality metrics

  • avg_file_line_count / max_file_line_count: Per-file size against the framework's 80-line cap.
  • classes_per_module: Distribution of class counts across modules.
  • orphan_files: Files not imported by anything.

Reliability metrics

  • agent_retries: Count of agents that had to redo work.
  • agent_hangs / llm_timeouts: Agents exceeding their timeout.
  • hallucinations: References to non-existent types or imports caught at validation.
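
A minimal sketch of reading these fields back out of a run's eval_report.json for ad-hoc analysis. It assumes the fields above appear as top-level JSON keys; the helper itself is illustrative, not part of the framework's API.

    import json
    from pathlib import Path

    def load_eval_report(run_dir: str) -> dict:
        """Read the eval_report.json that every meta-evaluation run writes."""
        return json.loads(Path(run_dir, "eval_report.json").read_text())

    report = load_eval_report("meta-evaluation-results/meta-evaluation_<id>_<timestamp>")
    # A run only counts as clean if the gating correctness metrics are zero.
    gates_clean = (report["architecture_violations"] == 0
                   and report["cross_module_dependency_violations"] == 0)
    print(report["tests_pass"], gates_clean, report["estimated_cost_usd"])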

What the framework optimizes for

The optimization loop, in priority order:

  1. tests_pass is the primary signal. A spec change that improves architectural elegance but reduces tests_pass is reverted unless the drop stays within stochastic-noise bounds (less than 2σ across replicates, per the RegressionDetector).
  2. architecture_violations and cross_module_dependency_violations must be zero. These metrics gate the run: any non-zero count marks the run as failed regardless of tests_pass.
  3. Minimization of estimated_cost_usd and total_wall_clock_ms, subject to (1) and (2) holding. Cost reductions that hurt tests_pass are reverted.
  4. Maximization of the cache-hit ratio (cache_read_input_tokens / cache_creation_input_tokens), because cache reads lower per-call cost without changing model output.

Replicate variance is tracked: before claiming a regression or improvement, every change should be validated against N ≥ 3 replicates (run via the --replicates N CLI flag) and reported as mean ± 2σ.
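
The RegressionDetector's internals are not documented on this page; the sketch below is a back-of-the-envelope version of the 2σ rule, under the assumption that it compares replicate means. The function name and signature are illustrative.

    from statistics import mean, stdev

    def is_regression(baseline: list[float], candidate: list[float]) -> bool:
        """Treat a drop as a regression only if it exceeds 2 sigma of the baseline.

        Both lists hold N >= 3 replicate values of the same metric (e.g. tests_pass).
        """
        return (mean(baseline) - mean(candidate)) > 2 * stdev(baseline)

    print(is_regression([0.88, 0.90, 0.86], [0.70, 0.72, 0.69]))  # True: outside noise
    print(is_regression([0.88, 0.90, 0.86], [0.87, 0.89, 0.85]))  # False: within 2 sigma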

Architectural Complexity Score (ACS)

Comparing runs across heterogeneous problems (P0 Calculator vs. a Twitter clone vs. a Kafka event producer) on raw cost or wall-clock time is misleading: a 100-line Calculator and a 100-line Kafka adapter are nominally the same size, but the latter requires far more architectural decisions. ACS is a complexity denominator that captures the architectural decisions made, not the amount of text produced.

ACS combines three weighted, independent dimensions and is normalized so that the P0 Calculator scores 1.0.

Dimension S — Structural complexity (~50% weight)

Pre-codegen, derived from ArchitectureSpec:

S = α₁·M + α₂·C + α₃·D + α₄·X + α₅·I

where
  M = module count
  C = class count
  D = dependency edges in the architecture graph (intra-module + cross-module)
  X = cross-module exports (sum of EXPORTS list lengths)
  I = invariant count (across all modules + classes)

defaults: α = [1.0, 0.5, 0.3, 0.4, 0.2]
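
S is a plain weighted sum over counts taken from the ArchitectureSpec. A sketch using the default α weights; the function and the example counts are illustrative.

    # Default alpha weights, in M, C, D, X, I order.
    ALPHA = (1.0, 0.5, 0.3, 0.4, 0.2)

    def structural_complexity(m: int, c: int, d: int, x: int, i: int) -> float:
        """S = a1*M + a2*C + a3*D + a4*X + a5*I."""
        return sum(w * v for w, v in zip(ALPHA, (m, c, d, x, i)))

    # Hypothetical spec: 4 modules, 10 classes, 12 edges, 6 exports, 8 invariants.
    print(structural_complexity(4, 10, 12, 6, 8))  # 16.6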

Dimension G — Codegen complexity (~30% weight)

Post-codegen, derived from parsing each generated source file's AST:

G = β₁·H + β₂·N + β₃·V + β₄·E

where
  H = total cyclomatic complexity (sum of McCabe scores)
  N = total AST node count (Halstead-like volume proxy)
  V = total vocabulary count (distinct identifiers)
  E = external SDK API surface (count of distinct symbols imported from non-stdlib)

defaults: β = [0.4, 0.001, 0.05, 0.5]
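
For Python targets the G inputs can be approximated with the standard-library ast module. This is a rough sketch (a crude McCabe proxy; sys.stdlib_module_names needs Python 3.10+), not the framework's actual analyzer.

    import ast
    import sys

    BETA = (0.4, 0.001, 0.05, 0.5)  # weights for H, N, V, E
    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp, ast.comprehension)

    def codegen_complexity(source: str) -> float:
        nodes = list(ast.walk(ast.parse(source)))
        h = 1 + sum(isinstance(n, BRANCH_NODES) for n in nodes)        # cyclomatic proxy
        n = len(nodes)                                                  # AST node count
        v = len({x.id for x in nodes if isinstance(x, ast.Name)})       # distinct identifiers
        ext = set()                                                     # non-stdlib imported symbols
        for node in nodes:
            if isinstance(node, ast.ImportFrom) and node.module:
                if node.module.split(".")[0] not in sys.stdlib_module_names:
                    ext.update(a.name for a in node.names)
            elif isinstance(node, ast.Import):
                ext.update(a.name for a in node.names
                           if a.name.split(".")[0] not in sys.stdlib_module_names)
        return sum(w * x for w, x in zip(BETA, (h, n, v, len(ext))))

    # Sum codegen_complexity over every generated source file to get G for a run.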

Dimension P — Constraint complexity (~20% weight)

Pre-codegen, derived from ProblemSpec:

P = γ₁·A + γ₂·CC + γ₃·DC + γ₄·IC

where
  A  = acceptance criteria count
  CC = cross-service contracts (produces + consumes)
  DC = data classifications declared
  IC = infrastructure_choices count

defaults: γ = [1.0, 2.0, 1.0, 1.5]
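
P follows the same weighted-sum pattern, over counts taken from the ProblemSpec. A short illustrative sketch with the default γ weights:

    # Default gamma weights, in A, CC, DC, IC order.
    GAMMA = (1.0, 2.0, 1.0, 1.5)

    def constraint_complexity(a: int, cc: int, dc: int, ic: int) -> float:
        """P = g1*A + g2*CC + g3*DC + g4*IC."""
        return sum(w * v for w, v in zip(GAMMA, (a, cc, dc, ic)))

    # Hypothetical spec: 12 acceptance criteria, 2 contracts, 3 classifications, 4 infra choices.
    print(constraint_complexity(12, 2, 3, 4))  # 25.0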

Composite

ACS = 0.5·S + 0.3·G + 0.2·P
ACS_normalized = ACS / ACS(P0 Calculator)

P0 Calculator → 1.0 by construction. All other problems score relative to that baseline.
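
Combining the three dimensions is then a fixed-weight sum followed by division by the P0 Calculator's own score. A sketch; acs_p0 has to come from running the same formulas against the P0 Calculator spec.

    def acs(s: float, g: float, p: float) -> float:
        """Composite score: 0.5*S + 0.3*G + 0.2*P."""
        return 0.5 * s + 0.3 * g + 0.2 * p

    def acs_normalized(s: float, g: float, p: float, acs_p0: float) -> float:
        """Scale so the P0 Calculator baseline lands at exactly 1.0."""
        return acs(s, g, p) / acs_p0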

Normalization metrics derived from ACS

  • ACS-cost = estimated_cost_usd / ACS: dollars per unit of architectural complexity.
  • ACS-velocity = ACS / (total_wall_clock_ms / 1000): complexity units per wall-clock second.
  • ACS-tests-pass = tests_pass / ACS: pass rate normalized for problem difficulty.
  • ACS-tokens = (total_tokens_input + total_tokens_output) / ACS: tokens per complexity unit.
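
Each derived metric is a plain ratio over fields already present in eval_report.json. A sketch, reusing the report dict shown earlier:

    def acs_derived_metrics(report: dict, acs: float) -> dict:
        """Ratios from the table above; `report` is a parsed eval_report.json."""
        return {
            "acs_cost": report["estimated_cost_usd"] / acs,
            "acs_velocity": acs / (report["total_wall_clock_ms"] / 1000),
            "acs_tests_pass": report["tests_pass"] / acs,
            "acs_tokens": (report["total_tokens_input"]
                           + report["total_tokens_output"]) / acs,
        }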

Full methodology, including per-problem ACS estimates and the calibration procedure, is in BENCHMARK_METHODOLOGY.md in the framework repo.

Cross-language event-pipeline sweep

Single-replicate runs across all six supported languages, producer + persister services, performed on 2026-04-28 against the K-milestone exit-criterion benchmark. Each row corresponds to a directory under meta-evaluation-results/meta-evaluation_<id>_<timestamp>/ containing the run manifest, eval report, and generated source.

Run Language Service tests_pass Cost (USD) Tokens Wall ms
314 Python Producer 0.40 0.328 112,465 17,979
315 Python Persister 0.33 0.267 106,550 13,669
316 Java Producer 0.00 0.274 73,072 34,517
317 Java Persister 0.00 0.358 104,698 30,016
318 Go Producer 0.00 0.250 75,469 17,204
319 Go Persister 0.00 0.214 69,351 15,355
320 Rust Producer 0.00 0.279 85,109 27,007
321 Rust Persister 0.00 0.245 76,286 17,809
322 JavaScript Producer 0.00 0.302 87,085 8,700
323 JavaScript Persister 0.00 0.237 72,460 11,616
324 TypeScript Producer 0.00 0.261 78,534 8,996
325 TypeScript Persister 0.00 0.309 100,900 13,731

The Java, Go, Rust, JavaScript, and TypeScript rows report tests_pass = 0.00 because their per-language test runners return zero, rather than crash the pipeline, when the required toolchain is unavailable in the run environment. Build verification with mvn -q compile did succeed against the Java producer in run 316; other compile-only verifications were not part of this sweep.

Reproducing a run

Each run directory contains:

  • manifest.json — model identifiers per tier, framework SHA, spec-library hashes, replicate ID, platform / Python version.
  • eval_report.json — full EvalMetrics.
  • SUMMARY.md — per-tier breakdown.
  • architecture.squib — the architecture the architect emitted.
  • The generated project source under src/, tests/, plus per-language manifest files.

To re-run a problem from a manifest, install the framework at the recorded SHA, set ANTHROPIC_API_KEY, and invoke squeaky generate with the recorded --problem-file and --replicates flags. Every LLM call routes through a content-addressed disk cache keyed on SHA-256(model + system_prompt + user_prompt + temperature + replicate_id). On a populated cache, identical inputs return cached outputs byte-for-byte. On a cold cache, the Anthropic API may produce different outputs across runs at temperature=0; the cache provides replay stability, not first-call determinism.
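
The cache key can be recomputed outside the framework. The sketch below hashes the five inputs named above; the exact field serialization inside the real cache is an assumption, as are the example argument values.

    import hashlib

    def cache_key(model: str, system_prompt: str, user_prompt: str,
                  temperature: float, replicate_id: int) -> str:
        """SHA-256 over the call-defining inputs listed above."""
        payload = "\x00".join(
            [model, system_prompt, user_prompt, str(temperature), str(replicate_id)])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # Identical inputs -> identical key -> byte-for-byte replay from the disk cache.
    print(cache_key("example-model-id", "system prompt", "user prompt", 0.0, 1))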

See also

  • BENCHMARK_METHODOLOGY.md in the framework repository — full ACS methodology, calibration procedure, and per-problem score estimates.
  • The public roadmap for milestone status.