Verify a run

A successful generation writes three artifacts you should check before trusting the output: eval_report.json (the metrics), SUMMARY.md (the cross-problem report), and architecture.squib (what the architect actually decided).
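
As a quick first pass, confirm all three artifacts actually exist before reading any of them. A minimal sketch, assuming the run writes its artifacts into a single output directory; the runs/latest path is an illustrative assumption, not a documented default:

```python
from pathlib import Path

# Hypothetical run directory -- substitute wherever your run wrote its output.
run_dir = Path("runs/latest")

for name in ("eval_report.json", "SUMMARY.md", "architecture.squib"):
    artifact = run_dir / name
    print(f"{name}: {'ok' if artifact.is_file() else 'MISSING'}")
```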

eval_report.json: the metrics bundle

The single source of truth for what happened. Key fields:

| Field | Meaning |
| --- | --- |
| tests_pass | Fraction of declared acceptance criteria covered by passing tests, in [0, 1]. |
| architecture_violations | Dependency-rule, SOLID, and granularity violations against the generated code. |
| cross_module_dependency_violations | Cross-module dependencies not declared in the target's EXPORTS. |
| http_convention_violations | HTTP headers typed as the wrong shape (caught and retried automatically). |
| compile_errors | Per-language compile/typecheck errors against the generated source. |
| total_tokens_input / total_tokens_output | Cumulative across every agent call. |
| cache_creation_input_tokens / cache_read_input_tokens | Prompt-cache visibility. |
| estimated_cost_usd | Sum across all calls, at billed rates. |
| total_wall_clock_ms | Real wall time, including parallelism. |
| agent_retries | Count of agents that had to redo work. |
| secret_leaks_detected | Hits from the secret-scan integration step. |

What counts as a "good" run depends on your ProblemSpec's acceptance criteria; the Benchmarks page lists measured tests_pass and cost figures across the canonical problem set.
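
Because the report is plain JSON, a programmatic gate is easy to script. A minimal sketch of one, assuming a local eval_report.json; the 0.9 threshold is a placeholder, not a recommended value, so substitute whatever your acceptance criteria demand:

```python
import json

with open("eval_report.json") as f:
    report = json.load(f)

# Hard failures: anything nonzero (or nonempty) here deserves attention.
for field in ("compile_errors", "secret_leaks_detected",
              "cross_module_dependency_violations"):
    if report.get(field):
        print(f"FAIL: {field} = {report[field]}")

# Soft gate: 0.9 is an illustrative placeholder threshold.
if report.get("tests_pass", 0.0) < 0.9:
    print(f"WARN: tests_pass = {report['tests_pass']}")

print(f"cost: ${report.get('estimated_cost_usd', 0):.2f}, "
      f"wall clock: {report.get('total_wall_clock_ms', 0)} ms")
```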

SUMMARY.md: the cross-problem report

Aggregates the eval_report.json from every problem in the run, plus per-tier latency and cache-hit ratios. Useful when you've run a sweep across problems or replicates.

architecture.squib: what the architect decided

The ArchitectureSpec the architect emitted, before any class implementation runs. If a class belongs to the wrong module, your required_bounded_contexts may be too coarse; split them.

See Squib grammar for how to read it.

When something looks wrong

  1. Read eval_report.json first. Most violations are caught and retried automatically; persistent violations are logged and cause a graceful exit.
  2. Read architecture.squib second. If the architect produced a module structure that doesn't match your domain, edit the ProblemSpec; the spec is the lever the architect responds to.
  3. Use --deterministic to pin all tiers to temperature=0 and route every LLM call through the response cache; identical inputs return cached outputs on later runs.
  4. Use --replicates 5 to surface mean ± stddev of tests_pass if you suspect stochastic drift; the sketch after this list shows one way to recompute those statistics from the per-replicate reports.
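
If you would rather recompute the replicate statistics yourself, they fall out of the per-replicate reports directly. A minimal sketch, assuming each replicate writes its own eval_report.json; the runs/replicate-*/ glob is an assumed layout, not a documented one:

```python
import json
import statistics
from pathlib import Path

# Assumed layout: one eval_report.json per replicate directory.
scores = []
for path in sorted(Path("runs").glob("replicate-*/eval_report.json")):
    with open(path) as f:
        scores.append(json.load(f)["tests_pass"])

if len(scores) >= 2:
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    print(f"tests_pass: {mean:.3f} +/- {stdev:.3f} over {len(scores)} replicates")
else:
    print("need at least two replicates for a stddev")
```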

What to do next