Verify a run

A successful generation writes three artifacts you should check before trusting the output: eval_report.json (the metrics), SUMMARY.md (the cross-problem report), and architecture.squib (what the architect actually decided).
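
As a quick first pass, confirm all three artifacts actually exist before reading any of them. A minimal sketch, assuming the run writes its artifacts into a single output directory; the runs/latest path is an illustrative assumption, not a documented default:

```python
from pathlib import Path

# Hypothetical run directory -- substitute wherever your run wrote its output.
run_dir = Path("runs/latest")

for name in ("eval_report.json", "SUMMARY.md", "architecture.squib"):
    artifact = run_dir / name
    print(f"{name}: {'ok' if artifact.is_file() else 'MISSING'}")
```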

eval_report.json: the metrics bundle

The single source of truth for what happened. Key fields:

| Field | Meaning |
| --- | --- |
| tests_pass | Fraction of declared acceptance criteria covered by passing tests, in [0, 1]. |
| architecture_violations | Dependency-rule, SOLID, and granularity violations against the generated code. |
| cross_module_dependency_violations | Cross-module dependencies not declared in the target's EXPORTS. |
| http_convention_violations | HTTP headers typed as the wrong shape (caught and retried automatically). |
| compile_errors | Per-language compile/typecheck errors against the generated source. |
| total_tokens_input / total_tokens_output | Cumulative across every agent call. |
| cache_creation_input_tokens / cache_read_input_tokens | Prompt-cache visibility. |
| estimated_cost_usd | Sum across all calls, at billed rates. |
| total_wall_clock_ms | Real wall time, including parallelism. |
| agent_retries | Count of agents that had to redo work. |
| secret_leaks_detected | Hits from the secret-scan integration step. |

What counts as a "good" run depends on your ProblemSpec's acceptance criteria; the Benchmarks page lists measured tests_pass and cost figures across the canonical problem set.
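
Because the report is plain JSON, a programmatic gate is easy to script. A minimal sketch of one, assuming a local eval_report.json; the 0.9 threshold is a placeholder, not a recommended value, so substitute whatever your acceptance criteria demand:

```python
import json

with open("eval_report.json") as f:
    report = json.load(f)

# Hard failures: anything nonzero (or nonempty) here deserves attention.
for field in ("compile_errors", "secret_leaks_detected",
              "cross_module_dependency_violations"):
    if report.get(field):
        print(f"FAIL: {field} = {report[field]}")

# Soft gate: 0.9 is an illustrative placeholder threshold.
if report.get("tests_pass", 0.0) < 0.9:
    print(f"WARN: tests_pass = {report['tests_pass']}")

print(f"cost: ${report.get('estimated_cost_usd', 0):.2f}, "
      f"wall clock: {report.get('total_wall_clock_ms', 0)} ms")
```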

SUMMARY.md: the cross-problem report

Aggregates the eval_report.json from every problem in the run, plus per-tier latency and cache-hit ratios. Useful when you've run a sweep across problems or replicates.

architecture.squib: what the architect decided

The ArchitectureSpec the architect emitted, before any class implementation runs. If a class belongs to the wrong module, your required_bounded_contexts may be too coarse; split them.

See Squib grammar for how to read it.

When something looks wrong

  1. Read eval_report.json first. Most violations are caught and retried automatically; persistent violations are logged and cause a graceful exit.
  2. Read architecture.squib second. If the architect produced a module structure that doesn't match your domain, edit the ProblemSpec; the spec is the lever the architect responds to.
  3. Use --deterministic to pin all tiers to temperature=0 and route every LLM call through the response cache; identical inputs return cached outputs on later runs.
  4. Use --replicates 5 to surface mean ± stddev of tests_pass if you suspect stochastic drift; the sketch after this list shows one way to recompute those statistics from the per-replicate reports.
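
If you would rather recompute the replicate statistics yourself, they fall out of the per-replicate reports directly. A minimal sketch, assuming each replicate writes its own eval_report.json; the runs/replicate-*/ glob is an assumed layout, not a documented one:

```python
import json
import statistics
from pathlib import Path

# Assumed layout: one eval_report.json per replicate directory.
scores = []
for path in sorted(Path("runs").glob("replicate-*/eval_report.json")):
    with open(path) as f:
        scores.append(json.load(f)["tests_pass"])

if len(scores) >= 2:
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    print(f"tests_pass: {mean:.3f} +/- {stdev:.3f} over {len(scores)} replicates")
else:
    print("need at least two replicates for a stddev")
```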

What to do next