Verify a run¶
A successful generation writes three artifacts you should check before trusting the output: `eval_report.json` (the metrics), `SUMMARY.md` (the per-tier breakdown), and `architecture.squib` (what the architect actually decided).
eval_report.json: the metrics bundle¶
The single source of truth for what happened. Key fields:
| Field | Meaning |
|---|---|
| `tests_pass` | Fraction of declared acceptance criteria covered by passing tests, in [0, 1]. |
| `architecture_violations` | Dependency-rule + SOLID + granularity violations against the generated code. |
| `cross_module_dependency_violations` | Cross-module deps not declared in the target's EXPORTS. |
| `http_convention_violations` | HTTP headers typed as the wrong shape (caught and retried automatically). |
| `compile_errors` | Per-language compile/typecheck errors against the generated source. |
| `total_tokens_input` / `total_tokens_output` | Cumulative across every agent call. |
| `cache_creation_input_tokens` / `cache_read_input_tokens` | Prompt-cache visibility. |
| `estimated_cost_usd` | Sum across all calls, at billed rates. |
| `total_wall_clock_ms` | Real wall time, including parallelism. |
| `agent_retries` | Count of agents that had to redo work. |
| `secret_leaks_detected` | Hits from the secret-scan integration step. |
What counts as a "good" run depends on your ProblemSpec's acceptance criteria; the Benchmarks page lists measured `tests_pass` and cost figures across the canonical problem set.
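A quick way to act on the report is to load it and gate on the fields above. The following is a minimal sketch, not part of the tool: the gate thresholds are illustrative examples (derive yours from your own acceptance criteria), and the stand-in `report` dict stands in for the file the run actually writes.

```python
import json

def check_report(report: dict) -> list[str]:
    """Return the names of gate fields that exceed their limits."""
    # Gate limits are examples, not recommendations; derive yours from
    # your ProblemSpec's acceptance criteria.
    gates = {
        "architecture_violations": 0,
        "cross_module_dependency_violations": 0,
        "compile_errors": 0,
        "secret_leaks_detected": 0,
    }
    assert 0.0 <= report["tests_pass"] <= 1.0  # documented range
    return [name for name, limit in gates.items() if report[name] > limit]

# In a real run you would load the file the generator wrote, e.g.:
#   report = json.load(open("eval_report.json"))
report = {  # stand-in values for illustration
    "tests_pass": 0.93,
    "architecture_violations": 0,
    "cross_module_dependency_violations": 0,
    "compile_errors": 0,
    "secret_leaks_detected": 1,
}
print(check_report(report))  # -> ['secret_leaks_detected']
```

Gating in CI this way turns the metrics bundle into a pass/fail signal instead of something you eyeball per run.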
SUMMARY.md: the cross-problem report¶
Aggregates eval_report.json from every problem set in the run, plus per-tier latency and cache-hit ratios. Useful when you've run a sweep across problems or replicates.
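If you want to spot-check the aggregation yourself, the per-problem reports can be rolled up by hand. A sketch under stated assumptions: the `<run_dir>/<problem>/eval_report.json` layout here is hypothetical, so match the glob to wherever your run actually writes each report.

```python
import json
import statistics
from pathlib import Path

def summarize(run_dir: Path) -> None:
    """Print tests_pass and cost per problem, plus run-level rollups.

    Assumes one eval_report.json per problem subdirectory; this layout
    is illustrative, not necessarily the tool's actual one.
    """
    rows = []
    for path in sorted(run_dir.glob("*/eval_report.json")):
        r = json.loads(path.read_text())
        rows.append((path.parent.name, r["tests_pass"], r["estimated_cost_usd"]))
    for name, tp, cost in rows:
        print(f"{name:30s} tests_pass={tp:.2f} cost=${cost:.2f}")
    if rows:
        print(f"mean tests_pass={statistics.mean(tp for _, tp, _ in rows):.3f} "
              f"total cost=${sum(c for *_, c in rows):.2f}")
```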
architecture.squib: what the architect decided¶
The `ArchitectureSpec` the architect emitted, before any class implementation runs. If a class belongs to the wrong module, your `required_bounded_contexts` may be too coarse; split them.
See Squib grammar for how to read it.
When something looks wrong¶
- Read `eval_report.json` first. Most violations are caught and retried automatically; persistent violations are logged and cause a graceful exit.
- Read `architecture.squib` second. If the architect produced a module structure that doesn't match your domain, edit the `ProblemSpec`; the spec is the lever the architect responds to.
- Use `--deterministic` to pin all tiers to `temperature=0` and route every LLM call through the response cache; identical inputs return cached outputs on later runs.
- Use `--replicates 5` to surface mean ± stddev of `tests_pass` if you suspect stochastic drift.
What to do next¶
- Author your first ProblemSpec — what to edit when the architecture isn't what you expected.
- Architecture deep-dive — what each tier is responsible for.