
Determinism + caching

This page documents the framework's caching and sampling policy. Repeat runs of the same input return cached LLM responses; first-call outputs depend on the LLM API's runtime behavior.

Tier policies

| Tier      | Default model     | Temperature | Notes |
|-----------|-------------------|-------------|-------|
| Architect | claude-sonnet-4-6 | 0           | One call per run. |
| Manager   | claude-sonnet-4-6 | 0           | TestArchitect, SecurityArchitect, InfrastructureChoiceArchitect. |
| ICP       | claude-haiku-4-5  | 0.2         | Lightly sampled. |
| Fixer     | claude-sonnet-4-6 | 0           | Single retry on a failed test. |

--deterministic pins every tier to temperature=0 and ensures all calls route through the response cache.

Seed values are passed through the framework's gateway abstraction, but the Anthropic API does not currently accept a seed parameter; the gateway logs a debug message and drops the seed at the wire. Seeds are retained in the data model for forward compatibility and for the framework's own deterministic stages (architecture serialization, MCDA scoring), which do not invoke the API.

Why temperature=0 alone isn't enough

temperature=0 does not guarantee bit-exact output from the Anthropic API. Floating-point non-associativity in batched inference, server-side sharding, and silent model-version drift can each cause identical inputs to produce different responses.

The framework wraps every gateway in a content-addressed disk cache (CachingLLMGateway), keyed on:

SHA-256(model + system_prompt + user_prompt + temperature + replicate_id)
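The key computation can be sketched like this (the field order matches the formula above, but the exact serialization and separator are assumptions):

```python
import hashlib

def cache_key(model: str, system_prompt: str, user_prompt: str,
              temperature: float, replicate_id: int) -> str:
    """Content-addressed key: any change to any input yields a new key."""
    # Hypothetical separator; the framework's real serialization may differ.
    payload = "\x1f".join(
        [model, system_prompt, user_prompt, repr(temperature), str(replicate_id)]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because replicate_id is part of the key, replicates with different IDs never collide in the cache even when their prompts are identical.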

Behavior:

  • Cache hit — the stored response is loaded from disk and returned byte-for-byte; cost and duration on the response are zeroed; cache_hit=True is set.
  • Cache miss — the API call runs, the response is stored at meta-evaluation-results/cache/<sha256>.json, and the framework proceeds.
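A minimal sketch of that hit/miss flow (the class name and cache directory come from this page; the inner-gateway interface and response fields are assumptions):

```python
import json
from pathlib import Path

class CachingLLMGateway:
    """Wraps an inner gateway with a content-addressed disk cache (sketch)."""

    def __init__(self, inner, cache_dir="meta-evaluation-results/cache"):
        self.inner = inner
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def call(self, key: str, request: dict) -> dict:
        path = self.cache_dir / f"{key}.json"
        if path.exists():
            # Cache hit: replay the stored response, zero cost/duration.
            response = json.loads(path.read_text())
            response.update(cost_usd=0.0, duration_s=0.0, cache_hit=True)
            return response
        # Cache miss: run the real call and persist the response.
        response = self.inner.call(request)
        path.write_text(json.dumps(response))
        return response
```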

Replay stability is therefore a memoization property of the disk cache, not a model-determinism property. Identical inputs against a populated cache return identical outputs. Identical inputs against a cold cache may return different outputs.

Anthropic prompt cache

Separate from the framework's response cache, --prompt-cache (default on) attaches cache_control: {"type": "ephemeral"} to the stable prefix of every call:

  • The system prompt.
  • The agent-spec body (frozen until the framework's spec-library version bumps).
  • Per-run constants (problem description, language profile).

The variable suffix (focal class spec, sibling interfaces) stays uncached. This reduces token cost on Anthropic's side; SUMMARY.md reports per-tier cache-hit ratio and cache_savings_usd.
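Constructing such a payload might look like the following (the cache_control block shape is Anthropic's documented ephemeral marker; the function name and exact prefix/suffix split are assumptions):

```python
def build_messages_payload(system_prompt: str, spec_body: str,
                           run_constants: str, focal_suffix: str) -> dict:
    """Mark the stable prefix cacheable; leave the variable suffix uncached."""
    cached = {"cache_control": {"type": "ephemeral"}}
    return {
        "system": [
            # Stable prefix: reused across calls, cached on Anthropic's side.
            {"type": "text", "text": system_prompt, **cached},
        ],
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": spec_body, **cached},
                {"type": "text", "text": run_constants, **cached},
                # Variable suffix: focal class spec, sibling interfaces.
                {"type": "text", "text": focal_suffix},
            ],
        }],
    }
```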

Cross-cutting concerns

  • Cost budget. --max-cost-usd <N> triggers a graceful exit, writing BUDGET_EXIT.txt and a partial-results report.
  • Resumable runs. --resume <run_dir> re-attaches a crashed run via per-stage CHECKPOINT.json.
  • Replicates. --replicates N runs N seeds and reports mean ± stddev across runs.
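The replicate aggregation amounts to a simple summary statistic (a sketch; whether the framework uses the sample or population standard deviation is not stated here, and this uses the sample form):

```python
import statistics

def summarize_replicates(scores: list[float]) -> tuple[float, float]:
    """Mean and stddev across N replicate runs, as reported in SUMMARY.md."""
    mean = statistics.mean(scores)
    stddev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stddev
```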

See also