Your first ProblemSpec
A walkthrough of authoring a ProblemSpec JSON, with worked examples and anti-patterns.
The minimum viable spec

    {
      "id": "MY_APP",
      "tier": 1,
      "slug": "my_app",
      "description": "One paragraph: what this service does, who consumes it, what bounded contexts it spans.",
      "required_bounded_contexts": ["ContextA", "ContextB"],
      "acceptance_criteria": [
        "Given <input>, When <verb> is called, Then result is <output>",
        "Given <bad input>, When <verb> is called, Then an error is raised"
      ],
      "expected_module_count": [2, 4],
      "expected_class_count": [8, 16],
      "required_patterns": ["Entity", "ValueObject", "UseCase", "Repository"],
      "target_language": "python"
    }

Required fields: `id`, `description`, `acceptance_criteria`, `required_patterns`, and `target_language`, plus `tier` and `slug`, which give the spec its eval-harness-friendly file path.
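A quick pre-flight check can catch a missing field before a run starts. The sketch below is not the framework's own validator: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the field list is simply read off the sentence above.

```python
import json

# Field names are taken from the prose above; treating exactly these as
# required is an assumption, not the framework's actual validation logic.
REQUIRED_FIELDS = (
    "id", "tier", "slug", "description",
    "acceptance_criteria", "required_patterns", "target_language",
)


def missing_fields(spec: dict) -> list[str]:
    """Return the required field names that are absent or empty in spec."""
    return [name for name in REQUIRED_FIELDS if spec.get(name) in (None, "", [])]


# A trimmed copy of the minimum viable spec, inlined so the sketch runs as-is.
spec = json.loads("""
{
  "id": "MY_APP",
  "tier": 1,
  "slug": "my_app",
  "description": "One paragraph describing the service.",
  "acceptance_criteria": ["Given <input>, When <verb> is called, Then result is <output>"],
  "required_patterns": ["Entity", "ValueObject", "UseCase", "Repository"],
  "target_language": "python"
}
""")

print(missing_fields(spec))                # nothing missing
print(missing_fields({"id": "MY_APP"}))    # everything except id
```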
Best practices
1. Acceptance criteria are Gherkin-shaped
Every criterion has the form `Given <state>, When <verb> is called, Then <expectation>`. The verb is what the architect uses to decide which class owns the method.
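The shape can be checked mechanically before a spec is submitted. The following is a minimal sketch, assuming the verb is a single identifier such as `sign_up`; the regular expression and the `parse_criterion` helper are illustrative and not part of the framework.

```python
import re

# Assumed shape: "Given <state>, When <verb> is called, Then <expectation>".
# Requiring the verb to be a single identifier is an assumption of this sketch.
CRITERION_SHAPE = re.compile(
    r"^Given (?P<state>.+?), "
    r"When (?P<verb>[A-Za-z_]\w*) is called, "
    r"Then (?P<expectation>.+)$"
)


def parse_criterion(text: str):
    """Return the state/verb/expectation parts, or None if the shape is wrong."""
    match = CRITERION_SHAPE.match(text)
    return match.groupdict() if match else None


well_formed = (
    "Given a user 'alice' and password 'pw1234567', "
    "When sign_up is called, Then result is a User"
)
print(parse_criterion(well_formed)["verb"])            # sign_up
print(parse_criterion("The user can sign up"))         # None
```

A criterion that parses exposes its verb directly, which is the piece the architect needs to assign the method to a class.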
- Good: `"Given a user 'alice' and password 'pw1234567', When sign_up is called, Then result is a User"`. Here `sign_up` becomes a method on a class.
- Bad: `"The user can sign up"`. No verb, no expectation, so the architect can't generate tests.
- Bad: `"sign_up('alice', 'pw1234567') returns a User"`. Too implementation-specific; the architect should choose argument shapes.
2. Use required_bounded_contexts to drive module decomposition
The architect uses these names verbatim as MODULE names. Pick names that map to single-responsibility business concepts: Auth, Posts, Timeline, Inventory, Billing. Avoid Database, API, Logic — those are layers, not contexts.
3. Declare infrastructure_choices whenever you know the SDK
If you know your service uses Kafka + S3, declare it:

    "infrastructure_choices": [
      {"category": "message_queue_producer", "technology": "kafka", "version_pin": "confluent-kafka==2.5"},
      {"category": "blob_storage", "technology": "s3", "version_pin": "boto3==1.34"}
    ]

This routes the matching classes to Tier C atomic agents, which generate adapter code with real SDK calls instead of stubs. Without explicit choices, you can opt into MCDA-driven selection with `--infer-infrastructure`, but this is exploratory; production specs should declare their choices explicitly.
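Because a malformed entry is easy to miss in a long spec, each choice can be sanity-checked before a run. This is a sketch under two assumptions drawn only from the example above: that every entry carries `category`, `technology`, and `version_pin`, and that pins use the pip-style `name==version` form. The `check_choice` helper is hypothetical.

```python
import re

# Keys are taken from the example entries above; whether the framework
# requires all three is an assumption of this sketch.
REQUIRED_KEYS = {"category", "technology", "version_pin"}

# Assumed pin shape: pip-style "package==version", as in the example entries.
PIN_SHAPE = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9.]+$")


def check_choice(choice: dict) -> list[str]:
    """Return human-readable problems found in one infrastructure choice."""
    problems = []
    missing = REQUIRED_KEYS - set(choice)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    pin = choice.get("version_pin", "")
    if pin and not PIN_SHAPE.match(pin):
        problems.append(f"version_pin {pin!r} is not of the form name==version")
    return problems


choices = [
    {"category": "message_queue_producer", "technology": "kafka",
     "version_pin": "confluent-kafka==2.5"},
    {"category": "blob_storage", "technology": "s3", "version_pin": "boto3"},
]
for choice in choices:
    print(choice["technology"], check_choice(choice))
```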
4. Use domain_conventions to encode common semantics
See Domain conventions — common social/e-commerce/auth semantics that the architect would otherwise have to re-derive.
5. Cross-service contracts
If you're authoring a multi-service distributed system, declare contracts via produces_contracts / consumes_contracts. See Cross-service contracts.
6. Mark sensitive fields with data_classification
See Data classification — sensitivity tags ground the SecurityArchitect's concerns in declared sensitivity rather than name-guessing.
Anti-patterns
| Don't | Do |
|---|---|
| Use criteria like `"The system handles errors gracefully"` | `"Given an empty body, When parse is called, Then an error is raised"` |
| Combine multiple verbs in one criterion | One criterion per verb-shaped behavior |
| Specify implementation details (SQL, JSON parsing inline) | Specify behavior contracts; let the architect decompose |
| Set `expected_class_count: [50, 100]` for a Calculator | Let the count match the actual decomposition; over-specifying triggers verb-not-in-spec stubs |
| Use `expected_module_count: [1, 1]` for distributed systems | Multi-context problems span 4–9 modules |
| Forget `target_language` | Required; the framework can't infer it |
Iterating
If a generated run looks wrong:
- Read `eval_report.json` for `tests_pass`, `architecture_violations`, `cross_module_dependency_violations`, `http_convention_violations`. Most violations are caught and retried automatically; persistent violations are logged and cause a graceful exit.
- Read `architecture.squib` to see what the architect produced. If a class belongs to the wrong module, your `required_bounded_contexts` may be too coarse; split them.
- Use `--deterministic` to pin all tiers to `temperature=0` and route every LLM call through the response cache; identical inputs return cached outputs on later runs.
- Use `--replicates 5` to surface mean ± stddev of `tests_pass` if you suspect stochastic drift.
See Verify a run for the full read-the-output guide.
Worked example: Twitter clone
Spec is at examples/twitter_clone/twitter_problem.json. Notable decisions:
- `required_bounded_contexts: ["Auth", "Posts", "Timeline"]`: three contexts; the architect produces ~6-9 modules across them.
- `domain_conventions: ["timeline_includes_self", "follow_asymmetric"]`: without these, the architect would produce a "tweets-by-followees only" timeline that excludes the user's own posts (real Twitter includes them).
- `query_semantics: [{"use_case": "GetTimelineUseCase", "shape": "self_plus_followees"}]`: the architect picks a `find_by_authors([self_id, ...followee_ids])` repository method.
- `data_classification: [{"field_ref": "User.password_hash", "sensitivity": "credential"}]`: the SecurityArchitect's concerns are grounded here.
The spec yields a working Flask app with port/adapter discipline preserved. Run-level cost and ACS figures are on the Benchmarks page.
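To make the `self_plus_followees` shape concrete, here is a minimal in-memory sketch of the query it implies. This is not the generated code: only `find_by_authors` is named in the text, while `Post`, `InMemoryPostRepository`, and `get_timeline` are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author_id: str
    text: str


class InMemoryPostRepository:
    """Hypothetical adapter; a generated repository would sit behind a port."""

    def __init__(self, posts):
        self._posts = list(posts)

    def find_by_authors(self, author_ids):
        """Return posts written by any of the given authors, in stored order."""
        wanted = set(author_ids)
        return [post for post in self._posts if post.author_id in wanted]


def get_timeline(repo, self_id, followee_ids):
    # self_plus_followees: the user's own id is queried alongside the followees,
    # which is what the timeline_includes_self convention asks for.
    return repo.find_by_authors([self_id, *followee_ids])


repo = InMemoryPostRepository([
    Post("alice", "my own post"),
    Post("bob", "a post from someone alice follows"),
    Post("carol", "a post from someone alice does not follow"),
])
timeline = get_timeline(repo, "alice", ["bob"])
print([post.author_id for post in timeline])  # ['alice', 'bob']
```

Dropping `self_id` from the list reproduces the "tweets-by-followees only" timeline that the convention is meant to prevent.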
See also
- Domain conventions — the convention registry.
- Data classification — sensitivity tags.
- Cross-service contracts — distributed-systems specs.
- Squib grammar — what the architect emits.
- `examples/` in the repo: three runnable sample ProblemSpecs.