# Roadmap
Public, milestone-level. Last updated 2026-04-29.
## Shipped
| Milestone | What it does |
|---|---|
| A Measurement Foundation | Per-agent unit evals, replicate runs, regression detection, deterministic mode, cache visibility. |
| B Spec Architecture Cleanup | Shared specs + per-language profiles, prompt caching with cache_control, structured outputs. |
| C Multi-Module Architecture | Multi-MODULE Squib with cross-module DAG validation, per-module worktrees, layered output paths. |
| D Convergent ICP Optimization | DSPy POC closed with INCONCLUSIVE verdict on Haiku 4.5; hand-written ICP specs remain authoritative. |
| E Reliability, Cost, Security | Graceful agent failure, retry policy, cost budget, rate limiting, secret scan, SAST, reproducibility manifest. |
| F Language & Domain Coverage | Go and Rust profiles, sample-domain library (P5 OAuth2), user-supplied ProblemSpec, custom-pattern hook, richer ProblemSpec schema. |
| G Productionization | CI workflow, Dockerfile, JSON logger, latency/cost percentiles, resumable runs, history dashboard, versioned spec library. |
| H Generalized Infrastructure Layer | 60 Tier C ICPs (15 categories × 4 languages) with TechSpec catalog (~130 bundled snapshots), MCDA-driven choice selection, MCP + web-fetch resolver chain with anti-poisoning. |
| K Cross-language end-to-end gaps | Polymorphic class-parser, dependency installer, HTTP-conventions validator, per-module criterion filtering, JS/TS Tier C parity, registry-driven dispatch. Open-source launch blockers closed. |
## In progress
- Architectural Complexity Score (ACS) — composite metric for normalizing cost/velocity across heterogeneous problems. Implemented; calibrating across the canonical problem set.
## Planned (post-launch)
- Milestone I — SystemSpec for distributed services. Today, multi-service distributed systems require running each service as a separate ProblemSpec; cross-service contract fidelity is enforced via the registry. SystemSpec will let one declaration cover topology + services + resources together, generating all service codebases in one run.
- Anthropic-only abstraction. The `LLMGateway` port is multi-provider-ready; concrete adapters for OpenAI / Bedrock / local Llama are post-launch work. PRs welcome.
- Hosted dashboard service. A multi-user `meta-evaluation-results/` analysis service. Currently the dashboard is per-user static HTML.
- Versioned spec library at v1.0. The spec library is currently `0.1.0`. Tag a stable v1.0 once the catalog stabilizes after community feedback.
- Reduce architect HTTP-type drift. The validator catches it with retry; long-term we want the architect to never need a retry on this class of constraint.
- Per-language Tier C maturity. Today Java/Go/Rust/JS/TS report `tests_pass=0.00` in our event-pipeline benchmark because per-language test runners report zero on toolchain-availability fallback. Closing this requires CI-environment toolchain pinning plus occasionally tightening the language-specific code-emit rules.
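To make the Milestone I idea above concrete, a SystemSpec might bundle topology, services, and shared resources into one declaration. This is a hypothetical sketch only: the field names (`topology`, `services`, `resources`) and types are illustrative assumptions, not a committed schema.

```python
# Hypothetical SystemSpec sketch -- field names are illustrative,
# not a committed schema for the framework.
from dataclasses import dataclass


@dataclass
class ServiceDecl:
    name: str
    problem_spec: str  # path to an existing per-service ProblemSpec
    language: str      # e.g. "go", "rust"


@dataclass
class SystemSpec:
    topology: dict       # service -> downstream services it calls
    services: list       # one ServiceDecl per generated codebase
    resources: list      # shared infrastructure, e.g. queues, databases


spec = SystemSpec(
    topology={"orders": ["inventory"]},
    services=[
        ServiceDecl("orders", "specs/orders.yaml", "go"),
        ServiceDecl("inventory", "specs/inventory.yaml", "rust"),
    ],
    resources=["events-queue"],
)

# The cross-service edges in `topology` carry the contract information the
# registry enforces today, but resolvable within a single generation run.
for caller, callees in spec.topology.items():
    print(caller, "->", callees)
```

The point of the sketch is the shape, not the names: one declaration, many service codebases, with cross-service contracts derivable from the topology instead of being checked after the fact.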
## Open RFCs
The 12 open design questions in `docs/infrastructure_layer_design.md` §10 are the primary RFC seed material. Highlights:
- Build-time vs eval-time TechSpec resolution
- MCDA weights — problem-specific or framework-default
- Same category supporting concurrent technologies in one project
- SDK breaking changes between bundled and live-fetched
- TechSpecs language-specific or shared
## What we won't do
- Generate Dockerfiles, Kubernetes manifests, Terraform. Out of scope per the design doc — the framework generates code; provisioning is operator responsibility.
- Generate frontend UIs. UI codegen has different constraints (visual fidelity, design systems) that don't map cleanly onto Clean Architecture's port/adapter discipline. We'd ship a separate tool.
- Do domain inference. "You're building a Twitter clone, so timelines should include self" is exactly what the framework refuses to assume. Domain conventions go in the ProblemSpec; the framework doesn't guess.
## Versioning
The framework follows semver. v0.x is pre-launch; v1.0 ships when:
- All six languages have `tests_pass > 0` on the canonical event-pipeline benchmark.
- Spec library is tagged + frozen.
- CI green from a fresh clone with zero env-specific assumptions.
- ≥3 external users have shipped real apps generated by the framework.