This benchmark measures how well language models can generate valid plumbing programs — typed dataflow graphs in the plumb language — from natural-language descriptions. It tests the model's ability to produce syntactically correct, type-safe, structurally sound code in a domain-specific language that will not appear in any model's training data.
Plumbing is a typed composition language for wiring AI agents into pipelines. A plumbing program declares:

- record types, e.g. `type Review = { draft: string, score: int, feedback: string }`
- agents, e.g. `let writer : !string -> !Review = agent { ... }`
- a plumb binding that wires agents together using sequential composition (`;`), fan-out, merge, filters, and field projections

The `!T` notation denotes a stream of messages of type `T`. The type checker enforces that connected ports have compatible types.
Each trial runs a plumbing pipeline that contains a single agent called `generator`. This agent receives the scenario prompt as input and produces text output. The agent's system prompt comprises three documents, concatenated in order:

- the plumbing grammar specification
- explanatory documentation of the language
- output-format instructions (requiring a fenced plumbing code block) and guidance on writing correct programs

This is a one-shot generation benchmark. The model receives the system prompt and a single user message (the scenario description), and must produce a complete, correct plumbing program in its response. There is no multi-turn interaction, no iterative refinement, and no tool use.
Models are selected via environment variables:
- `PLUMB_PROVIDER` — the API provider (`anthropic` or `openai`)
- `PLUMB_MODEL` — the model identifier
- `PLUMB_ENDPOINT` — optional endpoint override (used for OpenRouter)

Anthropic and OpenAI models are called via their respective APIs directly. All other models are accessed via OpenRouter, which proxies requests to the original vendor endpoints (e.g. Mistral models are served by Mistral, Qwen by Alibaba, GLM by Zhipu). OpenRouter is configured to route to the vendor's own inference infrastructure where available, so accuracy results should be representative of the underlying models. However, timing figures for OpenRouter models are not comparable to those of direct-API models, or to each other: they include OpenRouter's routing overhead, and the underlying serving infrastructure (hardware, quantisation, batching) varies by vendor.
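The env-var contract can be sketched as follows. The three `PLUMB_*` variable names come from the benchmark; the function name and the shape of the returned config are illustrative assumptions, not the actual harness code:

```python
import os

def resolve_model_config(env=None):
    """Read the benchmark's model-selection environment variables.

    Sketch only: variable names are from the benchmark docs, but the
    returned dict shape is an assumption.
    """
    env = dict(os.environ) if env is None else env
    return {
        "provider": env["PLUMB_PROVIDER"],  # "anthropic" or "openai"
        "model": env["PLUMB_MODEL"],
        # Set for OpenRouter models; None means the provider's default
        # API endpoint is used directly.
        "endpoint": env.get("PLUMB_ENDPOINT"),
    }
```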
Each scenario runs 10 independent trials per model; each trial invokes the generator pipeline on the scenario description and scores the model's single response.
Trials within a scenario run concurrently. Results are aggregated with bootstrap confidence intervals (2000 resamples, 95% CI).
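The interval computation is a standard percentile bootstrap over the binary trial outcomes. A minimal sketch, matching the stated parameters (2000 resamples, 95% CI); the seed and index rounding are illustrative choices, not necessarily the harness's:

```python
import random

def bootstrap_ci(outcomes, resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a pass rate.

    `outcomes` is a list of 0/1 trial results.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, compute the mean each time, and sort.
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(resamples)
    )
    lo = means[int(resamples * alpha / 2)]
    hi = means[int(resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```

For example, `bootstrap_ci([1] * 7 + [0] * 3)` gives the interval for a 70% pass rate over 10 trials.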
The benchmark comprises four scenarios of increasing complexity. All share the same generator pipeline and system prompt; only the natural-language task description differs.
Connect an agent called `writer` to an agent called `editor`. The writer takes text input and produces text output. The editor takes text input and produces text output.
Tests: basic `let` declarations, agent syntax, sequential composition with `;`, correct plumb binding with input and output ports.
Expected: agents writer and editor; binding main.
Define a record type called `Article` with fields `title` (string) and `body` (string). Create two agents: `summariser` takes `!Article` input and produces `!string` output; `translator` takes `!Article` input and produces `!string` output. Wire the pipeline so that input fans out to both agents in parallel. Both agents' outputs merge into the pipeline output. The main binding should have type `!Article -> !string`.
Tests: custom record type declaration, fan-out (two chains from
input), merge (two chains to output), type annotation on the main
binding.
Expected: agents summariser and translator; binding main.
Define a record type `Review` with fields `draft` (string), `score` (int), and `feedback` (string). Create two agents: `writer` takes `!string` input and produces `!Review` output; `editor` takes `!Review` input and produces `!Review` output. Wire the pipeline: input goes to `writer`, then to `editor`. After `editor`, use `filter(score >= 80)` to pass only high-scoring reviews, then `.draft` to extract the draft field and send it to output. The main binding should have type `!string -> !string`.
Tests: record type, sequential agent chain, inline `filter()` with a comparison expression, field projection (`.draft`), type narrowing from `!Review` through projection back to `!string`.
Expected: agents writer and editor; binding main.
Define a record type `Draft` with fields `text` (string) and `approved` (bool). Create two agents: `composer` takes `!string` input and produces `!Draft` output; `checker` takes `!Draft` input and produces `!Draft` output. Wire a feedback loop: input goes to `composer`, then to `checker`. From `checker`, use `filter(approved = true)` then `.text` to send approved drafts to output. Use `filter(approved = false)` then `.text` to feed rejected drafts back to `composer`. The main binding should have type `!string -> !string`.
Tests: feedback cycle (output of checker routed back to
composer), complementary filters on a boolean field, field
projection on both branches, correct typing through the loop.
Expected: agents composer and checker; binding main.
Each trial produces five binary scores:
| Score | What it measures |
|---|---|
| plumbing_extraction | Could a plumbing program be extracted from the response? Tries plumbing-tagged fence first, then any fenced block, then raw text. |
| plumbing_parse | Does the extracted text parse as valid plumbing syntax? |
| plumbing_typecheck | Does the parsed program pass the type checker? |
| plumbing_properties | Does the program contain the expected agents and bindings? |
| plumbing_degenerate | Is the program non-trivial? (A program with zero agent bindings that still type-checks is degenerate.) |
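The extraction fallback chain from the first row can be sketched as follows (the regex details are assumptions, not the benchmark's actual implementation):

```python
import re

FENCE = "`" * 3  # triple-backtick fence marker, built to keep this example readable

def extract_plumbing(response: str) -> str:
    """Extraction cascade: plumbing-tagged fence, then any fence, then raw text."""
    tagged = re.search(FENCE + r"plumbing\s*\n(.*?)" + FENCE, response, re.DOTALL)
    if tagged:
        return tagged.group(1).strip()
    fenced = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, response, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # No fence at all: treat the whole response as the candidate program.
    return response.strip()
```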
A trial passes if all five scores are 1.0. The scores form a cascade: extraction failure prevents parse evaluation; parse failure prevents typecheck; and so on. In practice, extraction almost always succeeds (the model produces something), so the discriminating scores are parse and typecheck.
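The cascade can be expressed compactly. In this sketch the five checks arrive as booleans; in the real harness the parse and typecheck results come from the plumbing toolchain itself:

```python
def cascade_scores(extracted, parses, typechecks, has_properties, nontrivial):
    """Force every score after the first failing stage to 0.0."""
    stages = [
        ("plumbing_extraction", extracted),
        ("plumbing_parse", parses),
        ("plumbing_typecheck", typechecks),
        ("plumbing_properties", has_properties),
        ("plumbing_degenerate", nontrivial),
    ]
    scores, alive = {}, True
    for name, passed in stages:
        alive = alive and passed  # once a stage fails, all later stages fail
        scores[name] = 1.0 if alive else 0.0
    return scores

def trial_passes(scores):
    return all(v == 1.0 for v in scores.values())
```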
This benchmark is specifically designed to test one-shot structured code generation in an unfamiliar DSL. Because the plumbing language is bespoke, models cannot rely on memorised syntax from training data: they must induce the syntax and typing rules from the grammar specification and documentation supplied in the system prompt.
The scenarios are ordered by compositional complexity, from a simple two-agent sequential chain (S1) to a feedback loop with complementary filters and type narrowing (S4). This gradient reveals how models degrade as the required reasoning becomes more involved.
Twenty-five models were evaluated across four scenarios, producing 1000 valid trials. Six models achieved a perfect 40/40; the weakest managed just 1/40. The overall pass rate across all models and scenarios was 77.5% (775/1000).
| Model | S1 | S2 | S3 | S4 | Total |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 100% | 100% | 100% | 100% | 40/40 |
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 40/40 |
| Gemini 3 Flash | 100% | 100% | 100% | 100% | 40/40 |
| GLM-5 | 100% | 100% | 100% | 100% | 40/40 |
| GPT-5.4-mini | 100% | 100% | 100% | 100% | 40/40 |
| MiMo V2 Pro | 100% | 100% | 100% | 100% | 40/40 |
| DeepSeek V3.2 | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Gemini 3.1 Pro | 100% | 100% | 100% | 90% [70–100] | 39/40 |
| GLM-5 Turbo | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Qwen 3.5 Plus | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Gemini 3.1 Flash Lite | 100% | 100% | 90% [70–100] | 90% [70–100] | 38/40 |
| Haiku 4.5 | 100% | 100% | 70% [40–90] | 100% | 37/40 |
| MiniMax M2.7 | 100% | 90% [70–100] | 80% [50–100] | 80% [50–100] | 35/40 |
| Mistral Large 2512 | 60% [30–90] | 80% [50–100] | 100% | 100% | 34/40 |
| Kimi K2.5 | 70% [40–100] | 100% | 70% [40–90] | 90% [70–100] | 33/40 |
| MiMo V2 Flash | 100% | 80% [50–100] | 70% [40–100] | 80% [50–100] | 33/40 |
| Step 3.5 Flash | 90% [70–100] | 80% [50–100] | 90% [70–100] | 70% [40–100] | 33/40 |
| GPT-5.4-nano | 80% [50–100] | 90% [70–100] | 70% [40–100] | 60% [30–90] | 30/40 |
| Grok 4.20 Beta | 50% [20–80] | 70% [40–100] | 80% [50–100] | 70% [40–100] | 27/40 |
| Qwen 3.5 Flash | 70% [40–100] | 80% [50–100] | 50% [20–80] | 50% [20–80] | 25/40 |
| Devstral 2512 | 100% | 80% [50–100] | 40% [10–70] | 10% [0–30] | 23/40 |
| Qwen 3.5-9B | 70% [40–100] | 40% [10–70] | 40% [10–70] | 30% [0–60] | 18/40 |
| Mistral Small 2603 | 50% [20–80] | 10% [0–30] | 0% | 10% [0–30] | 7/40 |
| Nemotron 3 Super | 30% [10–60] | 10% [0–30] | 10% [0–30] | 0% | 5/40 |
| Ministral 14B | 0% | 0% | 10% [0–30] | 0% | 1/40 |
Percentages show the pass rate for 10 trials. Bracketed ranges are bootstrap 95% confidence intervals; models at 100% or 0% have no interval shown. Anthropic and OpenAI models were accessed via their direct APIs; all others via OpenRouter (see Model routing).
The five evaluation scores form a strict cascade. Across all 1000 trials:
| Stage | Pass rate |
|---|---|
| Extraction | 99.4% (994/1000) |
| Parse | 77.9% (779/1000) |
| Typecheck | 77.9% (779/1000) |
| Properties | 77.5% (775/1000) |
| Non-degenerate | 77.5% (775/1000) |
Extraction almost never fails — only 6 trials across three models (Nemotron 3 Super: 4, Qwen 3.5 Flash: 1, MiMo V2 Flash: 1) failed to produce any extractable plumbing text. The critical gate is parsing: programs that parse also typecheck (zero cases of parse success with typecheck failure), and programs that typecheck almost always have the correct structural properties (only 4 exceptions). This means the benchmark effectively measures whether the model can produce syntactically valid plumbing; if it can, type-correctness follows for free.
Per-scenario pass rates across all models confirm the intended difficulty ordering:
| Scenario | Pass rate | Description |
|---|---|---|
| S1 — Basic sequential | 82.8% (207/250) | Two agents, linear chain |
| S2 — Fan-out + custom type | 80.4% (201/250) | Record type, parallel branches |
| S3 — Filter + projection | 73.6% (184/250) | Inline filter, field extraction |
| S4 — Feedback loop | 73.2% (183/250) | Cycle with complementary filters |
S1 and S2 are relatively close; the step down to S3/S4 is where filter syntax and feedback wiring introduce additional complexity. Scenario 3 is the most discriminating among top-tier models: four models that achieve 100% on S1, S2, and S4 drop below 100% only on S3 (Haiku 4.5 at 70%; DeepSeek V3.2, Qwen 3.5 Plus, and GLM-5 Turbo at 90%).
The results cluster into five natural tiers:
Tier 1 — Perfect (100%, 6 models): Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3 Flash, GLM-5, GPT-5.4-mini, MiMo V2 Pro. These models never produced an invalid program across 40 trials each.
Tier 2 — Near-perfect (95–97.5%, 5 models): DeepSeek V3.2, Gemini 3.1 Flash Lite, Gemini 3.1 Pro, Qwen 3.5 Plus, GLM-5 Turbo. One or two failures, typically on S3 or S4.
Tier 3 — Strong (82.5–92.5%, 6 models): Haiku 4.5, MiniMax M2.7, Mistral Large 2512, Kimi K2.5, MiMo V2 Flash, Step 3.5 Flash. Reliable on simple scenarios but showing measurable degradation on harder ones.
Tier 4 — Moderate (62.5–75%, 3 models): GPT-5.4-nano, Grok 4.20 Beta, Qwen 3.5 Flash. Producing valid programs more often than not, but with wide confidence intervals.
Tier 5 — Weak (<60%, 5 models): Devstral 2512, Qwen 3.5-9B, Mistral Small 2603, Nemotron 3 Super, Ministral 14B. Failing on the majority of trials.
Every model family with multiple entries showed clear capability scaling from smaller to larger variants:
| Family | Scaling |
|---|---|
| Claude | Haiku 92.5% → Sonnet 100% → Opus 100% |
| GPT | nano 75% → mini 100% |
| Gemini | Flash Lite 95% → Flash 100%, Pro 97.5% |
| Qwen 3.5 | 9B 45% → Flash 62.5% → Plus 97.5% |
| Mistral | Ministral 14B 2.5% → Small 17.5% → Devstral 57.5% → Large 85% |
| GLM | Turbo 97.5% → Full 100% |
| MiMo | Flash 82.5% → Pro 100% |
The Qwen and Mistral families show the steepest gradients — from near total failure at the small end to near-perfect or strong results at the top. The Gemini family is notable for high performance even at the "lite" tier.
Most models degrade monotonically from S1 to S4, but three models show a clearly inverted pattern — performing worse on the supposedly easiest scenario than on harder ones:
| Model | S1 | S3 | S4 |
|---|---|---|---|
| Mistral Large 2512 | 60% | 100% | 100% |
| Grok 4.20 Beta | 50% | 80% | 70% |
| Kimi K2.5 | 70% | 70% | 90% |
Mistral Large is the most striking: perfect on S3 and S4 but only 60% on the basic sequential chain. The reason is unclear — it may relate to how these models handle underspecified prompts (S1 gives less explicit type guidance than the later scenarios), or it may reflect prompt-format sensitivities routed through OpenRouter.
The core finding is encouraging: the majority of current frontier models can reliably generate valid programs in an unfamiliar domain-specific language from a single prompt, given a grammar specification and explanatory documentation. Six out of twenty-five models achieved a perfect score, and eleven scored 95% or above.
The benchmark discriminates effectively between model capabilities. The scenario difficulty gradient works as intended, with filter/projection syntax (S3) and feedback loops (S4) separating the top-tier models from the rest. The failure mode is almost exclusively at the parse stage: models either produce syntactically valid, type-correct, structurally sound programs, or they produce text that fails to parse. There is essentially no middle ground of "almost right" programs that parse but fail later checks.
This is a one-shot benchmark with no opportunity for self-correction. Allowing models to validate their output against the type checker (tool-assisted generation) is a natural extension that could substantially improve results for models in tiers 3–5.